Overview
Web scraping lets you extract structured data from websites. Browserbase provides reliable browser infrastructure that helps you build scrapers that can:
Scale without infrastructure management
Maintain consistent performance
Avoid bot detection and CAPTCHAs with Browserbase’s stealth mode
Provide debugging and monitoring tools with session replays and live views
This guide will help you get started with web scraping on Browserbase and highlight best practices.
Scraping a website
Using books.toscrape.com, a sample website designed for scraping practice, we’ll scrape the title, price, and other details of the books listed on the page.
Follow along: the step-by-step code for this web scraping example is shown below.
Code Example
import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";
import dotenv from "dotenv";

dotenv.config();

const stagehand = new Stagehand({
  env: "BROWSERBASE",
  verbose: 0,
});

async function scrapeBooks() {
  await stagehand.init();
  const page = stagehand.page;

  await page.goto("https://books.toscrape.com/");

  // Extract structured data from the page using a Zod schema
  const scrape = await page.extract({
    instruction: "Extract the books from the page",
    schema: z.object({
      books: z.array(
        z.object({
          title: z.string(),
          price: z.string(),
          image: z.string(),
          inStock: z.string(),
          link: z.string(),
        })
      ),
    }),
  });

  console.log(scrape.books);
  await stagehand.close();
  return scrape.books;
}

scrapeBooks().catch(console.error);
import os

from browserbase import Browserbase
from dotenv import load_dotenv
from playwright.sync_api import sync_playwright

load_dotenv()

def create_session():
    """Creates a Browserbase session."""
    bb = Browserbase(api_key=os.environ["BROWSERBASE_API_KEY"])
    session = bb.sessions.create(
        project_id=os.environ["BROWSERBASE_PROJECT_ID"],
        # Add configuration options here if needed
    )
    return session

def web_scrape():
    """Scrapes book data using Playwright with Browserbase."""
    session = create_session()
    print(f"View session replay at https://browserbase.com/sessions/{session.id}")

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(session.connect_url)

        # Get the default browser context and page
        context = browser.contexts[0]
        page = context.pages[0]

        # Navigate to the target page
        page.goto("https://books.toscrape.com/")

        # Extract the books from the page
        items = page.locator("article.product_pod")
        books = items.all()

        book_data_list = []
        for book in books:
            book_data = {
                "title": book.locator("h3 a").get_attribute("title"),
                "price": book.locator("p.price_color").text_content(),
                "image": book.locator("div.image_container img").get_attribute("src"),
                "inStock": book.locator("p.instock.availability").text_content().strip(),
                "link": book.locator("h3 a").get_attribute("href"),
            }
            book_data_list.append(book_data)

        print("Shutting down...")
        page.close()
        browser.close()

    return book_data_list

if __name__ == "__main__":
    books = web_scrape()
    print(books)
Example output
[
{
title: 'A Light in the Attic',
price: '£51.77',
image: 'https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg',
inStock: 'In stock',
link: 'catalogue/a-light-in-the-attic_1000/index.html'
},
...
]
Best Practices for Web Scraping
Follow these best practices to build reliable, efficient, and ethical web scrapers with Browserbase.
Ethical Scraping
Respect robots.txt: Check the website’s robots.txt file for crawling guidelines
Rate limiting: Implement reasonable delays between requests (2-5 seconds); a sketch of both practices follows this list
Terms of Service: Review the website’s terms of service before scraping
Data usage: Only collect and use data in accordance with the website’s policies
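For instance, here is a minimal sketch of the robots.txt check and randomized rate limiting, using Python’s standard-library urllib.robotparser; the catalogue URLs and the 2-5 second delay range are illustrative, not prescriptive:

import random
import time
from urllib.robotparser import RobotFileParser

BASE_URL = "https://books.toscrape.com"

# Fetch and parse the site's robots.txt before crawling
robots = RobotFileParser(f"{BASE_URL}/robots.txt")
robots.read()

urls = [f"{BASE_URL}/catalogue/page-{n}.html" for n in range(1, 4)]

for url in urls:
    if not robots.can_fetch("*", url):
        print(f"Skipping {url}: disallowed by robots.txt")
        continue
    print(f"Scraping {url}")
    # ... fetch and extract data here ...
    # Randomized 2-5 second delay between requests
    time.sleep(random.uniform(2, 5))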
Performance and Efficiency
Batch processing: Process multiple pages in batches with concurrent sessions
Selective scraping: Only extract the data you need
Resource management: Close browser sessions promptly after use
Connection reuse: Reuse browsers for sequential scraping tasks, as shown in the sketch after this list
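As an illustration of connection reuse and selective scraping, this sketch (which assumes the create_session helper from the Python example above) loads several catalogue pages over a single Browserbase session instead of opening one session per page:

from playwright.sync_api import sync_playwright

def scrape_titles(urls):
    """Reuses one Browserbase session for a batch of sequential page loads."""
    session = create_session()  # helper defined in the example above
    titles = []
    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(session.connect_url)
        page = browser.contexts[0].pages[0]
        for url in urls:
            page.goto(url)
            # Selective scraping: extract only the titles we need
            for link in page.locator("h3 a").all():
                titles.append(link.get_attribute("title"))
        # Resource management: close the browser promptly when done
        browser.close()
    return titles

pages = [f"https://books.toscrape.com/catalogue/page-{n}.html" for n in range(1, 4)]
print(scrape_titles(pages))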
Stealth and Anti-Bot Avoidance
Enable Browserbase Advanced Stealth mode: Helps avoid bot detection
Randomize behavior: Add variable delays between actions
Use proxies: Rotate IPs to distribute requests
Mimic human interaction: Add realistic mouse movements and delays
Handle CAPTCHAs: Enable Browserbase’s automatic CAPTCHA solving; a session-configuration sketch covering stealth, proxies, and CAPTCHA solving follows this list
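Putting these together, here is a hedged sketch of a stealth-oriented session configuration with the Python SDK. proxies=True enables Browserbase’s managed proxies; the browser_settings keys shown for Advanced Stealth and CAPTCHA solving are assumptions to verify against the current API reference:

import os
from browserbase import Browserbase

bb = Browserbase(api_key=os.environ["BROWSERBASE_API_KEY"])

session = bb.sessions.create(
    project_id=os.environ["BROWSERBASE_PROJECT_ID"],
    proxies=True,  # route traffic through Browserbase's managed proxies
    browser_settings={
        "advanced_stealth": True,  # assumed key for Advanced Stealth mode
        "solve_captchas": True,    # assumed key for automatic CAPTCHA solving
    },
)
print(f"Stealth session ready: {session.id}")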
Next Steps
Now that you understand the basics of web scraping with Browserbase, explore features such as Advanced Stealth, proxies, session replays, and live views in the rest of the documentation.