Extract data from websites using cloud browsers that handle JavaScript rendering, bot protection, and dynamic content. Browserbase gives you reliable infrastructure for data extraction workflows, whether you’re using Stagehand or Playwright.
  • Scale data extraction across concurrent sessions without managing infrastructure
  • Browse protected sites with Browserbase Verified
  • Rotate IPs and geolocations with proxies
  • Debug and monitor extraction runs with session recordings and live views
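Scaling across concurrent sessions usually means capping how many browsers run at once. Below is a minimal sketch of a concurrency-limited runner; `withConcurrency` is our own helper, not part of the Browserbase or Stagehand SDKs.

```typescript
// Minimal sketch of a bounded concurrency pool: run up to `limit`
// tasks at once. Each worker would typically open its own
// Browserbase session, extract, and close it.
async function withConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each runner pulls the next unclaimed item until none remain
  async function run(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, () => run()),
  );
  return results;
}
```

With this pattern, at most `limit` sessions are alive at any time, regardless of how many pages you need to process.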
Need scheduled or webhook-triggered data collection? Functions let you deploy data extraction workflows that can be invoked on demand or on a schedule, which makes them a good fit for data pipelines and monitoring workflows.

Template

Get started quickly with a ready-to-use data extraction template.

Company Value Prop Generator

Clone, configure, and run in minutes

Example: Extracting a book catalog

To demonstrate data extraction with Browserbase, this example pulls book titles, prices, and availability from a sample catalog site.

Code example

import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";
import dotenv from "dotenv";
dotenv.config();

// Run on Browserbase cloud browsers; requires BROWSERBASE_API_KEY and
// BROWSERBASE_PROJECT_ID in your environment (loaded via dotenv)
const stagehand = new Stagehand({
    env: "BROWSERBASE",
    verbose: 0,
});

async function scrapeBooks() {
    // Start the Browserbase session and grab the default page
    await stagehand.init();
    const page = stagehand.context.pages()[0];

    await page.goto("https://books.toscrape.com/");

    // Extract structured data; the zod schema tells Stagehand the exact
    // shape to return, so the result is typed and validated
    const scrape = await stagehand.extract({
        instruction: "Extract the books from the page",
        schema: z.object({
            books: z.array(z.object({
                title: z.string(),
                price: z.string(),
                image: z.string(),
                inStock: z.string(),
                link: z.string(),
            }))
        }),
    });

    console.log(scrape.books);

    // Close the session promptly to free the browser
    await stagehand.close();
}

scrapeBooks().catch(console.error);

Example output

[
  {
    title: 'A Light in the Attic',
    price: '£51.77',
    image: 'https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg',
    inStock: 'In stock',
    link: 'catalogue/a-light-in-the-attic_1000/index.html'
  },
  ...
]

Best practices for data extraction

Follow these best practices to build reliable, efficient, and ethical data extraction workflows with Browserbase.

Ethical data collection

  • Respect robots.txt: Check the website’s robots.txt file for crawling guidelines
  • Rate limiting: Implement reasonable delays between requests (2-5 seconds)
  • Terms of Service: Review the website’s terms of service before extracting data
  • Data usage: Only collect and use data in accordance with the website’s policies
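The rate-limiting guideline above can be sketched as a small delay helper with random jitter in the 2-5 second range. The names below are our own, not a Browserbase API.

```typescript
// Pick a random delay between minMs and maxMs (inclusive)
function randomDelayMs(minMs = 2000, maxMs = 5000): number {
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

// Wait a polite, randomized interval before the next request
async function politeDelay(minMs = 2000, maxMs = 5000): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, randomDelayMs(minMs, maxMs)));
}
```

Call `await politeDelay()` between page loads so request timing stays irregular and inside the 2-5 second window.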

Performance optimization

  • Batch processing: Process multiple pages in batches with concurrent sessions
  • Selective extraction: Only extract the data you need
  • Resource management: Close browser sessions promptly after use
  • Connection reuse: Reuse browsers for sequential extraction tasks
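Batch processing typically starts by splitting a URL list into fixed-size chunks, each of which can be handed to its own concurrent session. A generic sketch, with `chunk` as a hypothetical helper:

```typescript
// Split a list of work items (e.g. URLs) into batches of `size`
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Each batch can then be processed in parallel while batches themselves run sequentially, keeping total session count predictable.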

Protected sites

  • Enable Browserbase Verified: Recognized by bot protection partners
  • Randomize behavior: Add variable delays between actions
  • Use proxies: Rotate IPs to distribute requests
  • Mimic human interaction: Add realistic mouse movements and delays
  • Handle CAPTCHAs: Enable Browserbase’s automatic CAPTCHA solving
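One simple way to rotate IPs or geolocations is round-robin selection over a configured pool. This is a sketch with hypothetical names; the chosen value would be passed as proxy/geolocation settings when you create each Browserbase session.

```typescript
// Round-robin over a pool of options (e.g. proxy geolocations)
function makeRotator<T>(options: T[]): () => T {
  let i = 0;
  return () => options[i++ % options.length];
}

// Example pool of country codes to rotate through per session
const nextGeo = makeRotator(["US", "GB", "DE"]);
```

Round-robin keeps request volume evenly distributed across the pool; for stricter sites you might instead pick randomly to avoid a detectable rotation pattern.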

Next steps

Verified

Configure fingerprinting and CAPTCHA solving

Browser Contexts

Persist cookies and session data

Proxies

Configure IP rotation and geolocation

Browserbase Functions

Deploy data extraction workflows as cloud functions