Configure Scraping Browser

How to Configure Thordata's Scraping Browser

This article walks you through configuring and using the Thordata Scraping Browser: obtaining credentials, completing the basic setup, running a sample script, and managing live sessions.

Before You Begin

Please ensure you have your account credentials ready – that is, the username and password used for web automation tools.

You can view these credentials directly in the "Playground" tab within the Thordata Scraping Browser section. We assume you have obtained valid credentials. If not, please apply for them from Thordata.

Basic Configuration

Before using the Scraping Browser, complete the basic environment configuration. The steps are to set up your identity credentials, set the basic API parameters, and manage live browser sessions in the operations console.

Scraping Browser Quick Start Examples

We have prepared a series of scraping examples to help you get started quickly. You only need to replace the personal credentials and target URL in the script, and then adjust and extend it according to your actual business needs. For writing more complex scraping logic, please refer to the supported framework protocols in the official Thordata documentation.

You can debug scripts online in the "Playground" within the dashboard, or run actual scraping tasks from your local environment.

If you choose to run locally, please ensure the corresponding dependencies are installed (refer to Thordata's supported framework protocols). After correctly configuring your identity credentials, execute the sample script to obtain the target data.

import asyncio
from playwright.async_api import async_playwright

# Replace with the username and password shown in the "Playground" tab
AUTH = 'PROXY-FULL-ACCOUNT:PASSWORD'
SBR_WS_SERVER = f'wss://{AUTH}@ws-browser.thordata.com'

async def run(pw):
    print('Connecting to Browser API...')
    browser = await pw.chromium.connect_over_cdp(SBR_WS_SERVER)
    try:
        print('Connected! Navigating to target...')

        page = await browser.new_page()
        # Allow up to 2 minutes for the page to load (timeout is in milliseconds)
        await page.goto('https://example.com', timeout=2 * 60 * 1000)

        # Save a screenshot of the rendered page
        print('Taking a screenshot of the page...')
        await page.screenshot(path='./remote_screenshot_page.png')

        # Print the page's HTML content
        print('Scraping page content...')
        html = await page.content()
        print(html)

    finally:
        # Always close the browser so the remote session is released
        await browser.close()

async def main():
    async with async_playwright() as playwright:
        await run(playwright)

if __name__ == '__main__':
    asyncio.run(main())
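
If you prefer not to hard-code the AUTH value in the script, one option is to read the credentials from environment variables and build the WebSocket URL from them. The following is a minimal sketch, not an official Thordata helper: the variable names THORDATA_USERNAME and THORDATA_PASSWORD and both function names are hypothetical, and it assumes the endpoint accepts standard percent-encoded credentials in the URL.

```python
import os
from urllib.parse import quote

# Host taken from the sample script above
WS_HOST = 'ws-browser.thordata.com'

def build_ws_url(username, password):
    """Build the wss:// URL, percent-encoding the credentials.

    Encoding keeps characters such as ':' or '@' inside a password
    from being misread as URL delimiters.
    """
    user = quote(username, safe='')
    secret = quote(password, safe='')
    return f'wss://{user}:{secret}@{WS_HOST}'

def ws_url_from_env():
    """Read credentials from (hypothetical) environment variables."""
    try:
        username = os.environ['THORDATA_USERNAME']
        password = os.environ['THORDATA_PASSWORD']
    except KeyError as missing:
        raise RuntimeError(f'Set {missing.args[0]} before running the script') from None
    return build_ws_url(username, password)
```

In the sample script, SBR_WS_SERVER could then be assigned the result of ws_url_from_env() instead of a string containing the credentials.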
 

Browser API Initial Navigation

According to the Scraping Browser's session management mechanism, each session only allows one initial navigation – the operation that first loads the target website for data extraction. Once this session starting point is established, users can then navigate freely within that website via interactive actions like clicking and scrolling.

However, any scraping task that requires starting over from the initial navigation phase – whether targeting the same website or a different one – must be accomplished by creating a new session.
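
The one-navigation-per-session rule can be sketched as a loop that opens a fresh connection for every starting URL. This is an illustration under assumptions rather than a prescribed pattern: it assumes each connect_over_cdp call starts a new session (as the sample script suggests), and the helper names fetch_html and fetch_all are hypothetical.

```python
import asyncio

# Placeholder credentials, as in the sample script
SBR_WS_SERVER = 'wss://PROXY-FULL-ACCOUNT:PASSWORD@ws-browser.thordata.com'

async def fetch_html(pw, url):
    """Open a new session, perform its single initial navigation, return the HTML."""
    browser = await pw.chromium.connect_over_cdp(SBR_WS_SERVER)
    try:
        page = await browser.new_page()
        await page.goto(url, timeout=2 * 60 * 1000)
        return await page.content()
    finally:
        # Close before the next URL so a new session can be started
        await browser.close()

async def fetch_all(pw, urls):
    """Scrape each starting URL in its own session, one after another."""
    results = {}
    for url in urls:
        results[url] = await fetch_html(pw, url)
    return results

async def main():
    # Imported here so the helpers above do not require Playwright at import time
    from playwright.async_api import async_playwright
    async with async_playwright() as pw:
        pages = await fetch_all(pw, ['https://example.com', 'https://example.org'])
        for url, html in pages.items():
            print(url, len(html))
```

Run it with asyncio.run(main()). The sessions run sequentially on purpose, so that only one session is open at any moment.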

Session Time Limits

  • Automatic Timeout Mechanism: All browser sessions are limited to a maximum lifespan of 30 minutes. If a session is not actively terminated by a script command, the system will automatically end it after this period.

  • Web Console Specific Limit: In the Web Console environment, the system enforces a policy of one active session per account. To avoid resource conflicts and potential errors, please ensure your automation scripts include logic to explicitly close sessions.
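
One way to respect both limits is to give each scraping job a time budget below 30 minutes and to close the session in a finally block, whether the job succeeds, fails, or times out. The sketch below uses only asyncio; the helper name run_with_budget and the 60-second safety margin are assumptions chosen for illustration, not Thordata settings.

```python
import asyncio

SESSION_LIMIT_SECONDS = 30 * 60   # maximum session lifespan stated above
SAFETY_MARGIN_SECONDS = 60        # hypothetical margin, adjust to your workload

async def run_with_budget(work, close,
                          budget=SESSION_LIMIT_SECONDS - SAFETY_MARGIN_SECONDS):
    """Run work() with a time budget and always call close() afterwards.

    work  -- zero-argument coroutine function containing the scraping logic
    close -- zero-argument coroutine function that ends the session,
             for example browser.close
    Raises asyncio.TimeoutError if the budget is exceeded.
    """
    try:
        return await asyncio.wait_for(work(), timeout=budget)
    finally:
        # Explicitly end the session so it does not linger until the server timeout
        await close()
```

With the sample script, a call could look like await run_with_budget(lambda: scrape(browser), browser.close), where scrape is your own coroutine function.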

If you require further assistance with configuration, please contact the Thordata support team.
