How to Configure Thordata's Scraping Browser
This article will guide you through the entire configuration and usage process of the Thordata Scraping Browser, including credential acquisition, basic setup, running sample scripts, and managing live sessions. By following this guide, you will be able to get started quickly and perform web data scraping efficiently.
Before You Begin
Please make sure you have your account credentials ready – that is, the username and password used by web automation tools.
You can view these credentials directly in the "Playground" tab within the Thordata Scraping Browser section. We assume you have obtained valid credentials. If not, please apply for them from Thordata.
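Credentials are passed to the Scraping Browser by embedding them in the connection URL, as the examples later in this guide show. One practical detail worth noting: if your password contains special characters, they should be percent-encoded so they cannot break the URL structure. A minimal sketch (the password below is hypothetical; the account placeholder and endpoint match the ones used in the examples that follow):

```python
from urllib.parse import quote

# Placeholder credentials - replace with the values from your dashboard.
USERNAME = 'PROXY-FULL-ACCOUNT'
PASSWORD = 'p@ss/word'  # hypothetical password containing special characters

# Percent-encode both parts so characters like '@' or '/' cannot
# be confused with the URL's own delimiters.
auth = f'{quote(USERNAME, safe="")}:{quote(PASSWORD, safe="")}'
endpoint = f'wss://{auth}@ws-browser.thordata.com'
print(endpoint)
```

If your password contains only letters and digits, the encoding step is a no-op and you can embed the credentials directly, as the sample scripts below do.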
Basic Configuration
Before using the Scraping Browser, you must complete some basic environment configuration. This guide walks you step by step through configuring your credentials, setting basic API parameters, and managing live browser sessions in the operations console.
Scraping Browser Quick Start Examples
We have prepared a series of scraping examples to help you get started quickly. You only need to replace the personal credentials and target URL in the script, and then adjust and extend it according to your actual business needs. For writing more complex scraping logic, please refer to the supported framework protocols in the official Thordata documentation.
You can debug scripts online in the "Playground" within the dashboard, or execute actual scraping tasks in your local environment.
If you choose to run locally, please ensure the corresponding dependencies are installed (refer to Thordata's supported framework protocols). After correctly configuring your identity credentials, execute the sample script to obtain the target data.
import asyncio
from playwright.async_api import async_playwright

# Enter your credentials - the full account name and password
AUTH = 'PROXY-FULL-ACCOUNT:PASSWORD'
SBR_WS_SERVER = f'wss://{AUTH}@ws-browser.thordata.com'

async def run(pw):
    print('Connecting to Browser API...')
    browser = await pw.chromium.connect_over_cdp(SBR_WS_SERVER)
    try:
        print('Connected! Navigating to target...')
        page = await browser.new_page()
        await page.goto('https://example.com', timeout=2 * 60 * 1000)
        # Screenshot
        print('Taking a screenshot of the page...')
        await page.screenshot(path='./remote_screenshot_page.png')
        # HTML content
        print('Scraping page content...')
        html = await page.content()
        print(html)
    finally:
        # Always close the browser so the session is released
        await browser.close()

async def main():
    async with async_playwright() as playwright:
        await run(playwright)

if __name__ == '__main__':
    asyncio.run(main())
from selenium.webdriver import Remote, ChromeOptions
from selenium.webdriver.chromium.remote_connection import ChromiumRemoteConnection
from selenium.webdriver.common.by import By

# Enter your credentials - the full account name and password
AUTH = 'PROXY-FULL-ACCOUNT:PASSWORD'
REMOTE_WEBDRIVER = f'https://{AUTH}@hs-browser.thordata.com'

def main():
    print('Connecting to Browser API...')
    sbr_connection = ChromiumRemoteConnection(REMOTE_WEBDRIVER, 'goog', 'chrome')
    with Remote(sbr_connection, options=ChromeOptions()) as driver:
        # Navigate to the target URL
        print('Connected! Navigating to target...')
        driver.get('https://example.com')
        # Screenshot
        print('Saving a screenshot to remote_page.png...')
        driver.get_screenshot_as_file('./remote_page.png')
        # HTML content
        print('Getting page content...')
        html = driver.page_source
        print(html)

if __name__ == '__main__':
    main()
const puppeteer = require('puppeteer-core');

// Enter your credentials - the full account name and password
const AUTH = 'PROXY-FULL-ACCOUNT:PASSWORD';
const SBR_WS_ENDPOINT = `wss://${AUTH}@ws-browser.thordata.com`;

(async () => {
    console.log('Connecting to Scraping Browser...');
    const browser = await puppeteer.connect({
        browserWSEndpoint: SBR_WS_ENDPOINT,
        defaultViewport: { width: 1920, height: 1080 }
    });
    try {
        console.log('Connected! Navigating to target URL...');
        const page = await browser.newPage();
        await page.goto('https://example.com', { timeout: 2 * 60 * 1000 });
        // 1. Screenshot
        console.log('Saving a screenshot to remote_screenshot.png...');
        await page.screenshot({ path: 'remote_screenshot.png' });
        console.log('Screenshot saved');
        // 2. Get content
        console.log('Getting page content...');
        const html = await page.content();
        console.log('Source HTML: ', html);
    } finally {
        // Always close the browser after the script finishes so the session is released
        await browser.close();
    }
})();

const pw = require('playwright');
// Enter your credentials - the full account name and password
const AUTH = 'PROXY-FULL-ACCOUNT:PASSWORD';
const SBR_CDP = `wss://${AUTH}@ws-browser.thordata.com`;

async function main() {
    console.log('Connecting to Browser API...');
    const browser = await pw.chromium.connectOverCDP(SBR_CDP);
    try {
        console.log('Connected! Navigating to target...');
        const page = await browser.newPage();
        // Target URL
        await page.goto('https://example.com', { timeout: 2 * 60 * 1000 });
        // Screenshot
        console.log('Taking a screenshot of the page...');
        await page.screenshot({ path: './remote_screenshot_page.png' });
        // HTML content
        console.log('Scraping page content...');
        const html = await page.content();
        console.log(html);
    } finally {
        // Always close the browser after the script finishes so the session is released
        await browser.close();
    }
}

if (require.main === module) {
    main().catch(err => {
        console.error(err.stack || err);
        process.exit(1);
    });
}

Browser API Initial Navigation
According to the Scraping Browser's session management mechanism, each session only allows one initial navigation – the operation that first loads the target website for data extraction. Once this session starting point is established, users can then navigate freely within that website via interactive actions like clicking and scrolling.
However, any scraping task that requires starting over from the initial navigation phase – whether targeting the same website or a different one – must be accomplished by creating a new session.
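The rule above can be sketched as an orchestration loop: because each session permits exactly one initial navigation, every target URL gets its own connect/scrape/close cycle. The `connect` callable and `fetch` method below are hypothetical stand-ins for your actual Playwright or Selenium connection code, shown only to illustrate the session-per-target structure:

```python
import asyncio

async def scrape_each(urls, connect):
    """One session per initial navigation: connect, load one target, close."""
    results = {}
    for url in urls:
        browser = await connect()   # fresh session for each target site
        try:
            # The single initial navigation allowed in this session.
            results[url] = await browser.fetch(url)
        finally:
            await browser.close()   # release the session promptly
    return results
```

With Playwright, for example, `connect` would wrap `pw.chromium.connect_over_cdp(...)` and the hypothetical `fetch` would be replaced by `new_page()` followed by `goto()` and `content()`.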
Session Time Limits
Automatic Timeout Mechanism: All browser sessions are limited to a maximum lifespan of 30 minutes. If a session is not actively terminated by a script command, the system will automatically end it after this period.
Web Console Specific Limit: In the Web Console environment, the system enforces a policy of one active session per account. To avoid resource conflicts and potential errors, please ensure your automation scripts include logic to explicitly close sessions.
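Because sessions are force-closed after 30 minutes, it can help to enforce your own, shorter deadline so a script fails cleanly instead of being cut off mid-task. A minimal sketch using asyncio (the 25-minute budget is an arbitrary assumption, not a Thordata requirement):

```python
import asyncio

SESSION_BUDGET = 25 * 60  # seconds; stay comfortably under the 30-minute hard limit

async def run_with_budget(task_coro, budget=SESSION_BUDGET):
    """Run a scraping coroutine, but give up before the platform timeout hits."""
    try:
        return await asyncio.wait_for(task_coro, timeout=budget)
    except asyncio.TimeoutError:
        # Cancelled by our own deadline, not by the platform's 30-minute limit.
        raise RuntimeError(f'Scrape exceeded the {budget}s session budget')
```

Pair this with a `try`/`finally` that calls `browser.close()` (as in the samples above) so the session is always released, which also avoids conflicts with the one-active-session limit in the Web Console.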
If you require further assistance with configuration, please feel free to contact us at: [email protected].