How to Configure Thordata's Scraping Browser
This article will guide you through the entire configuration and usage process of the Thordata Scraping Browser, including credential acquisition, basic setup, running sample scripts, and managing live sessions. By following this guide, you will be able to get started quickly and perform web data scraping efficiently.
Before You Begin
Please make sure you have your account credentials ready – that is, the username and password used by web automation tools.
You can view these credentials directly in the "Playground" tab within the Thordata Scraping Browser section. We assume you have obtained valid credentials. If not, please apply for them from Thordata.
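Credentials are passed to the Scraping Browser by embedding them in the connection URL, as the examples later in this guide show. One practical detail worth noting: if your password contains special characters, they should be percent-encoded so they cannot break the URL structure. A minimal sketch (the password below is hypothetical; the account placeholder and endpoint match the ones used in the examples that follow):

```python
from urllib.parse import quote

# Placeholder credentials - replace with the values from your dashboard.
USERNAME = 'PROXY-FULL-ACCOUNT'
PASSWORD = 'p@ss/word'  # hypothetical password containing special characters

# Percent-encode both parts so characters like '@' or '/' cannot
# be confused with the URL's own delimiters.
auth = f'{quote(USERNAME, safe="")}:{quote(PASSWORD, safe="")}'
endpoint = f'wss://{auth}@ws-browser.thordata.com'
print(endpoint)
```

If your password contains only letters and digits, the encoding step is a no-op and you can embed the credentials directly, as the sample scripts below do.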
Basic Configuration
Before using the Scraping Browser, you must complete some basic environment configuration. This guide walks you step by step through configuring your credentials, setting basic API parameters, and managing live browser sessions in the operations console.
Scraping Browser Quick Start Examples
We have prepared a series of scraping examples to help you get started quickly. You only need to replace the personal credentials and target URL in the script, and then adjust and extend it according to your actual business needs. For writing more complex scraping logic, please refer to the supported framework protocols in the official Thordata documentation.
You can debug scripts online in the "Playground" within the dashboard, or execute actual scraping tasks in your local environment.
If you choose to run locally, please ensure the corresponding dependencies are installed (refer to Thordata's supported framework protocols). After correctly configuring your identity credentials, execute the sample script to obtain the target data.
import asyncio
from playwright.async_api import async_playwright

# Enter your credentials - the full account name and password
AUTH = 'PROXY-FULL-ACCOUNT:PASSWORD'
SBR_WS_SERVER = f'wss://{AUTH}@ws-browser.thordata.com'

async def run(pw):
    print('Connecting to Browser API...')
    browser = await pw.chromium.connect_over_cdp(SBR_WS_SERVER)
    try:
        print('Connected! Navigating to target...')
        page = await browser.new_page()
        await page.goto('https://example.com', timeout=2 * 60 * 1000)
        # Screenshot
        print('Taking a screenshot of the page...')
        await page.screenshot(path='./remote_screenshot_page.png')
        # HTML content
        print('Scraping page content...')
        html = await page.content()
        print(html)
    finally:
        # Always close the browser so the session is released
        await browser.close()

async def main():
    async with async_playwright() as playwright:
        await run(playwright)

if __name__ == '__main__':
    asyncio.run(main())
from selenium.webdriver import Remote, ChromeOptions
from selenium.webdriver.chromium.remote_connection import ChromiumRemoteConnection
from selenium.webdriver.common.by import By

# Enter your credentials - the full account name and password
AUTH = 'PROXY-FULL-ACCOUNT:PASSWORD'
REMOTE_WEBDRIVER = f'https://{AUTH}@hs-browser.thordata.com'

def main():
    print('Connecting to Browser API...')
    sbr_connection = ChromiumRemoteConnection(REMOTE_WEBDRIVER, 'goog', 'chrome')
    with Remote(sbr_connection, options=ChromeOptions()) as driver:
        # Navigate to the target URL
        print('Connected! Navigating to target...')
        driver.get('https://example.com')
        # Screenshot
        print('Saving a screenshot to remote_page.png...')
        driver.get_screenshot_as_file('./remote_page.png')
        # HTML content
        print('Getting page content...')
        html = driver.page_source
        print(html)

if __name__ == '__main__':
    main()
const puppeteer = require('puppeteer-core');

// Enter your credentials - the full account name and password
const AUTH = 'PROXY-FULL-ACCOUNT:PASSWORD';
const SBR_WS_ENDPOINT = `wss://${AUTH}@ws-browser.thordata.com`;

(async () => {
    console.log('Connecting to Scraping Browser...');
    const browser = await puppeteer.connect({
        browserWSEndpoint: SBR_WS_ENDPOINT,
        defaultViewport: { width: 1920, height: 1080 }
    });
    try {
        console.log('Connected! Navigating to target URL...');
        const page = await browser.newPage();
        await page.goto('https://example.com', { timeout: 2 * 60 * 1000 });
        // 1. Screenshot
        console.log('Saving a screenshot to remote_screenshot.png...');
        await page.screenshot({ path: 'remote_screenshot.png' });
        console.log('Screenshot saved');
        // 2. Get content
        console.log('Getting page content...');
        const html = await page.content();
        console.log('Source HTML: ', html);
    } finally {
        // Always close the browser after the script finishes so the session is released
        await browser.close();
    }
})();

const pw = require('playwright');
// Enter your credentials - the full account name and password
const AUTH = 'PROXY-FULL-ACCOUNT:PASSWORD';
const SBR_CDP = `wss://${AUTH}@ws-browser.thordata.com`;

async function main() {
    console.log('Connecting to Browser API...');
    const browser = await pw.chromium.connectOverCDP(SBR_CDP);
    try {
        console.log('Connected! Navigating to target...');
        const page = await browser.newPage();
        // Target URL
        await page.goto('https://example.com', { timeout: 2 * 60 * 1000 });
        // Screenshot
        console.log('Taking a screenshot of the page...');
        await page.screenshot({ path: './remote_screenshot_page.png' });
        // HTML content
        console.log('Scraping page content...');
        const html = await page.content();
        console.log(html);
    } finally {
        // Always close the browser after the script finishes so the session is released
        await browser.close();
    }
}

if (require.main === module) {
    main().catch(err => {
        console.error(err.stack || err);
        process.exit(1);
    });
}

Browser API Initial Navigation
According to the Scraping Browser's session management mechanism, each session only allows one initial navigation – the operation that first loads the target website for data extraction. Once this session starting point is established, users can then navigate freely within that website via interactive actions like clicking and scrolling.
However, any scraping task that requires starting over from the initial navigation phase – whether targeting the same website or a different one – must be accomplished by creating a new session.
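The rule above can be sketched as an orchestration loop: because each session permits exactly one initial navigation, every target URL gets its own connect/scrape/close cycle. The `connect` callable and `fetch` method below are hypothetical stand-ins for your actual Playwright or Selenium connection code, shown only to illustrate the session-per-target structure:

```python
import asyncio

async def scrape_each(urls, connect):
    """One session per initial navigation: connect, load one target, close."""
    results = {}
    for url in urls:
        browser = await connect()   # fresh session for each target site
        try:
            # The single initial navigation allowed in this session.
            results[url] = await browser.fetch(url)
        finally:
            await browser.close()   # release the session promptly
    return results
```

With Playwright, for example, `connect` would wrap `pw.chromium.connect_over_cdp(...)` and the hypothetical `fetch` would be replaced by `new_page()` followed by `goto()` and `content()`.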
Session Time Limits
Automatic Timeout Mechanism: All browser sessions are limited to a maximum lifespan of 30 minutes. If a session is not actively terminated by a script command, the system will automatically end it after this period.
Web Console Specific Limit: In the Web Console environment, the system enforces a policy of one active session per account. To avoid resource conflicts and potential errors, please ensure your automation scripts include logic to explicitly close sessions.
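Because sessions are force-closed after 30 minutes, it can help to enforce your own, shorter deadline so a script fails cleanly instead of being cut off mid-task. A minimal sketch using asyncio (the 25-minute budget is an arbitrary assumption, not a Thordata requirement):

```python
import asyncio

SESSION_BUDGET = 25 * 60  # seconds; stay comfortably under the 30-minute hard limit

async def run_with_budget(task_coro, budget=SESSION_BUDGET):
    """Run a scraping coroutine, but give up before the platform timeout hits."""
    try:
        return await asyncio.wait_for(task_coro, timeout=budget)
    except asyncio.TimeoutError:
        # Cancelled by our own deadline, not by the platform's 30-minute limit.
        raise RuntimeError(f'Scrape exceeded the {budget}s session budget')
```

Pair this with a `try`/`finally` that calls `browser.close()` (as in the samples above) so the session is always released, which also avoids conflicts with the one-active-session limit in the Web Console.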
If you require further assistance with configuration, please feel free to contact us at: [email protected].