Configuration

How to configure the Thordata Scraping Browser

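This guide walks you through configuring and using the Thordata Scraping Browser: obtaining credentials, basic configuration, running the sample scripts, and managing live sessions.

Before you start, have your account credentials ready, that is, the username and password used by your web automation tool. You can view them on the "Playground" tab of the Scraping Browser area in Thordata. This guide assumes you already have valid credentials; if you do not, request them from Thordata.

Basic environment configuration is required before you can use the Scraping Browser. The sections below cover setting your credentials, setting the basic API parameters, and managing live browser sessions in the console.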
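The sample scripts in this guide embed the credentials in the connection URL in the form `username:password@host`. As a minimal sketch, assuming a username or password may contain characters that are reserved in URLs (such as `@` or `:`), the helper below percent-encodes both parts before building the URL. The function names `build_endpoint` and `endpoint_from_env` and the environment variable names `THORDATA_USERNAME` and `THORDATA_PASSWORD` are illustrative assumptions, not part of Thordata's API. The host names are the ones used in the sample scripts.

```python
import os
from urllib.parse import quote

# Hosts taken from the sample scripts in this guide.
WS_HOST = "ws-browser.thordata.com"     # Playwright / Puppeteer (WebSocket)
HTTPS_HOST = "hs-browser.thordata.com"  # Selenium (remote WebDriver)


def build_endpoint(username: str, password: str, scheme: str = "wss") -> str:
    """Return a connection URL of the form scheme://user:pass@host.

    Hypothetical helper: percent-encodes both credentials so that characters
    such as '@', ':' or '/' cannot break the URL structure.
    """
    if scheme not in ("wss", "https"):
        raise ValueError("scheme must be 'wss' or 'https'")
    host = WS_HOST if scheme == "wss" else HTTPS_HOST
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{host}"


def endpoint_from_env(scheme: str = "wss") -> str:
    """Build the URL from (assumed) environment variables, not hard-coded values."""
    username = os.environ["THORDATA_USERNAME"]
    password = os.environ["THORDATA_PASSWORD"]
    return build_endpoint(username, password, scheme)
```

Reading the credentials from the environment keeps them out of source control. The resulting string can be used wherever the sample scripts use `SBR_WS_SERVER` or `REMOTE_WEBDRIVER`.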

Scraping Browser quick start examples

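We provide a set of scraping examples to help you get started. Replace the credentials and target URL in a script with your own, then adjust and extend it to fit your needs. For more complex scraping logic, see the supported frameworks and protocols described in the official Thordata documentation.

You can debug scripts online in the "Playground" in the dashboard, or run real scraping jobs in your local environment. To run locally, install the required dependencies (see the frameworks supported by Thordata), configure your credentials, and run the sample script to fetch the target data.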

# Python (Playwright)
import asyncio
from playwright.async_api import async_playwright

# Enter your credentials: username and password
AUTH = 'PROXY-FULL-ACCOUNT:PASSWORD'
SBR_WS_SERVER = f'wss://{AUTH}@ws-browser.thordata.com'

async def run(pw):
    print('Connecting to Browser API...')
    browser = await pw.chromium.connect_over_cdp(SBR_WS_SERVER)
    try:
        print('Connected! Navigating to target...')

        page = await browser.new_page()
        # Allow up to 2 minutes (in milliseconds) for the page to load
        await page.goto('https://example.com', timeout=2 * 60 * 1000)

        # Screenshot
        print('Taking a screenshot of the page')
        await page.screenshot(path='./remote_screenshot_page.png')

        # HTML content
        print('Scraping page content...')
        html = await page.content()
        print(html)

    finally:
        # Always close the browser so the session is released
        await browser.close()

async def main():
    async with async_playwright() as playwright:
        await run(playwright)

if __name__ == '__main__':
    asyncio.run(main())

# Python (Selenium)
from selenium.webdriver import Remote, ChromeOptions
from selenium.webdriver.chromium.remote_connection import ChromiumRemoteConnection
from selenium.webdriver.common.by import By

# Enter your credentials: username and password
AUTH = 'PROXY-FULL-ACCOUNT:PASSWORD'
REMOTE_WEBDRIVER = f'https://{AUTH}@hs-browser.thordata.com'

def main():
    print('Connecting to Browser API...')
    sbr_connection = ChromiumRemoteConnection(REMOTE_WEBDRIVER, 'goog', 'chrome')
    # The context manager quits the driver (and closes the session) on exit
    with Remote(sbr_connection, options=ChromeOptions()) as driver:

        # Open the target URL
        print('Connected! Navigating to target...')
        driver.get('https://example.com')

        # Screenshot
        print('Saving screenshot as PNG')
        driver.get_screenshot_as_file('./remote_page.png')

        # HTML content
        print('Getting page content...')
        html = driver.page_source
        print(html)

if __name__ == '__main__':
    main()

// Node.js (Puppeteer)
const puppeteer = require('puppeteer-core');

// Enter your credentials: username and password
const AUTH = 'PROXY-FULL-ACCOUNT:PASSWORD';
const WS_ENDPOINT = `wss://${AUTH}@ws-browser.thordata.com`;

(async () => {
    console.log('Connecting to Scraping Browser...');
    const browser = await puppeteer.connect({
        browserWSEndpoint: WS_ENDPOINT,
        defaultViewport: { width: 1920, height: 1080 }
    });
    try {
        console.log('Connected! Navigating to target URL');
        const page = await browser.newPage();

        // Allow up to 2 minutes (in milliseconds) for the page to load
        await page.goto('https://example.com', { timeout: 2 * 60 * 1000 });

        // 1. Screenshot
        console.log('Saving screenshot to remote_screenshot.png');
        await page.screenshot({ path: 'remote_screenshot.png' });
        console.log('Screenshot saved');

        // 2. Get content
        console.log('Getting page content...');
        const html = await page.content();
        console.log('Source HTML: ', html);

    } finally {
        // Always close the browser after the script finishes so the session is released
        await browser.close();
    }
})();

// Node.js (Playwright)
const pw = require('playwright');

// Enter your credentials: username and password
const AUTH = 'PROXY-FULL-ACCOUNT:PASSWORD';
const SBR_CDP = `wss://${AUTH}@ws-browser.thordata.com`;

async function main() {
    console.log('Connecting to Browser API...');
    const browser = await pw.chromium.connectOverCDP(SBR_CDP);
    try {
        console.log('Connected! Navigating to target...');
        const page = await browser.newPage();
        // Target URL; allow up to 2 minutes (in milliseconds) for the page to load
        await page.goto('https://www.windows.com', { timeout: 2 * 60 * 1000 });
        // Screenshot
        console.log('Taking a screenshot of the page');
        await page.screenshot({ path: './remote_screenshot_page.png' });

        // HTML content
        console.log('Scraping page content...');
        const html = await page.content();
        console.log(html);
    } finally {
        // Always close the browser after the script finishes so the session is released
        await browser.close();
    }
}

if (require.main === module) {
    main().catch(err => {
        console.error(err.stack || err);
        process.exit(1);
    });
}

Browser API initial navigation

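Under the Scraping Browser's session management, each session allows only one initial navigation, that is, the first load of the target site from which data is to be extracted. Once that starting point is established, you can move freely within the site through interactions such as clicking and scrolling. Any scraping job that needs to start again from an initial navigation, whether it targets the same site or a different one, must be run in a new session.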
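In practice, this rule means one session per initial navigation. A minimal sketch of that pattern follows, assuming a hypothetical `connect` coroutine function that opens a fresh session (for example a thin wrapper around Playwright's `connect_over_cdp`, as in the sample scripts). Each URL gets its own session, which is closed before the next one starts. The name `scrape_each` is illustrative only.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def scrape_each(
    urls: Iterable[str],
    connect: Callable[[], Awaitable],
) -> Dict[str, str]:
    """Fetch each URL in its own browser session.

    `connect` is a hypothetical zero-argument callable returning an awaitable
    that yields a browser-like object with async `new_page()` and `close()`,
    e.g. `lambda: pw.chromium.connect_over_cdp(endpoint)` with Playwright.
    """
    results: Dict[str, str] = {}
    for url in urls:
        browser = await connect()      # new session for each initial navigation
        try:
            page = await browser.new_page()
            await page.goto(url)       # the session's single initial navigation
            results[url] = await page.content()
        finally:
            await browser.close()      # release the session before the next URL
    return results
```

Running the sessions one after another, rather than concurrently, also stays within the one-active-session restriction that applies in the web console.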

Session time limits

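Automatic timeout: every browser session has a maximum lifetime of 30 minutes. If a session is not terminated by your script, the system ends it automatically once that limit is reached.

Web console restriction: in the web console, each account is limited to one active session at a time. To avoid resource conflicts and errors, make sure your automation scripts close the session explicitly.

If you need further help with configuration, contact us at [email protected].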
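Because a session that the script does not close stays open until the 30-minute limit, it can help to enforce a shorter time budget of your own and to close the session on every exit path. A minimal sketch follows, assuming a browser-like object with an async `close()` method. The name `run_with_budget` and the 25-minute default are illustrative choices, not Thordata settings.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

SESSION_LIMIT_SECONDS = 30 * 60  # maximum session lifetime described in this section


async def run_with_budget(
    browser,
    task: Callable[[object], Awaitable[T]],
    budget_seconds: float = 25 * 60,
) -> T:
    """Run `task(browser)` under a time budget and always close the session.

    Raises asyncio.TimeoutError if the task exceeds the budget.
    """
    if budget_seconds >= SESSION_LIMIT_SECONDS:
        raise ValueError("budget should stay below the 30-minute session limit")
    try:
        # wait_for cancels the task once the budget is exhausted
        return await asyncio.wait_for(task(browser), timeout=budget_seconds)
    finally:
        # Runs on success, timeout, and error alike
        await browser.close()
```

With Playwright, `task` could be a coroutine function such as the `page.goto` / `page.content` sequence from the sample scripts, taking the connected browser as its argument.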


Last updated 3 months ago
