
[Bug]: Proxy Not Working with proxy_config option

Open · SashaGordin opened this issue 11 months ago · 2 comments

crawl4ai version

Version: 0.4.248

Expected Behavior

The scraper should extract the following record from Airbnb's publicly available listing data:

[
  {
    "listingUrl": "https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738548815_P3f4RWmpcjMAY_yu&previous_page_section_name=1000&federated_search_id=e010a89c-aefa-4ef8-8979-c86350c281d7",
    "listingTitle": "Private King Room-Shared Bath",
    "listingLocation": "San Diego, California, United States",
    "hostNameOnPropertyPage": "Stay with Sam",
    "hostProfileLinkOnPropertyPage": "/users/show/7597786",
    "hostWork": "Lives in San Diego, CA",
    "hostAbout": null,
    "hostLocation": null
  }
]

Current Behavior

The scraper works without a proxy; however, when I add a proxy, I get timeout errors instead of the correct output.

Is this reproducible?

Yes

Inputs Causing the Bug

'proxy_config': { 
            'server': 'residential-proxy.scrapeops.io:8181?'
            'username': 'scrapeops',
            'password': 'SCRAPE_OPS_PASSWORD',
            'auth_type': 'basic'
        },


This is defined within my config variable.

config = {
    'initialUrl': 'https://www.airbnb.com/s/San-Diego/homes',
    'selectors': {[SELECTORS_DEFINED_HERE]},
    'browserConfig': {
        'headless': False,
        'verbose': True,
        'proxy_config': {[shown above]},
        'extra_args': ['--disable-blink-features=AutomationControlled', '--disable-images', '--disable-dev-shm-usage']
    },
    'maxListingsToScrape': 1,
    'cityToSearch': 'San Diego'
}
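
A side note on the proxy_config literal above: as posted, it has no comma after the 'server' value (and the trailing '?' on the server address looks stray). Without the comma the dict does not even parse, so the posted snippet is presumably a transcription slip rather than the exact code that ran. A quick check:

```python
# The proxy_config literal as posted in the issue (no comma after the
# 'server' value). Python cannot parse it: the two adjacent string
# literals are concatenated, and the parser then trips on the second
# colon, so this is a SyntaxError rather than a runtime proxy failure.
snippet = (
    "{'server': 'residential-proxy.scrapeops.io:8181?'\n"
    "'username': 'scrapeops',\n"
    "'password': 'SCRAPE_OPS_PASSWORD',\n"
    "'auth_type': 'basic'}"
)

try:
    compile(snippet, "<proxy_config>", "eval")
    parses = True
except SyntaxError:
    parses = False
# parses is False: the snippet as posted cannot be the exact code that ran.
```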

Steps to Reproduce


Code snippets

How I define the crawler with browser_config:

browser_config = BrowserConfig(**config['browserConfig'])
async with AsyncWebCrawler(config=browser_config) as crawler:
    ...

OS

macOS

Python version

Python 3.9.6

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

This is the expected behavior log:

INFO:main:Starting Airbnb scraper
[INIT].... → Crawl4AI 0.4.247
INFO:main:Delay after search results page load: 8.16s
[FETCH]... ↓ Raw HTML... | Status: True | Time: 0.00s
[SCRAPE].. ◆ Processed Raw HTML... | Time: 18353ms
[EXTRACT]. ■ Completed for Raw HTML... | Time: 0.15768066699999395s
[COMPLETE] ● Raw HTML... | Status: True | Total: 18.52s
INFO:main:Found 24 listings
INFO:main:Found 24 listing URLs
INFO:main:Initial delay before listing https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738548815_P3f4RWmpcjMAY_yu&previous_page_section_name=1000&federated_search_id=e010a89c-aefa-4ef8-8979-c86350c281d7: 1.72s
DOM content loaded after script execution in 0.00937199592590332
[FETCH]... ↓ https://www.airbnb.com/rooms/5862910?adults=1&cate... | Status: True | Time: 33.88s
[SCRAPE].. ◆ Processed https://www.airbnb.com/rooms/5862910?adults=1&cate... | Time: 165ms
[EXTRACT]. ■ Completed for https://www.airbnb.com/rooms/5862910?adults=1&cate... | Time: 0.12295133399999258s
[COMPLETE] ● https://www.airbnb.com/rooms/5862910?adults=1&cate... | Status: True | Total: 34.17s
INFO:main:Extracted Content for https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738548815_P3f4RWmpcjMAY_yu&previous_page_section_name=1000&federated_search_id=e010a89c-aefa-4ef8-8979-c86350c281d7 BEFORE JSON LOAD: [ { "listingTitle": "Private King Room-Shared Bath", "listingLocation": "San Diego, California, United States", "hostNameOnPropertyPage": "Stay with Sam", "hostProfileLinkOnPropertyPage": "/users/show/7597786" } ]
INFO:main:Delay before host profile page: 11.00s
[FETCH]... ↓ https://www.airbnb.com/users/show/7597786... | Status: True | Time: 6.89s
[SCRAPE].. ◆ Processed https://www.airbnb.com/users/show/7597786... | Time: 75ms
[EXTRACT]. ■ Completed for https://www.airbnb.com/users/show/7597786... | Time: 0.030629540999996152s
[COMPLETE] ● https://www.airbnb.com/users/show/7597786... | Status: True | Total: 7.00s
INFO:main:Successfully processed 1 listings
INFO:main:Final data: [ { "listingUrl": "https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738548815_P3f4RWmpcjMAY_yu&previous_page_section_name=1000&federated_search_id=e010a89c-aefa-4ef8-8979-c86350c281d7", "listingTitle": "Private King Room-Shared Bath", "listingLocation": "San Diego, California, United States", "hostNameOnPropertyPage": "Stay with Sam", "hostProfileLinkOnPropertyPage": "/users/show/7597786", "hostWork": "Lives in San Diego, CA", "hostAbout": null, "hostLocation": null } ]
INFO:main:Scraping completed

These are the logs of the failed run when I add the proxy server:

INFO:main:Starting Airbnb scraper
[INIT].... → Crawl4AI 0.4.247
INFO:main:Delay after search results page load: 10.42s
[FETCH]... ↓ Raw HTML... | Status: True | Time: 0.01s
[SCRAPE].. ◆ Processed Raw HTML... | Time: 17973ms
[EXTRACT]. ■ Completed for Raw HTML... | Time: 0.1515022909999999s
[COMPLETE] ● Raw HTML... | Status: True | Total: 18.14s
INFO:main:Found 24 listings
INFO:main:Found 24 listing URLs
INFO:main:Initial delay before listing https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738549529_P3mfSH1EV8Q_lbPE&previous_page_section_name=1000&federated_search_id=94cf8485-6d72-448d-be25-56c008367fe4: 5.37s
[ERROR]... × https://www.airbnb.com/rooms/5862910?adults=1&cate... | Error:
× Unexpected error in _crawl_web at line 1205 in _crawl_web (crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 240000ms exceeded.
Call log:
- navigating to "https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738549529_P3mfSH1EV8Q_lbPE&previous_page_section_name=1000&federated_search_id=94cf8485-6d72-448d-be25-56c008367fe4", waiting until "networkidle"

Code context:
1200
1201     response = await page.goto(
1202         url, wait_until=config.wait_until, timeout=config.page_timeout
1203     )
1204 except Error as e:
1205 →   raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
1206
1207 await self.execute_hook("after_goto", page, context=context, url=url, response=response)
1208
1209 if response is None:
1210     status_code = 200

INFO:main:Extracted Content for https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738549529_P3mfSH1EV8Q_lbPE&previous_page_section_name=1000&federated_search_id=94cf8485-6d72-448d-be25-56c008367fe4 BEFORE JSON LOAD: None
WARNING:main:Failed to extract property data for https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738549529_P3mfSH1EV8Q_lbPE&previous_page_section_name=1000&federated_search_id=94cf8485-6d72-448d-be25-56c008367fe4
ERROR:main:Main process failed: list index out of range
INFO:main:Scraping completed
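
An observation on the trace above: the navigation waits until "networkidle", and through a slower residential proxy a request-heavy site like Airbnb may never go network-idle before the 240s timeout. If that is the cause, relaxing the wait condition on the run config is worth trying. A sketch, using only the CrawlerRunConfig fields visible in the traceback (wait_until, page_timeout); the values are guesses to experiment with, not a verified fix:

```python
from crawl4ai import CrawlerRunConfig

# Sketch: wait_until and page_timeout are the config fields visible in
# the traceback above; these values are experiments, not a confirmed fix.
run_config = CrawlerRunConfig(
    wait_until="domcontentloaded",  # stop waiting for full network idle
    page_timeout=120_000,           # ms
)
# Then pass it per-crawl, e.g. crawler.arun(url, config=run_config)
```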

SashaGordin · Feb 03 '25 02:02

@unclecode

This is the issue I submitted, per our conversation on X.

SashaGordin · Feb 03 '25 07:02

@SashaGordin Thanks for sharing, I'll check this out. @aravindkarnam, we had a similar discussion in another issue where I suggested using Playwright's page router. Please see if you can find it. The idea is to filter out unnecessary network requests on sites with heavy front- and back-end communication. This reduces the number of requests, lowers proxy demand, and speeds up the whole process.

unclecode · Feb 04 '25 08:02
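
For reference, the page-router idea above maps onto Playwright's `page.route()`: register a handler that aborts requests for heavy resource types so they never travel through the proxy. A minimal sketch; `BLOCKED_TYPES` and `block_noise` are illustrative names for this example, not crawl4ai API:

```python
# Illustrative sketch of request filtering with Playwright's page.route().
# BLOCKED_TYPES and block_noise are made-up names, not crawl4ai API.
BLOCKED_TYPES = {"image", "media", "font", "stylesheet"}

def should_block(resource_type: str) -> bool:
    """Return True for requests not worth sending through the proxy."""
    return resource_type in BLOCKED_TYPES

async def block_noise(route):
    # `route` is a playwright.async_api.Route; Playwright classifies each
    # request into resource types like "document", "image", "xhr", ...
    if should_block(route.request.resource_type):
        await route.abort()
    else:
        await route.continue_()

# Registration, inside whatever hook exposes the Playwright page:
#     await page.route("**/*", block_noise)
```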

We've now moved to ProxyConfig and have also implemented a proxy rotation strategy.

Here's sample code showing how to use it:

import asyncio

from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    ProxyConfig,
    RoundRobinProxyStrategy,
)

async def demo_proxy_rotation():
    """Proxy rotation for multiple requests"""
    print("\n=== 10. Proxy Rotation ===")

    # Example proxies (replace with real ones)
    proxies = [
        ProxyConfig(server="http://proxy1.example.com:8080"),
        ProxyConfig(server="http://proxy2.example.com:8080"),
    ]

    proxy_strategy = RoundRobinProxyStrategy(proxies)

    print(f"Using {len(proxies)} proxies in rotation")
    print(
        "Note: This example uses placeholder proxies - replace with real ones to test"
    )

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            proxy_rotation_strategy=proxy_strategy
        )

        # In a real scenario, these would be run and the proxies would rotate
        print("In a real scenario, requests would rotate through the available proxies")

asyncio.run(demo_proxy_rotation())
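
For anyone curious what the rotation does: proxies are handed out in a repeating cycle, one per request. A stand-alone illustration of that selection order in plain Python (not the crawl4ai `RoundRobinProxyStrategy` class itself, just the idea):

```python
from itertools import cycle

# Stand-alone illustration of round-robin proxy selection: each request
# takes the next proxy in a repeating cycle. Hosts are placeholders.
servers = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]
rotation = cycle(servers)

picks = [next(rotation) for _ in range(4)]
# picks alternates: proxy1, proxy2, proxy1, proxy2
```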

Closing this issue since it seems to stem from proxy timeouts under heavy network traffic, or possibly from the proxy's IP being flagged as a bot by the server.

Closing this as a bug, but we'll consider the changes unclecode suggested for our future roadmap.

aravindkarnam · May 08 '25 07:05