[Bug]: Proxy Not Working with proxy_config option
crawl4ai version
Version: 0.4.248
Expected Behavior
The expected behavior is for the scraper to extract the following data from Airbnb's publicly available listing pages:
[
  {
    "listingUrl": "https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738548815_P3f4RWmpcjMAY_yu&previous_page_section_name=1000&federated_search_id=e010a89c-aefa-4ef8-8979-c86350c281d7",
    "listingTitle": "Private King Room-Shared Bath",
    "listingLocation": "San Diego, California, United States",
    "hostNameOnPropertyPage": "Stay with Sam",
    "hostProfileLinkOnPropertyPage": "/users/show/7597786",
    "hostWork": "Lives in San Diego, CA",
    "hostAbout": null,
    "hostLocation": null
  }
]
Current Behavior
The scraper works without a proxy; however, when I add a proxy, I get timeout errors instead of the correct output.
Is this reproducible?
Yes
Inputs Causing the Bug
'proxy_config': {
    'server': 'residential-proxy.scrapeops.io:8181',
    'username': 'scrapeops',
    'password': 'SCRAPE_OPS_PASSWORD',
    'auth_type': 'basic'
},
This is defined within my config variable.
config = {
    'initialUrl': 'https://www.airbnb.com/s/San-Diego/homes',
    'selectors': {[SELECTORS_DEFINED_HERE]},
    'browserConfig': {
        'headless': False,
        'verbose': True,
        'proxy_config': {[shown above]},
        'extra_args': ['--disable-blink-features=AutomationControlled', '--disable-images', '--disable-dev-shm-usage']
    },
    'maxListingsToScrape': 1,
    'cityToSearch': 'San Diego'
}
Steps to Reproduce
Code snippets
How I define the crawler with browser_config:
browser_config = BrowserConfig(**config['browserConfig'])
async with AsyncWebCrawler(config=browser_config) as crawler:
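For reference, here is a minimal, self-contained sketch of how the pieces are wired together. The values below are placeholders (the real script builds the dict from config['browserConfig'] and also passes extraction selectors), and whether the ScrapeOps server string needs an explicit http:// scheme is an assumption on my part:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

# Placeholder values; the real script builds this from config['browserConfig'].
browser_config = BrowserConfig(
    headless=False,
    verbose=True,
    proxy_config={
        'server': 'http://residential-proxy.scrapeops.io:8181',  # scheme prefix is an assumption
        'username': 'scrapeops',
        'password': 'SCRAPE_OPS_PASSWORD',  # placeholder credential
    },
    extra_args=['--disable-blink-features=AutomationControlled'],
)

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Fetch the search results page through the configured proxy.
        result = await crawler.arun(url='https://www.airbnb.com/s/San-Diego/homes')
        print(result.success, result.status_code)

asyncio.run(main())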
OS
macOS
Python version
Python 3.9.6
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
This is the expected behavior log:
INFO:main:Starting Airbnb scraper
[INIT].... Crawl4AI 0.4.247
INFO:main:Delay after search results page load: 8.16s
[FETCH]... Raw HTML... | Status: True | Time: 0.00s
[SCRAPE].. Processed Raw HTML... | Time: 18353ms
[EXTRACT]. Completed for Raw HTML... | Time: 0.15768066699999395s
[COMPLETE] Raw HTML... | Status: True | Total: 18.52s
INFO:main:Found 24 listings
INFO:main:Found 24 listing URLs
INFO:main:Initial delay before listing https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738548815_P3f4RWmpcjMAY_yu&previous_page_section_name=1000&federated_search_id=e010a89c-aefa-4ef8-8979-c86350c281d7: 1.72s
DOM content loaded after script execution in 0.00937199592590332
[FETCH]... https://www.airbnb.com/rooms/5862910?adults=1&cate... | Status: True | Time: 33.88s
[SCRAPE].. Processed https://www.airbnb.com/rooms/5862910?adults=1&cate... | Time: 165ms
[EXTRACT]. Completed for https://www.airbnb.com/rooms/5862910?adults=1&cate... | Time: 0.12295133399999258s
[COMPLETE] https://www.airbnb.com/rooms/5862910?adults=1&cate... | Status: True | Total: 34.17s
INFO:main:Extracted Content for https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738548815_P3f4RWmpcjMAY_yu&previous_page_section_name=1000&federated_search_id=e010a89c-aefa-4ef8-8979-c86350c281d7 BEFORE JSON LOAD: [ { "listingTitle": "Private King Room-Shared Bath", "listingLocation": "San Diego, California, United States", "hostNameOnPropertyPage": "Stay with Sam", "hostProfileLinkOnPropertyPage": "/users/show/7597786" } ]
INFO:main:Delay before host profile page: 11.00s
[FETCH]... https://www.airbnb.com/users/show/7597786... | Status: True | Time: 6.89s
[SCRAPE].. Processed https://www.airbnb.com/users/show/7597786... | Time: 75ms
[EXTRACT]. Completed for https://www.airbnb.com/users/show/7597786... | Time: 0.030629540999996152s
[COMPLETE] https://www.airbnb.com/users/show/7597786... | Status: True | Total: 7.00s
INFO:main:Successfully processed 1 listings
INFO:main:Final data: [ { "listingUrl": "https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738548815_P3f4RWmpcjMAY_yu&previous_page_section_name=1000&federated_search_id=e010a89c-aefa-4ef8-8979-c86350c281d7", "listingTitle": "Private King Room-Shared Bath", "listingLocation": "San Diego, California, United States", "hostNameOnPropertyPage": "Stay with Sam", "hostProfileLinkOnPropertyPage": "/users/show/7597786", "hostWork": "Lives in San Diego, CA", "hostAbout": null, "hostLocation": null } ]
INFO:main:Scraping completed
These are the logs of the failed run when I add the proxy server:
INFO:main:Starting Airbnb scraper
[INIT].... Crawl4AI 0.4.247
INFO:main:Delay after search results page load: 10.42s
[FETCH]... Raw HTML... | Status: True | Time: 0.01s
[SCRAPE].. Processed Raw HTML... | Time: 17973ms
[EXTRACT]. Completed for Raw HTML... | Time: 0.1515022909999999s
[COMPLETE] Raw HTML... | Status: True | Total: 18.14s
INFO:main:Found 24 listings
INFO:main:Found 24 listing URLs
INFO:main:Initial delay before listing https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738549529_P3mfSH1EV8Q_lbPE&previous_page_section_name=1000&federated_search_id=94cf8485-6d72-448d-be25-56c008367fe4: 5.37s
[ERROR]... https://www.airbnb.com/rooms/5862910?adults=1&cate... | Error:
Unexpected error in _crawl_web at line 1205 in _crawl_web (crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 240000ms exceeded.
Call log:
  - navigating to "https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738549529_P3mfSH1EV8Q_lbPE&previous_page_section_name=1000&federated_search_id=94cf8485-6d72-448d-be25-56c008367fe4", waiting until "networkidle"

Code context:
1200
1201    response = await page.goto(
1202        url, wait_until=config.wait_until, timeout=config.page_timeout
1203    )
1204  except Error as e:
1205      raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
1206
1207  await self.execute_hook("after_goto", page, context=context, url=url, response=response)
1208
1209  if response is None:
1210      status_code = 200

INFO:main:Extracted Content for https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738549529_P3mfSH1EV8Q_lbPE&previous_page_section_name=1000&federated_search_id=94cf8485-6d72-448d-be25-56c008367fe4 BEFORE JSON LOAD: None
WARNING:main:Failed to extract property data for https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738549529_P3mfSH1EV8Q_lbPE&previous_page_section_name=1000&federated_search_id=94cf8485-6d72-448d-be25-56c008367fe4
ERROR:main:Main process failed: list index out of range
INFO:main:Scraping completed
@unclecode
This is the issue I submitted, per our conversation on X.
@SashaGordin Thanks for sharing, I'll check this out. @aravindkarnam, we had a similar discussion in another issue where I suggested using Playwright's page router. Please see if you can find it. The idea is to filter out unnecessary network requests from sites with heavy front-end and back-end communication. This reduces requests, lowers proxy demand, and speeds up the whole process.
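As a rough sketch of that idea (not the exact implementation discussed in the other issue): Playwright's page.route can abort requests for heavy resource types before they ever go through the proxy. In crawl4ai this filter would be attached through the crawler strategy's hook mechanism; the hook name used below (on_page_context_created) and the set_hook wiring are assumptions for illustration, and the blocked-resource list is just an example:

from playwright.async_api import Route

# Heavy, non-essential resource types to drop (example list).
BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}

async def route_filter(route: Route):
    # Abort matching requests so they never consume proxy bandwidth.
    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
        await route.abort()
    else:
        await route.continue_()

async def on_page_created(page, context=None, **kwargs):
    # Register the filter for every request the page makes.
    await page.route("**/*", route_filter)
    return page

# Assumed wiring (hook name is illustrative):
# crawler.crawler_strategy.set_hook("on_page_context_created", on_page_created)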
We have now moved to ProxyConfig and have also implemented a proxy rotation strategy.
Here's sample code showing how to use it:
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, ProxyConfig, RoundRobinProxyStrategy

async def demo_proxy_rotation():
    """Proxy rotation for multiple requests"""
    print("\n=== Proxy Rotation ===")
    # Example proxies (replace with real ones)
    proxies = [
        ProxyConfig(server="http://proxy1.example.com:8080"),
        ProxyConfig(server="http://proxy2.example.com:8080"),
    ]
    proxy_strategy = RoundRobinProxyStrategy(proxies)
    print(f"Using {len(proxies)} proxies in rotation")
    print("Note: This example uses placeholder proxies - replace with real ones to test")

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            proxy_rotation_strategy=proxy_strategy
        )
        # In a real scenario, crawler.arun(..., config=config) calls would be made here
        # and the proxies would rotate across requests.
        print("In a real scenario, requests would rotate through the available proxies")

asyncio.run(demo_proxy_rotation())
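For an authenticated proxy like the ScrapeOps endpoint in this issue, ProxyConfig can also carry credentials. A minimal sketch, assuming proxy_config is accepted directly on CrawlerRunConfig (the endpoint and password below are the placeholders from this issue, not working values):

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, ProxyConfig

async def main():
    proxy = ProxyConfig(
        server="http://residential-proxy.scrapeops.io:8181",  # placeholder endpoint
        username="scrapeops",
        password="SCRAPE_OPS_PASSWORD",  # placeholder credential
    )
    run_config = CrawlerRunConfig(proxy_config=proxy)  # assumed parameter on CrawlerRunConfig
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.airbnb.com/s/San-Diego/homes",
            config=run_config,
        )
        print(result.success)

asyncio.run(main())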
Closing this issue since it seems to stem from proxy timeouts caused by heavy network requests, or possibly from the proxy's IP being flagged as a bot by the API server.
Closing this bug, but we'll consider changes to our future roadmap as Unclecode suggested.