
[Bug]: Crawl fails again on a website with anti-bot detection

Open aidenpearce001 opened this issue 9 months ago • 3 comments

crawl4ai version

0.5.0.post4

Expected Behavior

It should be able to scrape the data from the website.

Current Behavior

[ERROR]... × https://www.hifiboehm.de/de/produkt/sonos-sub-4-we... | Error:
× Unexpected error in _crawl_web at line 579 in _crawl_web (.venv/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py):
  Error: Failed on navigating ACS-GOTO:
  Page.goto: Timeout 60000ms exceeded.
  Call log:
  - navigating to "https://www.hifiboehm.de/de/produkt/sonos-sub-4-weiss", waiting until "domcontentloaded"

  Code context:
  574       response = await page.goto(
  575           url, wait_until=config.wait_until, timeout=config.page_timeout
  576       )
  577       redirected_url = page.url
  578   except Error as e:
  579 →     raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
  580
  581   await self.execute_hook(
  582       "after_goto", page, context=context, url=url, response=response, config=config
  583   )
  584

Is this reproducible?

Yes

Inputs Causing the Bug

- URL: https://www.hifiboehm.de/de/produkt/sonos-sub-4-weiss
- Settings used:
+ extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"]
+ headless=True
+ user_agent_mode="random"
+ magic=True

Steps to Reproduce


Code snippets

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig


class Crawl4AIAdapter:
    def __init__(self, headless: bool = True, verbose: bool = True):
        # Set up browser configuration with extra args for stability.
        self.browser_config = BrowserConfig(
            headless=headless,
            verbose=verbose,
            extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
        )
        # Bypass the cache for every crawl (CacheMode.BYPASS, not DISABLED).
        self.crawl_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        # Random Android mobile user agent and magic mode, passed as crawler kwargs.
        self.crawler = AsyncWebCrawler(
            config=self.browser_config,
            user_agent_mode="random",
            user_agent_generator_config={
                "device_type": "mobile",
                "os_type": "android"
            },
            magic=True,
        )
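
To check whether the timeout is transient or a consistent block, the crawl call can be wrapped in a retry with exponential backoff. Below is a minimal sketch using only the standard library. The names `retry_async`, `is_ok`, and `FlakyFetcher` are hypothetical and not part of crawl4ai. Whether `crawler.arun()` raises or returns an unsuccessful result may depend on the version, so the helper handles both cases. It will not bypass anti-bot detection by itself.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


async def retry_async(
    make_call: Callable[[], Awaitable[Any]],
    is_ok: Callable[[Any], bool] = lambda result: True,
    attempts: int = 3,
    base_delay: float = 1.0,
) -> Any:
    """Await make_call() up to `attempts` times, backing off exponentially.

    A try counts as failed if it raises or if is_ok(result) is False.
    """
    last_error: Optional[BaseException] = None
    for attempt in range(1, attempts + 1):
        try:
            result = await make_call()
            if is_ok(result):
                return result
            last_error = RuntimeError(f"unsuccessful result: {result!r}")
        except Exception as exc:  # e.g. the RuntimeError wrapping Page.goto
            last_error = exc
        if attempt < attempts:
            # Wait base_delay, then 2x, 4x, ... before the next try.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"all {attempts} attempts failed") from last_error


class FlakyFetcher:
    """Hypothetical stand-in for a crawl call: fails N times, then succeeds."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    async def fetch(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError("Page.goto: Timeout exceeded.")
        return "<html>ok</html>"


if __name__ == "__main__":
    fetcher = FlakyFetcher(failures=2)
    html = asyncio.run(retry_async(fetcher.fetch, base_delay=0.1))
    print(f"succeeded after {fetcher.calls} calls, got {len(html)} characters")
```

With the adapter above, `make_call` could be something like `lambda: adapter.crawler.arun(url, config=adapter.crawl_config)`, assuming the crawler has already been started. If every attempt fails in the same way, the site is probably blocking the browser rather than responding slowly.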

OS

Ubuntu 22.04

Python version

3.11

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)


aidenpearce001, Mar 06 '25 17:03

I am facing a similar issue. @aidenpearce001, have you found a solution to this?

LEVIII007, Apr 09 '25 09:04

Facing the same issue in crawler.arun().

yash-prabhakar-singh, Jul 18 '25 18:07