[Bug]: When body is hidden (e.g., in `<frame>`-based sites), AsyncPlaywrightCrawlerStrategy attribute error on 'config' (at `_crawl_web`)

Open sanghoho opened this issue 10 months ago • 5 comments

crawl4ai version

0.4.248

Expected Behavior

Thank you for providing such an excellent open-source crawling library! I hope this detailed bug report is helpful in improving crawl4ai's robustness and handling of diverse website structures. I'm happy to contribute further or test any proposed solutions.

When crawling a website where the <body> element is hidden, particularly in sites that primarily use <frame> elements instead of a traditional <body> structure, crawl4ai should either:

  1. Gracefully handle the absence of a visible <body> and potentially attempt to extract content from the available frames.
  2. Raise a more informative exception that directly indicates the issue (e.g., "Body element not found or hidden. Consider sites with structure.").
  3. Ensure that the error-handling logic inside _crawl_web does not itself crash.
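
As an illustration of the first option, the sketch below gathers HTML from every frame of a page instead of relying on a visible <body>. It is an assumption about how such a fallback might look, not current crawl4ai behavior: `collect_frame_html` is a hypothetical helper, and `page` is only assumed to behave like a Playwright async page exposing `page.frames`, `frame.url`, and `frame.content()`.

```python
from typing import Any, List, Tuple


async def collect_frame_html(page: Any) -> List[Tuple[str, str]]:
    """Return (frame URL, frame HTML) pairs for every readable frame.

    Hypothetical helper: `page` is expected to behave like a Playwright
    async Page, whose `frames` list includes the main frame and children.
    """
    collected: List[Tuple[str, str]] = []
    for frame in page.frames:
        try:
            collected.append((frame.url, await frame.content()))
        except Exception:
            # A frame may detach or fail to load; skip it rather than abort.
            continue
    return collected
```

Returning pairs rather than a dictionary keeps frames that happen to share a URL from overwriting each other.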

Current Behavior

An AttributeError: 'AsyncPlaywrightCrawlerStrategy' object has no attribute 'config' is raised within the _crawl_web function of the AsyncPlaywrightCrawlerStrategy. This occurs after the crawler has already determined that the <body> element is hidden or unavailable, and is triggered during the error handling process itself.

Detailed Analysis of Current Behavior:

Analysis of the error code and source code reveals the following problem:

  • The AsyncPlaywrightCrawlerStrategy's __init__ method correctly sets self.browser_config https://github.com/unclecode/crawl4ai/blob/3b1025abbb6e2565602c05f9a959458da3531f3a/crawl4ai/async_crawler_strategy.py#L850-L864
  • However, in the _crawl_web function, when the code waits for the <body> element and it's not found (leading to an Error), the error handling logic attempts to access self.config, which has not been initialized. This is where the AttributeError is raised.

    https://github.com/unclecode/crawl4ai/blob/3b1025abbb6e2565602c05f9a959458da3531f3a/crawl4ai/async_crawler_strategy.py#L1394-L1402

  • This issue seems to be specific to sites using <frame> structures where the <body> element is hidden or unavailable. The error handling for the missing <body> triggers the problem with self.config. It's likely that this hasn't been reported before because most websites have a visible <body> element.
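
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not crawl4ai's actual code: `StrategySketch`, `BrowserConfigSketch`, `buggy_handler`, and `is_verbose` are hypothetical stand-ins that mirror only the attribute names discussed here (`browser_config` is set in `__init__`, `config` is never set). It shows why the handler raises and one defensive way to read the flag. Whether the verbose flag should really come from `browser_config` or from the per-run config is a decision for the maintainers.

```python
from dataclasses import dataclass


@dataclass
class BrowserConfigSketch:
    """Hypothetical stand-in for a browser config carrying a verbose flag."""
    verbose: bool = False


class StrategySketch:
    """Hypothetical stand-in: only `browser_config` is set, as in the report."""

    def __init__(self, browser_config: BrowserConfigSketch) -> None:
        self.browser_config = browser_config  # note: no `self.config`

    def buggy_handler(self) -> bool:
        # Mirrors the failing line: `self.config` was never assigned.
        return self.config.verbose  # raises AttributeError

    def is_verbose(self) -> bool:
        # Defensive read: use the attribute that __init__ actually sets,
        # and fall back to False if anything is missing.
        config = getattr(self, "browser_config", None)
        return bool(getattr(config, "verbose", False))


strategy = StrategySketch(BrowserConfigSketch(verbose=True))

try:
    strategy.buggy_handler()
except AttributeError as exc:
    error_message = str(exc)  # same shape of message as in the report

verbose = strategy.is_verbose()
```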

Is this reproducible?

Yes

Inputs Causing the Bug

-   **URL(s):**
    -   http://www.ksma.co.kr/ (Korean site demonstrating the issue)
    -   unitednglobal.com (another example of a frame-based site)
-   **Settings used:**
    
    from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
    from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy

    base_browser = BrowserConfig(
        browser_type="chromium",
        headless=False,  # Set to True for headless operation
        # text_mode=True #optional
    )

    run_config = CrawlerRunConfig(
        process_iframes=True,
        cache_mode=CacheMode.BYPASS,
        magic=True,
        simulate_user=True,
        override_navigator=True,
        page_timeout=7000,
        wait_until="networkidle"
    )

Steps to Reproduce

1.  Set up crawl4ai with the provided configuration (or a similar configuration).
2.  Attempt to crawl the URL `http://www.ksma.co.kr/` using the `AsyncWebCrawler`.
    
    async with AsyncWebCrawler(config=base_browser) as crawler:
        result = await crawler.arun(
            url="http://www.ksma.co.kr/",
            config=run_config
        )
    
3.  Observe the `AttributeError` raised within the `_crawl_web` function.

Code snippets


OS

macOS

Python version

3.11.5

Browser

Chromium

Browser version

No response

Error logs & Screenshots (if applicable)

    [INIT].... → Crawl4AI 0.4.248
    [ERROR]... × http://www.ksma.co.kr/... | Error:
    × Unexpected error in _crawl_web at line 1397 in _crawl_web (../../../.pyenv/versions/3.11.5/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py):
    Error: 'AsyncPlaywrightCrawlerStrategy' object has no attribute 'config'

    Code context:
    1392           raise Error(f"Body element is hidden: {visibility_info}")
    1393
    1394   except Error:
    1395       visibility_info = await self.check_visibility(page)
    1396
    1397 →     if self.config.verbose:
    1398           self.logger.debug(
    1399               message="Body visibility info: {info}",
    1400               tag="DEBUG",
    1401               params={"info": visibility_info},
    1402           )

sanghoho avatar Feb 19 '25 09:02 sanghoho

bumping this

cybertheory avatar Mar 03 '25 04:03 cybertheory

I am experiencing this same issue. Happy to share the URLs for testing, if needed!

RyanLynchUF avatar Mar 10 '25 11:03 RyanLynchUF

@sanghoho Thanks for reporting the issue. I'll look into it shortly. @RyanLynchUF Yes that would be helpful in reproducing and testing the bug. Can you share more URLs?

aravindkarnam avatar Mar 12 '25 12:03 aravindkarnam

I had the same error listed in the original post, but I think my issue may have been a little different. It only occurred when using arun_many() on >5 URLs. I think it was a concurrency issue on my end, which prevented the pages from loading properly during the scraping. Everything seems to work fine as long as I use arun() or arun_many() on <5 URLs at a time.
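
The batching workaround described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `chunked` and `crawl_in_batches` are hypothetical helpers, and `crawl_batch` stands for any async callable that takes a list of URLs and returns a list of results. The default batch size of 4 reflects the "fewer than 5 URLs" observation, not a documented limit.

```python
from typing import Any, Awaitable, Callable, List, Sequence


def chunked(items: Sequence[str], size: int) -> List[List[str]]:
    """Split `items` into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


async def crawl_in_batches(
    urls: Sequence[str],
    crawl_batch: Callable[[List[str]], Awaitable[List[Any]]],
    batch_size: int = 4,
) -> List[Any]:
    """Run `crawl_batch` on one small batch at a time and merge the results."""
    results: List[Any] = []
    for batch in chunked(urls, batch_size):
        # Batches run sequentially, so at most `batch_size` URLs are in flight.
        results.extend(await crawl_batch(batch))
    return results
```

With crawl4ai, `crawl_batch` might wrap a call to `crawler.arun_many(...)` for each batch, assuming that call returns a list of results in the configuration being used.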

RyanLynchUF avatar Mar 14 '25 01:03 RyanLynchUF

Shouldn't this be either browser_config or without self.?

I get this error too, but I think it is just a typo in the attribute name, so it doesn't matter how you end up at this error. Once execution reaches this code path it will always fail, because AsyncCrawlerStrategy doesn't have config, only browser_config.

Jevli avatar Mar 15 '25 05:03 Jevli

Hi @aravindkarnam @unclecode, I ran into the same issue: crawling a website throws the same error.

code -

    import asyncio
    from crawl4ai import AsyncWebCrawler
    from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

    async def main():
        # Default browser configuration
        browser_config = BrowserConfig(
            verbose=True,
            java_script_enabled=True,
            browser_type="chromium",
            headless=True,
            viewport_width=1280,
            viewport_height=720,
            user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36",
        )
        # Default crawl run configuration
        run_config = CrawlerRunConfig(check_robots_txt=True, scan_full_page=True)

        async with AsyncWebCrawler(config=browser_config) as crawler:
            result = await crawler.arun(
                url="https://eminds.ai/",
                config=run_config
            )
            print(result.markdown)  # Print clean markdown content

    if __name__ == "__main__":
        asyncio.run(main())

error -

    [INIT].... → Crawl4AI 0.5.0.post4
    [ERROR]... × https://eminds.ai/... | Error:
    × Unexpected error in _crawl_web at line 622 in _crawl_web (../../../../Findly/py3.9.7/lib/python3.9/site-packages/crawl4ai/async_crawler_strategy.py):
    Error: 'AsyncPlaywrightCrawlerStrategy' object has no attribute 'config'

    Code context:
    617           raise Error(f"Body element is hidden: {visibility_info}")
    618
    619   except Error:
    620       visibility_info = await self.check_visibility(page)
    621
    622 →     if self.config.verbose:
    623           self.logger.debug(
    624               message="Body visibility info: {info}",
    625               tag="DEBUG",
    626               params={"info": visibility_info},
    627           )

None

Harinib-Kore avatar Apr 19 '25 08:04 Harinib-Kore

Thanks everyone for reporting this! I've already fixed it in the 2025-APR-1 branch, and it will be included in an upcoming release. In the meantime, feel free to check out the branch and help test it.

I'll go ahead and close this issue, but don't hesitate to continue the conversation here if needed!

cc @aravindkarnam

ntohidi avatar May 08 '25 10:05 ntohidi