
[Bug]: Documentation example fails (Crawling a Local HTML File)

Open ppetroskevicius opened this issue 7 months ago • 2 comments

crawl4ai version

0.6.3

Expected Behavior

Examples in the official documentation should be up to date and working. The example below should work:

import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig


async def crawl_local_file():
  local_file_path = "/home/user/file.html"
  file_url = f"file://{local_file_path}"
  config = CrawlerRunConfig(bypass_cache=True)

  async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url=file_url, config=config)
    if result.success:
      print("Markdown Content from Local File:")
      print(result.markdown)
    else:
      print(f"Failed to crawl local file: {result.error_message}")


asyncio.run(crawl_local_file())

Current Behavior

The example from documentation fails with error:

❯ python crawling_local_html_file.py
Traceback (most recent call last):
  File "/home/user/crawling_local_html_file.py", line 21, in <module>
    asyncio.run(crawl_local_file())
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/user/crawling_local_html_file.py", line 10, in crawl_local_file
    config = CrawlerRunConfig(bypass_cache=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.venv/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 993, in __init__
    self.bypass_cache = bypass_cache
    ^^^^^^^^^^^^^^^^^
  File "/home/user/.venv/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 1101, in __setattr__
    raise AttributeError(f"Setting '{name}' is deprecated. {self._UNWANTED_PROPS[name]}")
AttributeError: Setting 'bypass_cache' is deprecated. Instead, use cache_mode=CacheMode.BYPASS

Is this reproducible?

Yes

Inputs Causing the Bug

- none

Steps to Reproduce

Run the example from documentation.

Code snippets


OS

Ubuntu 24.04

Python version

3.12.3

Browser

google-chrome

Browser version

No response

Error logs & Screenshots (if applicable)

Please see above.

ppetroskevicius avatar May 18 '25 01:05 ppetroskevicius

Changing the deprecated parameter as shown below does not work either.

import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import CrawlerRunConfig


async def crawl_local_file():
  local_file_path = "/home/user/file.html"
  file_url = f"file://{local_file_path}"
  config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)  # Changed here

  async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url=file_url, config=config)
    if result.success:
      print("Markdown Content from Local File:")
      print(result.markdown)
    else:
      print(f"Failed to crawl local file: {result.error_message}")


asyncio.run(crawl_local_file())
❯ python crawling_local_html_file.py
[INIT].... → Crawl4AI 0.6.3
[ERROR]... × file:///home/user/file.html  | Error: Unexpected error in _crawl_web at line 466 in crawl
(.venv/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: cannot access local variable 'captured_console' where it is not associated with a value

Code context:
 461                   html=html,
 462                   response_headers=response_headers,
 463                   status_code=status_code,
 464                   screenshot=screenshot_data,
 465                   get_delayed_content=None,
 466 →                 console_messages=captured_console,
 467               )
 468
 469           elif url.startswith("raw:") or url.startswith("raw://"):
 470               # Process raw HTML content
 471               raw_html = url[4:] if url[:4] == "raw:" else url[7:]
Failed to crawl local file: Unexpected error in _crawl_web at line 466 in crawl (.venv/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: cannot access local variable 'captured_console' where it is not associated with a value

Code context:
 461                   html=html,
 462                   response_headers=response_headers,
 463                   status_code=status_code,
 464                   screenshot=screenshot_data,
 465                   get_delayed_content=None,
 466 →                 console_messages=captured_console,
 467               )
 468
 469           elif url.startswith("raw:") or url.startswith("raw://"):
 470               # Process raw HTML content
 471               raw_html = url[4:] if url[:4] == "raw:" else url[7:]

ppetroskevicius avatar May 18 '25 01:05 ppetroskevicius

Hi @ppetroskevicius

Thank you for drawing our attention to the outdated documentation. We do need to set cache_mode=CacheMode.BYPASS.

Regarding the error you've shared, it has already been fixed in the 2025-MAY-2 branch and will be included in our next release.

ntohidi avatar May 19 '25 15:05 ntohidi

@ntohidi I'm assuming this fix would also solve the same issue occurring for local markdown files?

I was able to get it working with CacheMode.BYPASS and the raw format for markdown input, by loading the content via the OS and passing the raw content to the crawler.

Lachlan-White avatar Jun 25 '25 13:06 Lachlan-White
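The raw-content workaround described above can be sketched as follows. This is a minimal sketch, not the library's documented API: the helper name `to_raw_url` is made up for illustration, and the `raw:` URL scheme is the one that appears in the crawler's own code context in the error log above.

```python
from pathlib import Path


def to_raw_url(path: str) -> str:
    """Read a local file and prefix its contents with the raw: scheme,
    so the crawler parses the string directly instead of opening file://.
    (Hypothetical helper; the raw: prefix is what crawl4ai strips off
    in the code context shown in the traceback above.)"""
    content = Path(path).read_text(encoding="utf-8")
    return f"raw:{content}"


# Assumed usage with the crawler (requires crawl4ai and a browser,
# so it is left commented out here):
#
#   import asyncio
#   from crawl4ai import AsyncWebCrawler, CacheMode
#   from crawl4ai.async_configs import CrawlerRunConfig
#
#   async def main():
#       config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
#       async with AsyncWebCrawler() as crawler:
#           result = await crawler.arun(url=to_raw_url("file.md"), config=config)
#           print(result.markdown if result.success else result.error_message)
#
#   asyncio.run(main())
```

Since the content never goes through the file:// branch, this sidesteps the `captured_console` bug entirely, at the cost of reading the file yourself.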

If anyone is still experiencing this problem, you can simply set capture_console_messages=True for now.

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig, CacheMode


async def crawl_local_file():
    local_file_path = "<local_file_path>"  # Replace with your file path
    file_url = f"file://{local_file_path}"
    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS, capture_console_messages=True # see here
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=file_url, config=config)
        if result.success:
            print("Markdown Content from Local File:")
            print(result.markdown)
        else:
            print(f"Failed to crawl local file: {result.error_message}")


asyncio.run(crawl_local_file())

abab-dev avatar Jul 01 '25 05:07 abab-dev

@Lachlan-White, what is the issue occurring with local markdown files? Can you please give me more details? :)

ntohidi avatar Jul 03 '25 08:07 ntohidi

@abab-dev I’m not sure if I understand your problem correctly, but if you’re having trouble getting your code to work, you can provide the absolute path to your file.

import asyncio
import os

from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import CrawlerRunConfig


async def crawl_local_file_with_workaround():
    # Convert the relative file path to an absolute path
    absolute_path = os.path.abspath("output.html")  # Adjust this path as needed
    file_url = f"file://{absolute_path}"
    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        capture_console_messages=True,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=file_url, config=config)
        if result.success:
            print("\n--- Markdown Content from Local File ---")
            print(result.markdown.raw_markdown)

            print("\n--- Captured Console Messages ---")
            if result.console_messages:
                for msg in result.console_messages:
                    print(f"[{msg['type'].upper()}]: {msg['text']}")
            else:
                print("No console messages were captured.")
        else:
            print(f"Failed to crawl local file: {result.error_message}")


asyncio.run(crawl_local_file_with_workaround())

ntohidi avatar Jul 03 '25 09:07 ntohidi

@ntohidi I was trying to be helpful to everyone. Instead of

@ntohidi I'm assuming this fix would also solve the same issue occurring for local markdown files?

I was able to get it working with CacheMode.BYPASS and the raw format for markdown input, by loading the content via the OS and passing the raw content to the crawler.

doing this, you can just pass the flag capture_console_messages=True as I have shown and skip the workaround.

abab-dev avatar Jul 03 '25 11:07 abab-dev

Raw markdown wasn't an option in the documentation, @ntohidi, so I just flagged it as raw HTML and it worked just fine :)

Lachlan-White avatar Jul 05 '25 03:07 Lachlan-White

@ntohidi I was trying to be helpful to everyone. Instead of

@ntohidi I'm assuming this fix would also solve the same issue occurring for local markdown files? I was able to get it working with CacheMode.BYPASS and the raw format for markdown input, by loading the content via the OS and passing the raw content to the crawler.

doing this, you can just pass the flag capture_console_messages=True as I have shown and skip the workaround.

Oooh I understand now! Thank you for your help; really appreciate it! 💜

ntohidi avatar Jul 08 '25 10:07 ntohidi