
[Bug]: Crawling has become worse for some URLs compared to previous versions

Open vikaskookna opened this issue 10 months ago • 10 comments

crawl4ai version

0.5.0

Expected Behavior

@unclecode When I crawl a website, it should give me all pages, not just the home page.

Current Behavior

Some websites that earlier versions such as 0.3.73 were able to crawl are no longer working on the new version of Crawl4AI.

Can you please investigate this?

Is this reproducible?

Yes

Inputs Causing the Bug

https://buildfaith.org

Steps to Reproduce

Just run the crawler to extract all URLs from the site; it will produce the home page only.

Code snippets


OS

macOS

Python version

3.11

Browser

Chrome

Browser version

No response

Error logs & Screenshots (if applicable)

No response

vikaskookna avatar Mar 08 '25 20:03 vikaskookna

I also found an affected URL: "https://arxiv.org/list/cs/recent".

v0.4.247 works well, but v0.4.248 cannot crawl all of the content on the page.

bigbrother666sh avatar Mar 09 '25 02:03 bigbrother666sh

@vikaskookna @bigbrother666sh Thanks for sharing; let me check. @aravindkarnam, can you work on it? I'm still not clear on what "it returns only homepage" means.

unclecode avatar Mar 09 '25 03:03 unclecode

It means the crawler returns the content of just one page (the home page) rather than crawling all of the website's URLs in this particular case.

vikaskookna avatar Mar 09 '25 06:03 vikaskookna

@vikaskookna I tried with older versions, and the site in question still did not work. It's failing because a bot-detection mechanism is blocking us and returning a notice in the HTML instead of the actual page. Most likely the website recently implemented this mechanism, as we haven't changed any part of the scraping code.

We are working on a workaround for this, though. It's currently being tested. I'll update this thread with a solution when that is released.
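In the meantime, it can help to detect when `result.html` is a bot-challenge notice rather than the real page. The following is only a rough heuristic sketch (the marker phrases and the length threshold are assumptions for illustration, not anything crawl4ai ships):

```python
import re

# Assumed marker phrases commonly seen on bot-challenge interstitials.
CHALLENGE_MARKERS = re.compile(
    r"(verify you are (?:a )?human|checking your browser|captcha|access denied)",
    re.IGNORECASE,
)

def looks_like_bot_block(html: str, min_length: int = 2000) -> bool:
    """Return True when the fetched HTML is likely a bot-detection notice
    rather than the real page: it contains a typical challenge phrase or
    is suspiciously short."""
    if CHALLENGE_MARKERS.search(html):
        return True
    return len(html) < min_length

# A stubby challenge page is flagged; a long normal page is not.
print(looks_like_bot_block("<html><body>Checking your browser before accessing...</body></html>"))  # True
print(looks_like_bot_block("<html><body>" + "real content " * 500 + "</body></html>"))  # False
```

A check like this lets a crawl loop skip or retry blocked URLs instead of silently storing the challenge notice as page content.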

aravindkarnam avatar Mar 12 '25 07:03 aravindkarnam

hello @aravindkarnam how about "https://arxiv.org/list/cs/recent"?

bigbrother666sh avatar Mar 12 '25 08:03 bigbrother666sh

@bigbrother666sh For "https://arxiv.org/list/cs/recent", even the latest version worked (it fetched the raw HTML). Perhaps some specific functionality (markdown generation, extraction, etc.) is not working as expected. Could you elaborate on "cannot crawl all content on the page"? What content is missing, and is it missing from the raw HTML or the markdown?

aravindkarnam avatar Mar 12 '25 10:03 aravindkarnam

hi @aravindkarnam

sorry for the late reply

here's what I got as the raw_html:

"html": "<!DOCTYPE html><html lang=\"en\"><head>\n <meta charset=\"utf-8\">\n<meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\n<!-- new favicon config and versions by realfavicongenerator.net -->\n<link rel=\"apple-touch-icon\" sizes=\"180x180\" href=\"https://static.arxiv.org/static/base/1.0.0a5/images/icons/apple-touch-icon.png\">\n<link rel=\"icon\" type=\"image/png\" sizes=\"32x32\" href=\"https://static.arxiv.org/static/base/1.0.0a5/images/icons/favicon-32x32.png\">\n<link rel=\"icon\" type=\"image/png\" sizes=\"16x16\" href=\"https://static.arxiv.org/static/base/1.0.0a5/images/icons/favicon-16x16.png\">\n<link rel=\"manifest\" href=\"https://static.arxiv.org/static/base/1.0.0a5/images/icons/site.webmanifest\">\n<link rel=\"mask-icon\" href=\"https://static.arxiv.org/static/base/1.0.0a5/images/icons/safari-pinned-tab.svg\" color=\"#b31b1b\">\n<link rel=\"shortcut icon\" href=\"https://static.arxiv.org/static/base/1.0.0a5/images/icons/favicon.ico\">\n<meta name=\"msapplication-TileColor\" content=\"#b31b1b\">\n<meta name=\"msapplication-config\" content=\"images/icons/browserconfig.xml\">\n<meta name=\"theme-color\" content=\"#b31b1b\">\n<!-- end favicon config -->\n<title>Search | arXiv e-print repository</title>\n<script defer=\"\" src=\"https://static.arxiv.org/static/base/1.0.0a5/fontawesome-free-5.11.2-web/js/all.js\"></script>\n<link rel=\"stylesheet\" href=\"https://static.arxiv.org/static/base/1.0.0a5/css/arxivstyle.css\">\n<script type=\"text/x-mathjax-config\">\n MathJax.Hub.Config({\n messageStyle: \"none\",\n extensions: [\"tex2jax.js\"],\n jax: [\"input/TeX\", \"output/HTML-CSS\"],\n tex2jax: {\n inlineMath: [ ['$','$'], [\"\\\\(\",\"\\\\)\"] ],\n displayMath: [ ['$$','$$'], [\"\\\\[\",\"\\\\]\"] ],\n processEscapes: true,\n ignoreClass: '.*',\n processClass: 'mathjax.*'\n },\n TeX: {\n extensions: [\"AMSmath.js\", \"AMSsymbols.js\", \"noErrors.js\"],\n noErrors: {\n inlineDelimiters: [\"$\",\"$\"],\n multiLine: false,\n 
style: {\n \"font-size\": \"normal\",\n \"border\": \"\"\n }\n }\n },\n \"HTML-CSS\": { availableFonts: [\"TeX\"] }\n });\n</script>\n<script src=\"//static.arxiv.org/MathJax-2.7.3/MathJax.js\"></script></head></html>",

Obviously this is much less than the content of the page itself.
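Notably, the dump above contains only the `<head>` and then closes `</html>` with no `<body>` at all, which would explain why the extracted content is nearly empty. A small stdlib diagnostic (not part of crawl4ai, just a sketch) can confirm that kind of truncation:

```python
from html.parser import HTMLParser

class BodyProbe(HTMLParser):
    """Minimal probe: records whether a <body> tag was ever opened."""
    def __init__(self):
        super().__init__()
        self.has_body = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.has_body = True

def html_has_body(html: str) -> bool:
    probe = BodyProbe()
    probe.feed(html)
    return probe.has_body

# Shape of the truncated arXiv response reported above: head only, no body.
truncated = ('<!DOCTYPE html><html lang="en"><head>'
             '<title>Search | arXiv e-print repository</title></head></html>')
print(html_has_body(truncated))  # False
print(html_has_body("<html><body>actual listing content</body></html>"))  # True
```

Running this on `result.html` distinguishes "the page was fetched but content extraction failed" from "the browser never received the page body in the first place".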

[attachment: e529eb.json]

[screenshot]

In addition, I noticed that there was an error during the crawling stage:

[[ERROR]... × Error updating image dimensions: Page.evaluate: Execution context was destroyed, most likely because of a navigation]

I found this phenomenon in both 0.4.248 and 0.5.0.

bigbrother666sh avatar Mar 16 '25 11:03 bigbrother666sh

@bigbrother666sh Can you share your code snippet as well? The following error seems to indicate that something config-related is causing it.

[[ERROR]... × Error updating image dimensions: Page.evaluate: Execution context was destroyed, most likely because of a navigation]
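As a stopgap while the underlying navigation race is investigated, wrapping the crawl call in a simple retry loop can absorb transient "Execution context was destroyed" failures. A hedged sketch in plain asyncio; `flaky_fetch` below is a stand-in for something like `lambda: crawler.arun(url=url, config=crawler_config)`:

```python
import asyncio

async def arun_with_retries(coro_factory, retries: int = 3, delay: float = 0.1):
    """Call coro_factory() up to `retries` times with a linear backoff
    between attempts; re-raise the last error if every attempt fails."""
    last_exc = None
    for attempt in range(retries):
        try:
            return await coro_factory()
        except Exception as exc:
            last_exc = exc
            await asyncio.sleep(delay * (attempt + 1))
    raise last_exc

# Demo with a stand-in coroutine that fails twice, then succeeds.
attempts = {"n": 0}

async def flaky_fetch():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("Execution context was destroyed")
    return "page html"

result = asyncio.run(arun_with_retries(flaky_fetch))
print(result)  # page html
```

In a real crawl loop this would replace the bare `await crawler.arun(...)` inside the `try` block, so a single mid-navigation failure does not drop the URL entirely.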

aravindkarnam avatar Mar 17 '25 14:03 aravindkarnam

here it is:

config

# Imports assumed from crawl4ai's top-level exports (not in the original snippet)
from crawl4ai import (
    AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode,
    DefaultMarkdownGenerator, LXMLWebScrapingStrategy,
)

md_generator = DefaultMarkdownGenerator(
    options={
        "skip_internal_links": True,
        "escape_html": True,
        "include_sup_sub": True
    }
)

crawler_config = CrawlerRunConfig(
    # session_id="my_session123",
    delay_before_return_html=1.0,
    word_count_threshold=10,
    # keep_data_attributes=True,
    scraping_strategy=LXMLWebScrapingStrategy(),
    excluded_tags=['script', 'style'],
    # exclude_domains=[],
    # disable_cache=True,
    markdown_generator=md_generator, 
    wait_until='commit', 
    # simulate_user=True,
    magic=True, 
    scan_full_page=True,
    scroll_delay=0.5,
)

browser_cfg = BrowserConfig(
    # browser_type="chromium",
    # headless=True,
    viewport_width=1920,
    viewport_height=1080,
    # proxy="http://user:pass@proxy:8080",
    # use_managed_browser=True,
    # If you need authentication storage or repeated sessions, consider use_persistent_context=True and specify user_data_dir.
    # use_persistent_context=True, # must be used with use_managed_browser=True
    # user_data_dir="/tmp/crawl4ai_chromium_profile",
    # java_script_enabled=True,
    # cookies=[]
    # headers={}
    user_agent_mode="random",
    light_mode=True,
    extra_args=["--disable-gpu", "--disable-extensions"]
)

usage

crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()
while working_list:
    url = working_list.pop()
    existing_urls.add(url)
    crawler_config.cache_mode = CacheMode.WRITE_ONLY if url in sites else CacheMode.ENABLED
    try:
        result = await crawler.arun(url=url, config=crawler_config)
    except Exception as e:
        _logger.error(e)
        continue
    ....

await crawler.close()


@aravindkarnam

bigbrother666sh avatar Mar 18 '25 14:03 bigbrother666sh

Hello @vikaskookna @bigbrother666sh, could you update to our latest release, v0.7.7, and let us know if you are still facing this issue?

Ahmed-Tawfik94 avatar Nov 21 '25 04:11 Ahmed-Tawfik94