[Bug]: Crawling has become worse for some URLs compared to previous versions
crawl4ai version
0.5.0
Expected Behavior
@unclecode When I crawl a website, it should give me all pages, not just the home page.
Current Behavior
Some websites that earlier versions (e.g., 0.3.73) were able to crawl are no longer working on the new version of Crawl4AI.
Can you please investigate this?
Is this reproducible?
Yes
Inputs Causing the Bug
https://buildfaith.org
Steps to Reproduce
Just run the crawler to extract all URLs from the site; it produces only the home page.
Code snippets
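A minimal repro sketch (not from the original report; it runs a single arun pass and inspects the links the crawler discovered):

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://buildfaith.org",
            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
        )
        # expected: internal links for the whole site; observed: only the home page
        print(len(result.links.get("internal", [])))

asyncio.run(main())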
OS
macOS
Python version
3.11
Browser
Chrome
Browser version
No response
Error logs & Screenshots (if applicable)
No response
Also, I found that for "https://arxiv.org/list/cs/recent",
v0.4.247 works well, but v0.4.248 cannot crawl all the content on the page.
@vikaskookna @bigbrother666sh Thanks for sharing, let me check. @aravindkarnam, can you work on it? I'm still not clear on what "it returns only homepage" means.
It means the crawler returns the content of just one page, the home page, rather than crawling all of the website's URLs in this particular case.
@vikaskookna I tried with older versions, and the site in question still did not work. It's failing because a bot-detection mechanism is blocking us and returning a notice in the HTML instead of the actual page. Most likely the website implemented this mechanism recently, as we haven't changed any part of the code related to scraping.
We are working on a workaround, though; it's currently being tested. I'll update this thread with a solution when it's released.
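In the meantime, a rough way to probe the block from your side (a sketch, not the workaround under test; magic, simulate_user, and override_navigator are existing CrawlerRunConfig flags):

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def probe():
    # turn on the anti-bot-detection flags and inspect what the site actually returns
    config = CrawlerRunConfig(magic=True, simulate_user=True, override_navigator=True)
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        result = await crawler.arun(url="https://buildfaith.org", config=config)
        print(result.html[:300])  # a bot notice shows up as a short interstitial page

asyncio.run(probe())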
Hello @aravindkarnam, what about "https://arxiv.org/list/cs/recent"?
@bigbrother666sh For "https://arxiv.org/list/cs/recent", even the latest version worked (it fetched the raw HTML). Maybe some specific functionality (markdown generation, extraction, etc.) is not working as expected. Could you elaborate on "cannot crawl all content in the page"? What content is missing, and is it missing from the raw HTML or from the markdown?
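For example, a quick check along these lines (illustrative, not from the thread) can localize where the content is lost:

import asyncio
from crawl4ai import AsyncWebCrawler

async def check():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://arxiv.org/list/cs/recent")
        print("html length:", len(result.html))               # raw page as fetched
        print("markdown length:", len(str(result.markdown)))  # after scraping + markdown generation
        # full html but short markdown points at the scraping/markdown stage

asyncio.run(check())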
hi @aravindkarnam
sorry for the late reply
here's what I got as the raw_html:
"html": "<!DOCTYPE html><html lang=\"en\"><head>\n <meta charset=\"utf-8\">\n<meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\n<!-- new favicon config and versions by realfavicongenerator.net -->\n<link rel=\"apple-touch-icon\" sizes=\"180x180\" href=\"https://static.arxiv.org/static/base/1.0.0a5/images/icons/apple-touch-icon.png\">\n<link rel=\"icon\" type=\"image/png\" sizes=\"32x32\" href=\"https://static.arxiv.org/static/base/1.0.0a5/images/icons/favicon-32x32.png\">\n<link rel=\"icon\" type=\"image/png\" sizes=\"16x16\" href=\"https://static.arxiv.org/static/base/1.0.0a5/images/icons/favicon-16x16.png\">\n<link rel=\"manifest\" href=\"https://static.arxiv.org/static/base/1.0.0a5/images/icons/site.webmanifest\">\n<link rel=\"mask-icon\" href=\"https://static.arxiv.org/static/base/1.0.0a5/images/icons/safari-pinned-tab.svg\" color=\"#b31b1b\">\n<link rel=\"shortcut icon\" href=\"https://static.arxiv.org/static/base/1.0.0a5/images/icons/favicon.ico\">\n<meta name=\"msapplication-TileColor\" content=\"#b31b1b\">\n<meta name=\"msapplication-config\" content=\"images/icons/browserconfig.xml\">\n<meta name=\"theme-color\" content=\"#b31b1b\">\n<!-- end favicon config -->\n<title>Search | arXiv e-print repository</title>\n<script defer=\"\" src=\"https://static.arxiv.org/static/base/1.0.0a5/fontawesome-free-5.11.2-web/js/all.js\"></script>\n<link rel=\"stylesheet\" href=\"https://static.arxiv.org/static/base/1.0.0a5/css/arxivstyle.css\">\n<script type=\"text/x-mathjax-config\">\n MathJax.Hub.Config({\n messageStyle: \"none\",\n extensions: [\"tex2jax.js\"],\n jax: [\"input/TeX\", \"output/HTML-CSS\"],\n tex2jax: {\n inlineMath: [ ['$','$'], [\"\\\\(\",\"\\\\)\"] ],\n displayMath: [ ['$$','$$'], [\"\\\\[\",\"\\\\]\"] ],\n processEscapes: true,\n ignoreClass: '.*',\n processClass: 'mathjax.*'\n },\n TeX: {\n extensions: [\"AMSmath.js\", \"AMSsymbols.js\", \"noErrors.js\"],\n noErrors: {\n inlineDelimiters: [\"$\",\"$\"],\n multiLine: false,\n style: {\n \"font-size\": \"normal\",\n \"border\": \"\"\n }\n }\n },\n \"HTML-CSS\": { availableFonts: [\"TeX\"] }\n });\n</script>\n<script src=\"//static.arxiv.org/MathJax-2.7.3/MathJax.js\"></script></head></html>",
Obviously this is much less than the content of the page itself.
In addition, I noticed an error during the crawling stage:
[[ERROR]... × Error updating image dimensions: Page.evaluate: Execution context was destroyed, most likely because of a navigation]
I found this happens in both 0.4.248 and 0.5.0.
@bigbrother666sh Can you share your code snippet as well? The following seems to indicate there's some config-related issue causing the error:
[[ERROR]... × Error updating image dimensions: Page.evaluate: Execution context was destroyed, most likely because of a navigation]
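For context, that Playwright error means the page navigated (or reloaded) while a Page.evaluate call was still in flight. If it is config-related, waiting for a later lifecycle event before any JS evaluation might avoid the race (a guess, untested here):

from crawl4ai import CrawlerRunConfig

# sketch: wait for DOM readiness instead of navigation commit
crawler_config = CrawlerRunConfig(
    wait_until="domcontentloaded",  # instead of "commit"
    scan_full_page=True,
    scroll_delay=0.5,
)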
here it is:
config
import logging

# imports assume crawl4ai's top-level exports; adjust to your installed version
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CacheMode,
    CrawlerRunConfig,
    DefaultMarkdownGenerator,
    LXMLWebScrapingStrategy,
)

_logger = logging.getLogger(__name__)

md_generator = DefaultMarkdownGenerator(
    options={
        "skip_internal_links": True,
        "escape_html": True,
        "include_sup_sub": True,
    }
)
crawler_config = CrawlerRunConfig(
# session_id="my_session123",
delay_before_return_html=1.0,
word_count_threshold=10,
# keep_data_attributes=True,
scraping_strategy=LXMLWebScrapingStrategy(),
excluded_tags=['script', 'style'],
# exclude_domains=[],
# disable_cache=True,
markdown_generator=md_generator,
    wait_until='commit',  # 'commit' fires at the earliest navigation event, before the DOM is ready
# simulate_user=True,
magic=True,
scan_full_page=True,
scroll_delay=0.5,
)
browser_cfg = BrowserConfig(
# browser_type="chromium",
# headless=True,
viewport_width=1920,
viewport_height=1080,
# proxy="http://user:pass@proxy:8080",
# use_managed_browser=True,
# If you need authentication storage or repeated sessions, consider use_persistent_context=True and specify user_data_dir.
# use_persistent_context=True, # must be used with use_managed_browser=True
# user_data_dir="/tmp/crawl4ai_chromium_profile",
# java_script_enabled=True,
# cookies=[]
# headers={}
user_agent_mode="random",
light_mode=True,
extra_args=["--disable-gpu", "--disable-extensions"]
)
usage
# working_list, existing_urls, and sites are defined elsewhere in the application
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()
while working_list:
    url = working_list.pop()
    existing_urls.add(url)
    # reuse one config object, switching the cache mode per URL
    crawler_config.cache_mode = CacheMode.WRITE_ONLY if url in sites else CacheMode.ENABLED
    try:
        result = await crawler.arun(url=url, config=crawler_config)
    except Exception as e:
        _logger.error(e)
        continue
    ....
await crawler.close()
@aravindkarnam
Hello @vikaskookna @bigbrother666sh, could you update to our latest release, v0.7.7, and let us know if you are still facing this issue?
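For reference, upgrading is a one-liner:

pip install -U crawl4ai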