How to modify page_timeout in crawler.arun_many mode
Hi @unclecode, I have been using crawl4ai for a while and I am excited about every update. Thank you for your contributions!
#436 (that issue says page_timeout does not work for crawler.arun_many). I now want to shorten page_timeout in 'arun_many' mode, but whether I pass a 'config' or modify the parameters in the source files 'async_crawler_strategy.py' or 'config.py', it never takes effect in 'arun_many' mode. I want to make it shorter, but currently I can't. Looking forward to your reply!
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def extract_urls_and_descriptions(url_list: list):
    """
    Crawl the internal links and their descriptions from multiple URLs.
    """
    results = {}
    index = 1
    async with AsyncWebCrawler(verbose=False) as crawler:
        try:
            config = CrawlerRunConfig(
                page_timeout=5000
            )
            crawled_results = await crawler.arun_many(
                urls=url_list,
                config=config
            )
            # Process the results
            for result in crawled_results:
                if result.success:
                    for category in ['internal']:
                        for link in result.links.get(category, []):
                            link_url = link.get('href')
                            description = link.get('text', "")
                            if link_url and (link_url.startswith("http") or link_url.startswith("https")):
                                results[index] = {link_url: description}
                                index += 1
        except Exception as e:
            print(f"Crawling error: {e}\n")
    return results

async def main():
    url_list = [
        "http://www.people.com.cn/",
        "http://www.xinhuanet.com/",
        "https://news.sina.com.cn/",
        "https://news.qq.com/",
        "https://www.ccdi.gov.cn/"
    ]
    results = await extract_urls_and_descriptions(url_list)
    print(results)

asyncio.run(main())
× Unexpected error in _crawl_web at line 1205 in _crawl_web (../usr/local/lib/python3.10/dist-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 60000ms exceeded.
Call log:
- navigating to "https://www.ccdi.gov.cn/", waiting until "domcontentloaded"

Code context:
1200
1201     response = await page.goto(
1202         url, wait_until=config.wait_until, timeout=config.page_timeout
1203     )
1204 except Error as e:
1205     raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
1206
1207 await self.execute_hook("after_goto", page, context=context, url=url, response=response)
1208
1209 if response is None:
1210     status_code = 200
No matter how I modify the 'page_timeout' parameter, it always reports 'Page.goto: Timeout 60000ms exceeded.'
I pass in a lot of links at once to get their sub-links, but some links in the middle may fail to load. I want to skip them as quickly as possible rather than spending too much time on them. The current default of 60 seconds is too long for me, and right now I can't adjust it.
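In case it helps anyone hitting the same wall: while page_timeout is not honored by arun_many, one stopgap is to call arun per URL and bound each call with an outer asyncio timeout. This is only a rough sketch under that assumption; crawl_with_cap and the 10-second cap are made-up names/values, not crawl4ai APIs.

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_with_cap(crawler, url, config, cap_seconds=10):
    # Hypothetical helper: bound each arun() call with an outer asyncio timeout
    # so a slow page is abandoned after cap_seconds instead of the 60s default.
    try:
        return await asyncio.wait_for(crawler.arun(url=url, config=config), timeout=cap_seconds)
    except asyncio.TimeoutError:
        print(f"Skipped (took longer than {cap_seconds}s): {url}")
        return None

async def main():
    urls = ["http://www.people.com.cn/", "https://www.ccdi.gov.cn/"]
    config = CrawlerRunConfig(page_timeout=5000)
    async with AsyncWebCrawler(verbose=False) as crawler:
        results = await asyncio.gather(*(crawl_with_cap(crawler, u, config) for u in urls))
    for r in results:
        if r is not None and r.success:
            print(r.url, len(r.links.get('internal', [])))

asyncio.run(main())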
@1933211129 Hello again, I have very good news for you. Tomorrow, I will drop a new version, and arun_many() has changed drastically. I made tons of optimizations for much faster and better parallel crawling. I tested it, and I will release it as a beta, so perhaps you can help test and debug it and provide your feedback. I will record a video to explain it.
Regarding setting the timeout, there shouldn't be any problem; look at the code below:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, BrowserConfig

async def main():
    config = BrowserConfig(
        headless=True,
    )
    async with AsyncWebCrawler(config=config) as crawler:
        crawl_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            page_timeout=1,
        )
        result = await crawler.arun(
            url="https://crawl4ai.com",
            config=crawl_config
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
Look at the error message:
[INIT].... → Crawl4AI 0.4.248
[ERROR]... × https://crawl4ai.com... | Error:
× Unexpected error in _crawl_web at line 1260 in _crawl_web (crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 1ms exceeded.
Call log:
As you can see, it says Page.goto: Timeout 1ms exceeded.
Let me know if you have any problem with it. Anyway, wait for the new version.
Yes, just like the code you wrote above, the page_timeout setting works effectively for crawler.arun, but it doesn't take effect for crawler.arun_many.
Additionally, I have another issue to report, which is related to result.markdown_v2.fit_markdown and result.links
In the current version 0.4.24, this function doesn't seem to work for the links I tested previously; it returns raw_markdown. However, in version 0.4.21, fit_markdown was able to return very clean results, which is quite strange.
The same issue also appears with result.links. In versions prior to 0.4.x, it worked fine for retrieving link URLs. In version 0.4.1, however, it returned empty results without any errors. After upgrading to 0.4.24, it started working normally again.
This makes application development a bit frustrating for me. To get cleaner markdown I have to use version 0.4.21, but to get more stable results for result.links I have to upgrade to version 0.4.24. This is very strange, and I've tested it multiple times; it doesn't seem to be an issue with my network environment.
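For reference, this is the kind of setup I would expect to produce fit_markdown in 0.4.2x (a sketch only, assuming the DefaultMarkdownGenerator and PruningContentFilter APIs from the docs; the threshold values are illustrative, not recommendations):

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

async def main():
    # Attach a pruning filter so the generator produces fit_markdown
    # in addition to raw_markdown.
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed")
    )
    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="http://www.las.cas.cn/", config=config)
        if result.success:
            print(result.markdown_v2.fit_markdown[:500])

asyncio.run(main())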
@1933211129 Sorry to hear that. Can you share the link for this one? Perhaps I can test it before releasing the new version.
@unclecode http://www.las.cas.cn/
This link produces different results in version 0.4.0 and version 0.4.2, even when using the same code. This occurs in both fit_markdown and links, specifically in arun_many mode.
I noticed that many others have also mentioned this issue in other threads. There seems to be a problem with the extraction of fit_markdown in arun_many mode.
By the way, do you have any updates on when the new version of arun_many() will be released? Looking forward to it!
@1933211129 I checked the link, and I am making sure there is no hidden bug between the two versions; I confirm that I will release it. My desire is to do so before the weekend.
@1933211129 In the meantime, check this: https://docs.crawl4ai.com/advanced/multi-url-crawling/
Please check this file and let me know if this is the expected result you need. This is the dumped version of the crawl result.
Yes, this is the content of the webpage, but the fit_markdown in arun_many mode isn't functioning as intended, and this issue occurs with other links as well. Therefore, in version 0.4.24x, I'm resorting to using raw_markdown to ensure that many links don't consistently return empty values.
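The fallback I mentioned looks roughly like this (a minimal sketch; the attribute names match the 0.4.2x result object, but treat it as illustrative):

def pick_markdown(result):
    # Prefer fit_markdown when the generator produced it; otherwise fall back
    # to raw_markdown so the entry is never empty.
    md = result.markdown_v2
    fit = getattr(md, "fit_markdown", None)
    return fit if fit else md.raw_markdown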
@1933211129 Please check this fit markdown and let me know: is this what you used to have?
@unclecode I apologize for only seeing your reply now. The results are fantastic, and there's no noise at all. Regarding the bug I previously reported with arun_many, I've temporarily adopted the solution suggested in #461, explicitly calling the filter_content function to clean up the content. I'm really looking forward to the new version on Monday! Thank you once again!
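For anyone else landing here, the manual cleanup looks roughly like this (a sketch only; I am assuming filter_content(html) on PruningContentFilter returns a list of cleaned chunks, as suggested in #461, so treat the exact call and parameters as assumptions):

from crawl4ai.content_filter_strategy import PruningContentFilter

def clean_html(raw_html: str) -> str:
    # Assumed usage per #461: run the pruning filter manually over the page HTML
    # and join the surviving chunks; the signature and thresholds are assumptions.
    content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed")
    chunks = content_filter.filter_content(raw_html)
    return "\n".join(chunks)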
@1933211129 Glad to hear that. I'll release it by Monday :)