
How to modify page_timeout in crawler.arun_many mode

Open · 1933211129 opened this issue 11 months ago · 10 comments

Hi @unclecode, I have been using crawl4ai for a while and I am excited about every update. Thank you for your contributions!

Issue #436 says page_timeout does not work for crawler.arun_many. I want to shorten page_timeout in arun_many mode, but whether I pass a config or modify the parameters in the source files async_crawler_strategy.py or config.py, it never takes effect in arun_many mode. I wanted to make the timeout shorter, but right now I can't. Looking forward to your reply!

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def extract_urls_and_descriptions(url_list: list):
    """
    ηˆ¬ε–ε€šδΈͺURLηš„ε†…ιƒ¨ι“ΎζŽ₯ε’ŒζθΏ°δΏ‘ζ―γ€‚
    """
    results = {}
    index = 1 

    async with AsyncWebCrawler(verbose=False) as crawler:
        
        try:
            config = CrawlerRunConfig(
                page_timeout=5000
            )
            crawled_results = await crawler.arun_many(
                urls=url_list,
                config=config
            )

            # ε€„η†η»“ζžœ
            for result in crawled_results:
                if result.success:  
                    for category in ['internal']:  
                        for link in result.links.get(category, []):
                            link_url = link.get('href')
                            description = link.get('text', "")

                            
                            if link_url and link_url.startswith(("http://", "https://")):
                                results[index] = {link_url: description}  
                                index += 1  

        except Exception as e:
            print(f"ηˆ¬ε–ε‡Ίι”™: {e}\n")

    return results

async def main():
    url_list = [
        "http://www.people.com.cn/",
        "http://www.xinhuanet.com/",
        "https://news.sina.com.cn/",
        "https://news.qq.com/",
        "https://www.ccdi.gov.cn/",
    ]
    results = await extract_urls_and_descriptions(url_list)
    print(results)

asyncio.run(main())

× Unexpected error in _crawl_web at line 1205 in _crawl_web
  (../usr/local/lib/python3.10/dist-packages/crawl4ai/async_crawler_strategy.py):
  Error: Failed on navigating ACS-GOTO:
  Page.goto: Timeout 60000ms exceeded.
  Call log:
  - navigating to "https://www.ccdi.gov.cn/", waiting until "domcontentloaded"

  Code context:
  1200
  1201     response = await page.goto(
  1202         url, wait_until=config.wait_until, timeout=config.page_timeout
  1203     )
  1204 except Error as e:
  1205 →   raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
  1206
  1207 await self.execute_hook("after_goto", page, context=context, url=url, response=response)
  1208
  1209 if response is None:
  1210     status_code = 200

No matter how I modify the 'page_timeout' parameter, it always reports 'Page.goto: Timeout 60000ms exceeded.'

1933211129 · Jan 15 '25

I pass in a lot of links at once to get their internal links, but some of the links in the middle may fail to load. I want to skip those as soon as possible rather than wait a long time, but the current default of 60 seconds is too long for me, and I can't adjust it right now. 😭

1933211129 · Jan 15 '25
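
A possible user-side workaround until the new release, sketched below: call arun() for each URL yourself and wrap every call in asyncio.wait_for, which enforces a hard deadline in Python regardless of whether page_timeout is honored. The 10-second ceiling and the helper names are illustrative, not part of crawl4ai.

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

HARD_DEADLINE_S = 10  # illustrative hard ceiling per page, in seconds

async def crawl_with_deadline(crawler, url, config):
    # Abandon the page after HARD_DEADLINE_S even if page_timeout is ignored.
    try:
        return await asyncio.wait_for(
            crawler.arun(url=url, config=config), timeout=HARD_DEADLINE_S
        )
    except asyncio.TimeoutError:
        print(f"Skipped (deadline exceeded): {url}")
        return None

async def crawl_all(url_list):
    config = CrawlerRunConfig(page_timeout=5000)  # still requested, in milliseconds
    async with AsyncWebCrawler(verbose=False) as crawler:
        results = await asyncio.gather(
            *(crawl_with_deadline(crawler, url, config) for url in url_list)
        )
    return [r for r in results if r is not None]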

@1933211129 Hello again, I have very good news for you. Tomorrow, I will drop a new version, and arun_many() has changed drastically. I made tons of optimizations for much faster and better parallel crawling. I tested it, and I will release it as a beta, so perhaps you can help test and debug it and provide your feedback. I will record a video to explain it.

Regarding setting the timeout, there shouldn't be any problem; look at the code below:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, BrowserConfig

async def main():
    config = BrowserConfig(
        headless=True,
    )

    async with AsyncWebCrawler(config=config) as crawler:
        crawl_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
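            # page_timeout is in milliseconds; 1 ms is deliberately far too small,
            # so the crawl fails immediately and proves the setting is applied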
            page_timeout=1,
        )
        result = await crawler.arun(
            url="https://crawl4ai.com",
            config=crawl_config
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

Look at the error message:

[INIT].... → Crawl4AI 0.4.248
[ERROR]... × https://crawl4ai.com... | Error:
│ × Unexpected error in _crawl_web at line 1260 in _crawl_web (crawl4ai/async_crawler_strategy.py):
│   Error: Failed on navigating ACS-GOTO:
│   Page.goto: Timeout 1ms exceeded.
│   Call log:

As you can see, it says Page.goto: Timeout 1ms exceeded.

Let me know if you have any problems with it. Anyway, wait for the new version.

unclecode · Jan 15 '25

Yes, just as in the code you wrote above, the page_timeout setting works for crawler.arun, but it doesn't take effect for crawler.arun_many.

1933211129 · Jan 15 '25
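
For comparison, a sketch of the equivalent arun_many() call with the same config object; whether page_timeout is actually honored here is exactly what this issue is about, and the URLs below are placeholders.

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, BrowserConfig

async def main():
    browser_config = BrowserConfig(headless=True)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        page_timeout=5000,  # 5 seconds per page, the value the reporter wants respected
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        results = await crawler.arun_many(
            urls=["https://crawl4ai.com", "https://example.com"],
            config=run_config,
        )
        for result in results:
            print(result.url, result.success)

if __name__ == "__main__":
    asyncio.run(main())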

Additionally, I have another issue to report, related to result.markdown_v2.fit_markdown and result.links.

In the current version 0.4.24, this function doesn't seem to work effectively for the links I tested previously; it returns raw_markdown. However, in version 0.4.21, fit_markdown was able to return very clean results, which is quite strange.

The same issue also appears with result.links. In versions prior to 0.4.x, it worked fine for retrieving the URL of the link. In version 0.4.1, however, it returned empty results without any errors. After upgrading to 0.4.24, it started working normally again.

This makes it a bit frustrating for my application development. To get cleaner markdown, I have to use version 0.4.21, but to get more stable results for result.links, I have to upgrade to version 0.4.24. This is very strange, and I've tested it multiple times; it doesn't seem to be an issue with my network environment.

1933211129 · Jan 15 '25
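
A defensive pattern while the two versions behave differently, assuming the attribute names mentioned above (result.markdown_v2 with fit_markdown/raw_markdown, plus plain result.markdown); this is a sketch, not an official API.

def best_markdown(result):
    # Prefer the cleaned fit_markdown when it is non-empty; otherwise fall back
    # to raw_markdown or the plain markdown so a regression never yields "".
    md = getattr(result, "markdown_v2", None)
    if md is not None and getattr(md, "fit_markdown", None):
        return md.fit_markdown
    if md is not None and getattr(md, "raw_markdown", None):
        return md.raw_markdown
    return getattr(result, "markdown", "") or ""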

@1933211129 Sorry to hear that. Can you share the link for this one? Perhaps I can test it before releasing the new version.

unclecode · Jan 16 '25

@unclecode http://www.las.cas.cn/

This link produces different results in version 0.4.0 and version 0.4.2, even when using the same code. This happens with both fit_markdown and links, specifically in arun_many mode.

I noticed that many others have also mentioned this issue in other threads. There seems to be a problem with the extraction of fit_markdown in arun_many mode.

By the way, do you have any updates on when the new version of arun_many() will be released? Looking forward to it!

1933211129 · Jan 16 '25

@1933211129 I checked the link, and I am making sure there is no hidden bug between the two versions. I confirm that I will release the new version; my hope is to do so before the weekend.

unclecode · Jan 16 '25

@1933211129 In the meantime, check this: https://docs.crawl4ai.com/advanced/multi-url-crawling/

unclecode · Jan 16 '25

Please check this file and let me know if this is the expected result you need. This is the dumped version of the crawl result.

result.json

unclecode · Jan 16 '25

Yes, this is the content of the webpage, but fit_markdown in arun_many mode isn't functioning as intended, and this issue occurs with other links as well. Therefore, on version 0.4.24x I'm resorting to raw_markdown so that many links don't consistently come back empty.

1933211129 · Jan 16 '25

@1933211129 Please check this fit_markdown and let me know: is this what you used to have?

markdown.md

unclecode · Jan 17 '25

@unclecode I apologize for only seeing your reply now. The results are fantastic, and there's no noise at all. Regarding the bug I previously reported with arun_many, I've temporarily adopted the solution suggested in #461, explicitly calling the filter_content function to clean up the content. I'm really looking forward to the new version on Monday! Thank you once again! 😊

1933211129 · Jan 18 '25
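
A rough sketch of that #461-style workaround, assuming the PruningContentFilter class and its filter_content method from crawl4ai.content_filter_strategy (the exact call may differ; see #461). The filter returns cleaned HTML blocks, which can then be converted to markdown separately.

from crawl4ai.content_filter_strategy import PruningContentFilter

def clean_blocks(result):
    # Run the pruning filter by hand on the crawled HTML instead of relying on
    # fit_markdown being populated in arun_many mode.
    content_filter = PruningContentFilter()
    blocks = content_filter.filter_content(result.html)  # assumed signature; see #461
    return "\n\n".join(blocks)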

@1933211129 Glad to hear that. I'll release it by Monday :)

unclecode · Jan 18 '25