crawl4ai This way takes way too long and won't work. Can we make it more efficient?

import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode


async def main():
    async with AsyncWebCrawler(
        headless=True,  # Set to False to see what is happening
        verbose=True,
        # New feature...
        user_agent_mode="random",
        # Optional...
        user_agent_generator_config={"device_type": "mobile", "os_type": "android"},
    ) as crawler:
        result = await crawler.arun(
            url="https://pixelscan.net/",
            cache_mode=CacheMode.BYPASS,
            html2text={"ignore_links": True},
            delay_before_return_html=2,
            screenshot=True,
        )

        if result.success:
            print(len(result.markdown_v2.raw_markdown))


asyncio.run(main())

Dec 04 '24 08:12 ihoment-lys

@ihoment-lys Please take a look at your log file: most of the time goes to fetching the URL, which is outside Crawl4AI's control; that is simply how long the server takes to respond. That said, I applied some optimizations, and with the new version 0.4.1 I ran the same URL on my own machine. Here is the result:

[INIT].... → Crawl4AI 0.4.1
[FETCH]... ↓ https://pixelscan.net/... | Status: True | Time: 3.15s
[SCRAPE].. ◆ Processed https://pixelscan.net/... | Time: 38ms
[COMPLETE] ● https://pixelscan.net/... | Status: True | Total: 3.19s
2868
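
To make the point about fetch time concrete, the log lines above can be parsed to see how much of the total each stage takes. This is only a sketch: `parse_timings` and the regular expression are hypothetical helpers written for the log format shown in this thread, not part of Crawl4AI.

```python
import re

# Log lines copied from the run above.
LOG = """\
[FETCH]... ↓ https://pixelscan.net/... | Status: True | Time: 3.15s
[SCRAPE].. ◆ Processed https://pixelscan.net/... | Time: 38ms
[COMPLETE] ● https://pixelscan.net/... | Status: True | Total: 3.19s
"""

# Matches "[STAGE] ... | Time: 3.15s" or "... | Total: 3.19s"; the unit is "s" or "ms".
LINE_RE = re.compile(r"^\[(\w+)\].*\|\s*(?:Time|Total):\s*([\d.]+)(ms|s)\s*$")


def parse_timings(log: str) -> dict[str, float]:
    """Return {stage: seconds} for every log line that carries a timing."""
    timings = {}
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if match:
            stage, value, unit = match.groups()
            timings[stage] = float(value) / 1000 if unit == "ms" else float(value)
    return timings


timings = parse_timings(LOG)
fetch_share = timings["FETCH"] / timings["COMPLETE"]
print(f"fetch: {fetch_share:.0%} of total")  # prints "fetch: 99% of total"
```

With these numbers, scraping accounts for 38 ms of a 3.19 s run, so almost all of the time is spent waiting for the page to load.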

Dec 09 '24 09:12 unclecode

@ihoment-lys Please take a look at your log file; you can see that most of the time goes to fetching the URL, which is outside Crawl4ai's control; that is the time the server needs to respond. However, I have applied some optimizations, and with the new version 0.4.1, I ran the same URL on my own machine and am sharing the result with you.
[INIT].... → Crawl4AI 0.4.1
[FETCH]... ↓ https://pixelscan.net/... | Status: True | Time: 3.15s
[SCRAPE].. ◆ Processed https://pixelscan.net/... | Time: 38ms
[COMPLETE] ● https://pixelscan.net/... | Status: True | Total: 3.19s
2868

I’m also experiencing slowness. When running outside Docker, the speed is very fast, but inside the Docker environment, it becomes extremely slow. I’m working on a chatbot.

Dec 23 '24 03:12 QuangTQV

@QuangTQV There is an issue with the Docker version, and I am working on it. I explained a few things in the issue linked below, and I am redesigning how the browser works inside Docker. I found an unconventional approach that differs from what others do but works better. In the meantime, there are a few important things you need to consider, especially how you instantiate the AsyncWebCrawler class. I can close this issue and continue in the one below so that we are all on the same page.

https://github.com/unclecode/crawl4ai/issues/361
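
The comment above does not spell out what to watch for when instantiating AsyncWebCrawler; the linked issue is the place for that. One plausible reading, and it is only an assumption here, is that starting a new browser for every request is costly, so a long-running service such as a chatbot may do better with one crawler reused across requests. A minimal sketch of that pattern, where `crawl_all` is a hypothetical helper and not part of Crawl4AI:

```python
import asyncio


async def crawl_all(crawler, urls):
    """Crawl every URL with one already-open crawler instead of one crawler per URL."""
    results = []
    for url in urls:
        # arun() is the same call used in the snippet at the top of this thread.
        results.append(await crawler.arun(url=url))
    return results


async def main():
    # Imported here so crawl_all itself has no hard dependency on crawl4ai.
    from crawl4ai import AsyncWebCrawler

    # The browser is opened once on entry and closed once on exit.
    async with AsyncWebCrawler(headless=True) as crawler:
        results = await crawl_all(crawler, ["https://pixelscan.net/"])
        for result in results:
            print(result.success)


# To try it: asyncio.run(main())
```

Whether this helps inside Docker is not confirmed anywhere in this thread; it only avoids paying the browser start-up cost more than once.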

Dec 25 '24 11:12 unclecode