Prevent Crawl4AI from Crawling After Link Failure: Only Extract Content
I noticed an issue with Crawl4AI where it initially extracts content from the given links as expected. However, once a link fails, the tool starts crawling the website, which I don't want. The crawling process is slow and significantly increases the load on my PC, which is not ideal.
I would prefer to use Crawl4AI for content extraction only, without triggering any crawling action after a link failure. Is there any way to stop the crawling feature and ensure that the tool only extracts the content, regardless of whether a link fails?
I'm attaching screenshots below to help illustrate the problem:
Before any link fails: (shows expected content extraction)
After a link fails: (shows that crawling starts unexpectedly)
Could you provide guidance on how to disable the crawling feature while keeping the content extraction process intact?
Also, I noticed that when it starts crawling, it creates a new Chromium instance for each link and never closes it once done. Because of that, Chromium processes keep stacking up in memory, and eventually my laptop freezes since it cannot handle 30+ Chromium instances at a time.
If any fix is available, please help. Thank you.
I think you can specify a session_id so the crawler reuses a single browser for the entire run. As per the documentation on session management:
from crawl4ai import AsyncWebCrawler

# Inside an async function:
async with AsyncWebCrawler() as crawler:
    session_id = "my_session"

    # First request
    result1 = await crawler.arun(
        url="https://example.com/page1",
        session_id=session_id
    )

    # Subsequent request using the same session
    result2 = await crawler.arun(
        url="https://example.com/page2",
        session_id=session_id
    )

    # Clean up when done
    await crawler.crawler_strategy.kill_session(session_id)
result1 opens a browser under the session_id "my_session"; result2 then reuses that same browser by passing the same session_id instead of launching a new one.
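Building on that, here is a rough sketch of how you could reuse one session for a whole list of links and guarantee cleanup even when a link fails. This is untested; the extract_only wrapper is just for illustration, and I'm assuming the result object exposes success, markdown, and error_message as described in the docs:

import asyncio
from crawl4ai import AsyncWebCrawler

async def extract_only(urls):
    # Reuse one browser session for every URL instead of spawning a new Chromium per link.
    session_id = "my_session"
    results = []
    async with AsyncWebCrawler() as crawler:
        try:
            for url in urls:
                result = await crawler.arun(url=url, session_id=session_id)
                if result.success:
                    results.append(result.markdown)
                else:
                    # A failed link is simply reported and skipped.
                    print(f"Failed to extract {url}: {result.error_message}")
        finally:
            # Always close the session so Chromium processes don't pile up.
            await crawler.crawler_strategy.kill_session(session_id)
    return results

asyncio.run(extract_only(["https://example.com/page1", "https://example.com/page2"]))

Putting kill_session in a finally block means the browser is closed even if one of the URLs raises, which should help with the stacking Chromium processes you described.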
As for the "crawling action", I am not quite sure. You could check out always_by_pass_cache; it may be related to the issue.
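For reference, this is roughly where that flag would go, if your version exposes it on the constructor. This is only a sketch; I'm assuming always_by_pass_cache is an AsyncWebCrawler constructor argument, so please check it against your installed version and the docs:

from crawl4ai import AsyncWebCrawler

async def check_cache_behavior():
    # Assumption: always_by_pass_cache is accepted by the constructor in your version.
    # False (the default) keeps caching enabled; True forces every arun() to refetch.
    async with AsyncWebCrawler(always_by_pass_cache=False) as crawler:
        result = await crawler.arun(url="https://example.com/page1")
        print(result.markdown[:200] if result.success else result.error_message)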
Okay, I will try the session_id. As for always_by_pass_cache, it is already set to False, which I think is correct.
@Pranshu172 Thanks for trying Crawl4AI. Regarding the first case, could you share a code snippet? I'm having trouble understanding the situation where passing multiple URLs (using run_many, I assume) causes the engine to crawl the wrong website when one of them fails, so I need a bit more detail. Could you share the code and setup you're using? As for the second point, as @pttodv mentioned, you can use a session ID to maintain the session. Let me know whether these suggestions work for you. I'm currently improving the strategy for crawling multiple links and would appreciate feedback on any issues; the new strategy is significantly smarter than the current one.
@Pranshu172 Closing this issue due to inactivity. Please open a new issue if the problem still exists.