
Prevent Crawl4AI from Crawling After Link Failure – Only Extract Content

Open Pranshu172 opened this issue 1 year ago • 4 comments

I noticed an issue with Crawl4AI where it initially extracts content from the given links as expected. However, once a link fails, the tool starts crawling the website, which I don’t want. The crawling process is slow and significantly increases the load on my PC, which is not ideal.

I would prefer to use Crawl4AI for content extraction only, without triggering any crawling action after a link failure. Is there any way to stop the crawling feature and ensure that the tool only extracts the content, regardless of whether a link fails?

I’m attaching screenshots below to help illustrate the problem:

Before any link fails: (shows expected content extraction)

After a link fails: (shows that crawling starts unexpectedly)

Could you provide guidance on how to disable the crawling feature while keeping the content extraction process intact?

Pranshu172 avatar Nov 07 '24 07:11 Pranshu172

I also noticed that when it starts crawling, it creates a new Chromium instance for each link and never closes it once that link is done. Because of that, Chromium instances keep stacking up in memory until my laptop eventually freezes, since it can't handle 30+ of them at a time.

If any fix is available, please help. Thank you.
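
For reference, a minimal sketch of one way to keep extraction to a single browser (assuming the standard AsyncWebCrawler async context manager, which is expected to shut the browser down when the block exits; the URLs below are placeholders):

import asyncio

from crawl4ai import AsyncWebCrawler

# Placeholder list of pages to extract content from.
URLS = [
    "https://example.com/page1",
    "https://example.com/page2",
]

async def main():
    # One crawler instance is reused for every URL, and the context manager
    # should close the underlying browser on exit, so Chromium processes do
    # not accumulate even when individual links fail.
    async with AsyncWebCrawler() as crawler:
        for url in URLS:
            result = await crawler.arun(url=url)
            if result.success:
                print(url, "->", str(result.markdown)[:80])
            else:
                print(url, "failed:", result.error_message)

asyncio.run(main())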

Pranshu172 avatar Nov 07 '24 09:11 Pranshu172


I think you can specify a session_id so the crawler uses only one browser for the entire run. As per the documentation on session management:

import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        session_id = "my_session"

        # First request opens a browser tied to this session_id
        result1 = await crawler.arun(
            url="https://example.com/page1",
            session_id=session_id
        )

        # Subsequent request reuses the same session/browser
        result2 = await crawler.arun(
            url="https://example.com/page2",
            session_id=session_id
        )

        # Clean up when done
        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(main())

result1 opens a browser with a session_id of my_session, and result2 reuses result1's browser through the same session_id. As for the unwanted "crawling action", I am not quite sure; you could check out always_by_pass_cache, which may be related to the issue.
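
Building on that snippet, a small variation (same assumed API as above; URLs are placeholders) that loops over several links and always kills the session in a finally block, so a failed link only gets reported instead of interrupting the cleanup:

import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    session_id = "my_session"
    async with AsyncWebCrawler() as crawler:
        try:
            for url in ("https://example.com/page1", "https://example.com/page2"):
                result = await crawler.arun(url=url, session_id=session_id)
                if not result.success:
                    # Just report the failure; do not trigger any further work.
                    print(url, "failed:", result.error_message)
        finally:
            # Runs even if a request errors out, so the session is always
            # cleaned up explicitly.
            await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(main())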

pttodv avatar Nov 07 '24 09:11 pttodv

Okay, I will try the session_id. As for always_by_pass_cache, it is already set to False, which I think is correct.

Pranshu172 avatar Nov 07 '24 10:11 Pranshu172

@Pranshu172 Thanks for trying Crawl4AI. Regarding the first case, can you share a code snippet? I'm having trouble understanding how passing multiple URLs (to arun_many, I assume) causes the engine to crawl the wrong website when one of them fails, so I need a bit more detail on your setup. Could you share the code you're using? As for the second point, as @pttodv mentioned, you can use a session ID to maintain the session. Let me know whether these suggestions work for you. I'm currently improving the strategy for crawling multiple links and would appreciate feedback on any issues; the new strategy is significantly smarter than the current one.

unclecode avatar Nov 12 '24 05:11 unclecode

@Pranshu172 Closing this issue due to inactivity. Please open a new issue if the problem still exists.

aravindkarnam avatar Jan 23 '25 06:01 aravindkarnam