
Does crawl4AI already have a function to scrape all related URLs from a root URL?

Open QuangTQV opened this issue 11 months ago • 3 comments

Description: Starting from a root URL, the function would navigate to child URLs, scraping content in parallel while collecting links. It could use a queue or some other mechanism and should accept a maximum-depth parameter. It would return a list of dictionaries, where each element contains information about one URL (link, content, depth, images, markdown, etc.).
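The queue-plus-max-depth mechanism described above can be sketched in plain Python. This is only an illustration of the traversal logic, not crawl4ai's API: `SITE`, `fetch`, and `crawl_bfs` are all hypothetical stand-ins (a real version would replace `fetch` with an actual scraper call).

```python
from collections import deque

# Hypothetical in-memory "site" standing in for real pages and their links.
SITE = {
    "https://example.com/": {"content": "root page",
                             "links": ["https://example.com/a", "https://example.com/b"]},
    "https://example.com/a": {"content": "page a",
                              "links": ["https://example.com/a/1"]},
    "https://example.com/b": {"content": "page b", "links": []},
    "https://example.com/a/1": {"content": "page a1", "links": []},
}

def fetch(url):
    # Stand-in for a real scrape; returns the page's content and links.
    return SITE.get(url, {"content": "", "links": []})

def crawl_bfs(root_url, max_depth):
    """Breadth-first crawl from root_url, stopping at max_depth.

    Returns a list of dicts with url, depth, content, and links,
    mirroring the result shape described in the issue.
    """
    seen = {root_url}
    queue = deque([(root_url, 0)])
    results = []
    while queue:
        url, depth = queue.popleft()
        page = fetch(url)
        results.append({"url": url, "depth": depth,
                        "content": page["content"], "links": page["links"]})
        if depth < max_depth:
            for link in page["links"]:
                if link not in seen:   # avoid revisiting pages
                    seen.add(link)
                    queue.append((link, depth + 1))
    return results
```

With `max_depth=1`, the crawl visits the root and its direct children but does not descend to `https://example.com/a/1`.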

QuangTQV avatar Jan 16 '25 09:01 QuangTQV

Hi @QuangTQV. We are currently working on a scraper module that takes a root URL and performs a breadth-first traversal until the configured depth is reached. It's currently under review and testing.

If you are interested in trying this out, you can check this PR https://github.com/aravindkarnam/crawl4ai/tree/scraper-uc

There's a quick-start example at https://github.com/aravindkarnam/crawl4ai/blob/scraper-uc/docs/scraper/scraper_quickstart.py

Please check whether this meets your requirements. We are working on some performance enhancements; after that, it will be released.

aravindkarnam avatar Jan 16 '25 11:01 aravindkarnam

@QuangTQV We are working on the same thing, but the current crawl4ai-based solution is relatively slow. Could we discuss implementation ideas further in the future? 😀

1933211129 avatar Jan 16 '25 11:01 1933211129

@1933211129 @QuangTQV The new version 0.4.3 includes a very strong component for parallel crawling, which will serve as the core of the "scraper" branch. I plan to merge the scraper branch that @aravindkarnam worked on. In the meantime, check this link to gather some ideas for parallel crawling: https://docs.crawl4ai.com/advanced/multi-url-crawling/
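As a rough illustration of what "parallel crawling" means here, a common pattern is to fan out fetches with `asyncio.gather` while a semaphore caps concurrency. This is a minimal sketch, not crawl4ai's implementation; `fetch` is a hypothetical stand-in for a real scraper call.

```python
import asyncio

async def fetch(url):
    # Hypothetical async fetch; a real version would call an actual
    # crawler here. This one just yields control and echoes the URL.
    await asyncio.sleep(0)
    return {"url": url, "content": f"content of {url}"}

async def crawl_many(urls, max_concurrency=5):
    """Crawl a batch of URLs in parallel with bounded concurrency."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        # The semaphore ensures at most max_concurrency fetches run at once.
        async with sem:
            return await fetch(url)

    # gather preserves input order in its results list.
    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(
    crawl_many([f"https://example.com/p{i}" for i in range(10)], max_concurrency=3)
)
```

Bounding concurrency this way keeps memory and connection counts predictable while still crawling many URLs at once.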

unclecode avatar Jan 16 '25 12:01 unclecode