Does Crawl4AI already have a function to scrape all related URLs starting from a root URL?
Description: Starting from a root URL, the crawler should navigate to child URLs, scraping content in parallel while collecting links. It could use a queue or a similar mechanism and should accept a maximum-depth parameter. It should return a list of dictionaries, where each element holds the information for one URL (link, content, depth, images, markdown, etc.). A rough sketch of what I mean is shown below.
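For reference, here is a minimal sketch of that kind of breadth-first crawl built on top of crawl4ai's `AsyncWebCrawler`. The `scrape_site` helper, the queue/depth handling, and the shape of the returned dictionaries are my own assumptions, not an existing crawl4ai function; the `result.links`, `result.media`, and `result.markdown` attributes are the ones the library exposes in the 0.4.x releases.

```python
# Hypothetical sketch: breadth-first scraping from a root URL with crawl4ai.
# scrape_site() and its return format are assumptions, not a built-in API.
import asyncio
from urllib.parse import urljoin, urlparse

from crawl4ai import AsyncWebCrawler


async def scrape_site(root_url: str, max_depth: int = 2) -> list[dict]:
    results = []
    seen = {root_url}
    frontier = [root_url]                      # current BFS level
    root_host = urlparse(root_url).netloc

    async with AsyncWebCrawler() as crawler:
        for depth in range(max_depth + 1):
            if not frontier:
                break
            # Crawl the whole level in parallel.
            pages = await asyncio.gather(
                *(crawler.arun(url=url) for url in frontier)
            )
            next_frontier = []
            for url, page in zip(frontier, pages):
                if not page.success:
                    continue
                results.append({
                    "link": url,
                    "depth": depth,
                    "markdown": page.markdown,
                    "images": page.media.get("images", []),
                })
                # Queue unseen same-host links for the next level.
                for link in page.links.get("internal", []):
                    child = urljoin(url, link.get("href", ""))
                    if urlparse(child).netloc == root_host and child not in seen:
                        seen.add(child)
                        next_frontier.append(child)
            frontier = next_frontier
    return results


if __name__ == "__main__":
    pages = asyncio.run(scrape_site("https://example.com", max_depth=1))
    print(f"Scraped {len(pages)} pages")
```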
Hi @QuangTQV. We are currently working on a scraper module that takes a root URL and performs a breadth-first traversal until the configured depth is reached. It's currently under review and testing.
If you are interested in trying this out, you can check out the scraper-uc branch: https://github.com/aravindkarnam/crawl4ai/tree/scraper-uc
There's a quick-start example at https://github.com/aravindkarnam/crawl4ai/blob/scraper-uc/docs/scraper/scraper_quickstart.py
Please see if this meets your requirements. We are working on some performance enhancements, after which this will be released.
@QuangTQV We are working on the same thing, but our current solution based on crawl4ai is relatively slow. Could we discuss implementation ideas further at some point? 😀
@1933211129 @QuangTQV The new version 0.4.3 includes a very strong component for parallel crawling, and it will serve as the core of the "scraper" branch. I plan to merge the scraper branch that @aravindkarnam worked on. In the meantime, check this link for ideas on parallel crawling: https://docs.crawl4ai.com/advanced/multi-url-crawling/
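For anyone reading along, the gist of that page is that you can hand a batch of URLs to `arun_many()` and let the crawler run them concurrently, instead of awaiting `arun()` in a loop. A rough sketch, with placeholder URLs; see the linked docs for the dispatcher and rate-limiting options added in 0.4.3:

```python
# Rough sketch of batched parallel crawling with arun_many().
# The URLs below are placeholders; tune concurrency via the options
# described in the multi-url-crawling docs.
import asyncio

from crawl4ai import AsyncWebCrawler


async def main():
    urls = [
        "https://example.com/page-1",
        "https://example.com/page-2",
        "https://example.com/page-3",
    ]
    async with AsyncWebCrawler() as crawler:
        # arun_many() crawls the whole batch concurrently and
        # returns one result object per URL.
        results = await crawler.arun_many(urls=urls)
        for result in results:
            print(result.url, "ok" if result.success else "failed")


if __name__ == "__main__":
    asyncio.run(main())
```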