feat: Sitemap-based request list implementation
This introduces an alternative RequestList implementation based on sitemaps. It should be possible to use this in tandem with RequestProvider in BasicCrawler, just like with the current RequestList.
In the future, this will make it possible to start crawling before the sitemap is finished loading.
- closes #2313
TODO
- [ ] make sure that the API is usable in actors that require this
- [ ] update examples, docs
How's this looking? Anything we can help with?
Well, I want to add timeouts/cancellation. Also, we need to test if the SitemapRequestList survives migration. So if you'd like to do any of that, that would be super awesome. Otherwise, I think I'll finish it next week, or at the end of this one if I'm very lucky.
Alright, ready for the next round of reviews!
I simplified the parsing logic quite a lot (in my eyes) - in SitemapRequestList, there is now just one queue of parsed URLs and we're just keeping track of the remaining sitemaps to process (this is persisted on migrations).
https://github.com/apify/crawlee/pull/2498/commits/9e4a6208702c23c99a403555558080154b936ae0 adds a new helper method, waitForNextRequest - basically an async generator for fetchNextRequest(). It blocks until a new request has been parsed from the sitemap (or ends once all sitemaps have been loaded and all URLs have been handled). It's implemented with active waiting, which I'm not too happy about; it would be nice to have something better there.
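For anyone who hasn't opened the commit, here is a rough, standalone sketch of the "async generator with active waiting" idea. It is not the code from the PR: `RequestSource`, `isExhausted`, `DelayedSource`, `collectAll` and the 5 ms poll interval are all made up for illustration, and only the name `waitForNextRequest` is borrowed from the comment above.

```typescript
// Hypothetical minimal shape of a request source; NOT the real Crawlee interface.
interface RequestSource<T> {
    fetchNextRequest(): Promise<T | null>;
    // Assumed helper: true once loading is done and nothing is left to fetch.
    isExhausted(): Promise<boolean>;
}

const sleep = (ms: number): Promise<void> => new Promise((resolve) => setTimeout(resolve, ms));

// Yields requests as they become available. When the source is temporarily
// empty but still loading, it polls again after `pollIntervalMillis`
// (this is the "active waiting" part).
async function* waitForNextRequest<T>(
    source: RequestSource<T>,
    pollIntervalMillis = 5,
): AsyncGenerator<T> {
    while (true) {
        const request = await source.fetchNextRequest();
        if (request !== null) {
            yield request;
            continue;
        }
        if (await source.isExhausted()) return;
        await sleep(pollIntervalMillis);
    }
}

// Toy source that "parses" one URL every `delayMillis` in the background.
class DelayedSource implements RequestSource<string> {
    private queue: string[] = [];
    private loading = true;

    constructor(urls: string[], delayMillis: number) {
        void (async () => {
            for (const url of urls) {
                await sleep(delayMillis);
                this.queue.push(url);
            }
            this.loading = false;
        })();
    }

    async fetchNextRequest(): Promise<string | null> {
        return this.queue.shift() ?? null;
    }

    async isExhausted(): Promise<boolean> {
        return !this.loading && this.queue.length === 0;
    }
}

// Consumes the generator until the toy source is exhausted.
async function collectAll(): Promise<string[]> {
    const source = new DelayedSource(
        ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'],
        10,
    );
    const seen: string[] = [];
    for await (const url of waitForNextRequest(source)) {
        seen.push(url);
    }
    return seen;
}
```

A non-polling variant would probably have the loader resolve a pending promise whenever it pushes a URL, but that needs more bookkeeping than fits in a sketch.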
https://github.com/apify/crawlee/pull/2498/commits/b1498a8daf9787cf049709f3d607e1496e0dc687 adds signal and timeoutMillis options - signal expects an AbortSignal (see usage here), timeoutMillis expects a number (see usage here). Both options only apply to the sitemap download / parsing - if either of them is triggered, the user is left with a SitemapRequestList containing incomplete contents of the sitemap. Once you hit isFinished() === true (i.e. you have marked all the requests as handled), you can check isSitemapFullyLoaded(). If it returns false, the timeout / abort has cut the download + parsing short; if it returns true, you have handled all the URLs from the sitemap.
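To make the semantics concrete, here is a small standalone sketch of how a signal plus a timeout can cut loading short while keeping whatever was parsed so far. It does not use the real SitemapRequestList: `PartialUrlList`, `load`, `slowUrls` and the synchronous method signatures are hypothetical, and only the option names and `isFinished` / `isSitemapFullyLoaded` are borrowed from the comment above.

```typescript
// Hypothetical options object mirroring the two option names from the PR.
interface LoadOptions {
    signal?: AbortSignal;
    timeoutMillis?: number;
}

const sleep = (ms: number): Promise<void> => new Promise((resolve) => setTimeout(resolve, ms));

// Stand-in for "download + parse a sitemap": yields one URL every `delayMillis`.
async function* slowUrls(count: number, delayMillis: number): AsyncGenerator<string> {
    for (let i = 0; i < count; i++) {
        await sleep(delayMillis);
        yield `https://example.com/page-${i}`;
    }
}

class PartialUrlList {
    private urls: string[] = [];
    private fullyLoaded = false;

    async load(source: AsyncIterable<string>, { signal, timeoutMillis }: LoadOptions = {}): Promise<void> {
        // Funnel both the external signal and the timeout into one controller.
        const controller = new AbortController();
        const abort = (): void => controller.abort();
        if (signal?.aborted) abort();
        signal?.addEventListener('abort', abort, { once: true });
        const timer = timeoutMillis !== undefined ? setTimeout(abort, timeoutMillis) : undefined;
        try {
            for await (const url of source) {
                // Aborted: stop loading but keep the URLs collected so far.
                if (controller.signal.aborted) return;
                this.urls.push(url);
            }
            this.fullyLoaded = true; // the source ended on its own
        } finally {
            if (timer !== undefined) clearTimeout(timer);
            signal?.removeEventListener('abort', abort);
        }
    }

    // Hands out the next URL; in this toy, fetching counts as handling.
    fetchNextRequest(): string | null {
        return this.urls.shift() ?? null;
    }

    isFinished(): boolean {
        return this.urls.length === 0;
    }

    isSitemapFullyLoaded(): boolean {
        return this.fullyLoaded;
    }
}

// Usage: five URLs would take ~100 ms in total, but the timeout fires after 50 ms.
async function demo(): Promise<{ handled: number; fullyLoaded: boolean }> {
    const list = new PartialUrlList();
    await list.load(slowUrls(5, 20), { timeoutMillis: 50 });
    let handled = 0;
    while (list.fetchNextRequest() !== null) handled++;
    // isFinished() is now true; isSitemapFullyLoaded() says whether that was everything.
    return { handled, fullyLoaded: list.isSitemapFullyLoaded() };
}
```

In this toy the abort is only noticed between items; cancelling an in-flight download would additionally mean passing the signal to whatever performs the HTTP request.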
Alright, time for (yet another) final review!
My previous comment should provide enough guidance for the top-level ideas.