feat: Sitemap-based request list implementation

Open janbuchar opened this issue 1 year ago • 3 comments

This introduces an alternative RequestList implementation based on sitemaps. It should be possible to use this in tandem with RequestProvider in BasicCrawler, just like with the current RequestList.

In the future, this will make it possible to start crawling before the sitemap is finished loading.

  • closes #2313

TODO

  • [ ] make sure that the API is usable in actors that require this
  • [ ] update examples, docs

janbuchar, May 24 '24 09:05

How's this looking? Anything we can help with?

barjin, Jun 12 '24 09:06

> How's this looking? Anything we can help with?

Well, I want to add timeouts/cancellation. Also, we need to test if the SitemapRequestList survives migration. So if you'd like to do any of that, that would be super awesome. Otherwise, I think I'll finish it next week, or at the end of this one if I'm very lucky.

janbuchar, Jun 12 '24 11:06

Alright, ready for the next round of reviews!

I simplified the parsing logic quite a lot (in my eyes) - in SitemapRequestList, there is now just one queue of parsed URLs and we're just keeping track of the remaining sitemaps to process (this is persisted on migrations).
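To make the "one queue plus remaining sitemaps" idea concrete, here is a minimal, self-contained sketch of how such state could be kept and persisted across a migration. It is not the code from this PR: the class name SitemapListSketch, the state field names, and the serialize() / restore() helpers are all assumptions made for illustration. Only fetchNextRequest and the notion of marking requests as handled come from the discussion.

```typescript
// Hypothetical persisted state; field names are assumptions, not Crawlee's actual format.
interface SitemapListState {
    urlQueue: string[];           // URLs parsed so far that still need handling
    pendingSitemapUrls: string[]; // sitemaps that still have to be downloaded and parsed
}

class SitemapListSketch {
    private urlQueue: string[] = [];
    private inProgress = new Set<string>();
    private pendingSitemapUrls: string[];

    constructor(sitemapUrls: string[]) {
        this.pendingSitemapUrls = [...sitemapUrls];
    }

    // Called once a sitemap has been parsed: enqueue its URLs and stop tracking it.
    addParsedSitemap(sitemapUrl: string, urls: string[]): void {
        this.urlQueue.push(...urls);
        this.pendingSitemapUrls = this.pendingSitemapUrls.filter((u) => u !== sitemapUrl);
    }

    fetchNextRequest(): string | null {
        const url = this.urlQueue.shift() ?? null;
        if (url !== null) this.inProgress.add(url);
        return url;
    }

    markRequestHandled(url: string): void {
        this.inProgress.delete(url);
    }

    getPendingSitemaps(): string[] {
        return [...this.pendingSitemapUrls];
    }

    // Snapshot for persistence. URLs that were handed out but not yet handled
    // go back to the front of the queue so they are retried after a restore.
    serialize(): string {
        const state: SitemapListState = {
            urlQueue: [...this.inProgress, ...this.urlQueue],
            pendingSitemapUrls: [...this.pendingSitemapUrls],
        };
        return JSON.stringify(state);
    }

    static restore(json: string): SitemapListSketch {
        const state = JSON.parse(json) as SitemapListState;
        const list = new SitemapListSketch(state.pendingSitemapUrls);
        list.urlQueue = [...state.urlQueue];
        return list;
    }
}
```

In this sketch a restored list resumes with the unhandled URLs and only needs to re-download the sitemaps that were still pending at the time of the snapshot.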

https://github.com/apify/crawlee/pull/2498/commits/9e4a6208702c23c99a403555558080154b936ae0 adds a new helper method, waitForNextRequest, which is basically an async generator for fetchNextRequest(). It blocks until a new request is parsed from the sitemap (or ends once all sitemaps have been loaded and all URLs have been handled). It's implemented with active waiting, which I'm not too happy about; it would be nice to have something better there.
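For illustration, an active-waiting generator of this kind could look roughly like the sketch below. It does not use the actual Crawlee classes: the PollableList interface, the pollIntervalMillis parameter, and the sleep helper are assumptions made for this example. Only the fetchNextRequest() and isFinished() method names come from this thread.

```typescript
// Hypothetical minimal shape of a request list (not the actual Crawlee interface).
interface PollableList<T> {
    fetchNextRequest(): Promise<T | null>;
    isFinished(): Promise<boolean>;
}

const sleep = (ms: number): Promise<void> =>
    new Promise((resolve) => setTimeout(resolve, ms));

// Yields requests as they become available, polling while the list is empty.
async function* waitForNextRequest<T>(
    list: PollableList<T>,
    pollIntervalMillis = 50, // assumed default, chosen arbitrarily
): AsyncGenerator<T, void, undefined> {
    while (true) {
        const request = await list.fetchNextRequest();
        if (request !== null) {
            yield request;
            continue;
        }
        // Nothing queued right now: stop if the list is done, otherwise poll again.
        if (await list.isFinished()) return;
        await sleep(pollIntervalMillis);
    }
}
```

A consumer would iterate with `for await (const request of waitForNextRequest(list)) { ... }`. One possible way to avoid the polling would be to have the parser resolve a pending promise whenever it pushes a URL, but that is outside the scope of this sketch.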

https://github.com/apify/crawlee/pull/2498/commits/b1498a8daf9787cf049709f3d607e1496e0dc687 adds signal and timeoutMillis options: signal expects an AbortSignal (see usage here), timeoutMillis expects a number (see usage here). Both options only apply to the sitemap download / parsing. If either of them is triggered, the user is left with a SitemapRequestList containing incomplete contents of the sitemap. Once you hit isFinished() === true (i.e. you have marked all the requests as handled), you can check isSitemapFullyLoaded(): if it returns false, the timeout / abort has cut the download + parsing short; if it returns true, you have handled all the URLs from the sitemap.
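The intended flow can be sketched with a simplified stand-in. The class PartialUrlList, its load() method, and the fake sitemap generator below are hypothetical and exist only for this example. The option names (signal, timeoutMillis) and the isFinished() / isSitemapFullyLoaded() checks mirror the description above, but the behaviour is reduced to a cooperative check between items rather than a real download cancellation.

```typescript
// Hypothetical options and class; names mirror the thread, behaviour is simplified.
interface LoadOptions {
    signal?: AbortSignal;
    timeoutMillis?: number;
}

class PartialUrlList {
    private queue: string[] = [];
    private inProgress = new Set<string>();
    private fullyLoaded = false;

    // Consumes `source` until it ends, the signal aborts, or the timeout elapses.
    async load(source: AsyncIterable<string>, options: LoadOptions = {}): Promise<void> {
        const { signal, timeoutMillis } = options;
        const deadline = timeoutMillis === undefined ? Infinity : Date.now() + timeoutMillis;
        for await (const url of source) {
            // Checked between items only; the URLs loaded so far are kept.
            if (signal?.aborted || Date.now() > deadline) return;
            this.queue.push(url);
        }
        this.fullyLoaded = true;
    }

    fetchNextRequest(): string | null {
        const url = this.queue.shift() ?? null;
        if (url !== null) this.inProgress.add(url);
        return url;
    }

    markRequestHandled(url: string): void {
        this.inProgress.delete(url);
    }

    // Simplified: true once nothing is queued or in progress.
    isFinished(): boolean {
        return this.queue.length === 0 && this.inProgress.size === 0;
    }

    isSitemapFullyLoaded(): boolean {
        return this.fullyLoaded;
    }
}

// Usage: the (fake) sitemap load is aborted after two URLs.
async function demo(): Promise<{ handled: string[]; finished: boolean; fullyLoaded: boolean }> {
    const controller = new AbortController();
    async function* fakeSitemap(): AsyncGenerator<string> {
        yield 'https://example.com/a';
        yield 'https://example.com/b';
        controller.abort(); // simulate the user cancelling mid-download
        yield 'https://example.com/c';
    }

    const list = new PartialUrlList();
    await list.load(fakeSitemap(), { signal: controller.signal, timeoutMillis: 5000 });

    const handled: string[] = [];
    let url: string | null;
    while ((url = list.fetchNextRequest()) !== null) {
        handled.push(url);
        list.markRequestHandled(url);
    }
    // Everything that was loaded is handled; fullyLoaded says whether that was the whole sitemap.
    return { handled, finished: list.isFinished(), fullyLoaded: list.isSitemapFullyLoaded() };
}
```

Here the list finishes with only the first two URLs handled and reports that the sitemap was not fully loaded, which is the signal a caller would use to decide whether to retry or accept the partial crawl.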

barjin, Jun 26 '24 13:06

Alright, time for (yet another) final review!

My previous comment should provide enough guidance for the top-level ideas.

barjin, Jul 03 '24 15:07