
Resume failed browsertrix crawls

Open · benoit74 opened this issue on Nov 26, 2024 · 0 comments

Every now and then, we have very long crawls to perform.

E.g. https://farm.openzim.org/recipes/shamela.ws_ar_al-tafsir-3 has ~500k pages to grab, and https://farm.openzim.org/recipes/ubuntuforums.org_en_all has already discovered ~400k pages.

This poses two challenges to Browsertrix Crawler (warc2zim is always "quite fast"): the duration of the crawl and its stability. To reduce the duration, we usually run multiple workers in parallel (typically 4), but this seems to have a detrimental impact on stability: the crawl often fails with "browser crash", "disconnected", "execution context destroyed", etc.

I think we could enhance zimit to automatically restart the crawl after a failure. I know Browsertrix Cloud is capable of doing this, probably based on https://crawler.docs.browsertrix.com/user-guide/common-options/#saving-crawl-state-interrupting-and-restarting-the-crawl
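
As a rough illustration of what that could look like in zimit (a minimal sketch, not zimit's actual code): launch the crawler with `--saveState` / `--saveStateInterval` so state is checkpointed, and on failure rerun it pointing `--config` at the most recent saved-state YAML. The collection name, output paths, and retry count below are placeholder assumptions, and the seed URLs and other options zimit normally passes are omitted for brevity.

```python
import glob
import subprocess
import sys

# Hypothetical values for illustration; zimit computes the real ones.
COLLECTION = "mycrawl"
CRAWLS_DIR = f"/output/collections/{COLLECTION}/crawls"  # assumed saved-state location
MAX_ATTEMPTS = 3

base_cmd = [
    "crawl",
    "--collection", COLLECTION,
    "--workers", "4",
    "--saveState", "always",       # persist crawl state even on failure
    "--saveStateInterval", "300",  # checkpoint every 5 minutes
]

for attempt in range(1, MAX_ATTEMPTS + 1):
    cmd = list(base_cmd)
    # On a retry, resume from the most recent saved-state YAML if one exists.
    states = sorted(glob.glob(f"{CRAWLS_DIR}/*.yaml"))
    if attempt > 1 and states:
        cmd += ["--config", states[-1]]
    result = subprocess.run(cmd)
    if result.returncode == 0:
        sys.exit(0)
    print(f"crawl attempt {attempt} failed with code {result.returncode}, retrying")

sys.exit(result.returncode)
```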

The most difficult part will of course be knowing when it is "worth it" to restart the crawler.
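
One possible heuristic (again only a sketch; the stats field names below are assumptions to verify against the JSON written by the crawler's `--statsFilename` option): restart only while each attempt crawls new pages and work remains, so a crash that recurs immediately does not trigger an endless restart loop.

```python
import json

def worth_restarting(stats_path: str, crawled_before: int) -> bool:
    """Decide whether restarting the crawl is worth it.

    Assumes the crawler was launched with --statsFilename and that the
    stats JSON exposes 'crawled' and 'total' counters; those field names
    are assumptions to check against the browsertrix-crawler docs.
    """
    try:
        with open(stats_path) as fh:
            stats = json.load(fh)
    except (OSError, json.JSONDecodeError):
        return False  # no stats means no saved progress to resume from
    crawled = stats.get("crawled", 0)
    total = stats.get("total", 0)
    # Restart only if this attempt crawled new pages and work remains.
    return crawled > crawled_before and crawled < total
```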
