[Feature]: Optionally restarting on browser crashes with exponential backoff
The crawler currently handles browser crashes (or other interruptions) by exiting with a specific error code and assuming that the crawler container will be restarted. This has many advantages, such as ensuring full cleanup, and works well with Kubernetes pod behavior. Since we run the crawler in production only in Kubernetes, we have leaned into this behavior more and more.
However, we understand many users don't want to run the crawler in Kubernetes, or with an external controller or process manager.
For these deployments, having the crawler simply exit with a status code is not ideal, so I'm wondering if a wrapper shell script that does exponential backoff and restarts the node process would provide a good standalone feature. We would then default to running with restartsOnError set to true, making that the standard path.
The exponential backoff script could be something simple; there are many examples, like: https://gist.github.com/nathforge/62456d9b18e276954f58eb61bf234c17
It would need some additional properties to mimic the Kubernetes behavior (see the sketch after this list):
- Reset the backoff if the crawler has been running successfully for some amount of time (e.g. 10 minutes without any exits)
- Don't restart on certain error codes, like time limit reached or out of disk space
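To make this concrete, here is a rough sketch of what such a wrapper could look like, assuming the actual crawl command is passed as arguments to the script. The exit codes in `NO_RESTART_CODES` and the timing values are placeholders, not the crawler's real values:

```bash
#!/bin/bash
# Sketch of an exponential backoff wrapper: restart the wrapped command on
# failure, reset the backoff after a sustained period of successful running,
# and skip restarts for exit codes that should be treated as final.

MIN_DELAY=5               # initial backoff in seconds
MAX_DELAY=300             # cap on the backoff
RESET_AFTER=600           # treat a run of 10+ minutes as "was healthy"
NO_RESTART_CODES="13 14"  # placeholder codes (e.g. time limit, disk full)

delay=$MIN_DELAY

while true; do
  start=$(date +%s)
  "$@"                    # run the actual crawl command
  code=$?

  # Success: nothing to restart
  [ "$code" -eq 0 ] && exit 0

  # Certain exit codes should not trigger a restart at all
  for c in $NO_RESTART_CODES; do
    [ "$code" -eq "$c" ] && exit "$code"
  done

  # If the crawler ran long enough before failing, reset the backoff
  elapsed=$(( $(date +%s) - start ))
  [ "$elapsed" -ge "$RESET_AFTER" ] && delay=$MIN_DELAY

  echo "crawler exited with code $code, restarting in ${delay}s" >&2
  sleep "$delay"

  delay=$(( delay * 2 ))
  [ "$delay" -gt "$MAX_DELAY" ] && delay=$MAX_DELAY
done
```

Usage would then be something like `./restart-wrapper.sh <crawl command>`; again, this is just a sketch of the shape, not a final design.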
As this wouldn't be our production deployment, we would want help from the community in testing this approach, since we won't have a lot of bandwidth for it ourselves, especially for longer-running crawls.
I can see it being helpful for issues such as the one in #927 and especially for openzim/zimit#527
For users running larger-scale crawls with just Browsertrix Crawler (@benoit74, @gitreich, @Mr0grog), would you be willing to help test this type of setup? What do you think of this approach?
What I'm thinking of is essentially having the docker entrypoint script either contain the optional exponential backoff itself, or, after fixing permissions, start another shell script which then starts the crawl node process.
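As a sketch of the second option, the entrypoint change could be as small as the following; `RESTART_ON_ERROR` and `restart-wrapper.sh` are just illustrative names, not existing parts of the image:

```bash
#!/bin/bash
# Illustrative tail end of the docker entrypoint: after the existing
# permission fixes, either hand the crawl command to the backoff wrapper
# or exec it directly, based on a hypothetical opt-out flag.

# ... existing permission fixes ...

if [ "${RESTART_ON_ERROR:-1}" = "1" ]; then
  exec /app/restart-wrapper.sh "$@"
else
  exec "$@"
fi
```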
I'm sorry, but I'm not really sold on the approach, especially regarding https://github.com/openzim/zimit/issues/527; our experience so far is that the problem is not always a browser crash, but rather something harder to detect, like all pages suddenly ending in timeouts. It has also become quite rare for the crawler to exit completely. This is my first uncertainty.
My second uncertainty is that, from my experience, it might be wise to let things cool down for at least a few minutes (hours is generally better), or even to move to a different machine. If all this is handled in a wrapper script then we have two issues: first, we occupy a worker slot just for "waiting a significant time", and second, we cannot move the workload to another machine.
Maybe I'm aiming for the moon, and this wrapper restart could be a first step.
And finally, I'm sure it will be difficult to test on our end; we rarely end up with browser crashes these days, at least in the scrapes I'm monitoring.
Let me think a bit about it again and I will come back to you. Thank you for taking the lead on that topic!
> our experience so far is that the problem is not always a browser crash, but rather something harder to detect, like all pages suddenly ending in timeouts. It has also become quite rare for the crawler to exit completely. This is my first uncertainty.
I think 'all pages timing out continuously' should be treated the same as a crash and should result in an interrupt exit code and restart. We can probably improve detection of this overall, similar to issues with creating a new page. (Rate limiting from the server could be another case where we could perhaps issue an interrupt and expect to wait longer.)
The backoff parameters could probably be made adjustable, though perhaps Zimfarm is not necessarily a good fit for this; you could also do this in a Zimit process and have more control.
This is intended for users who want to avoid having to restart the crawler manually, which is often necessary for larger crawls.
> Let me think a bit about it again and I will come back to you. Thank you for taking the lead on that topic!
Thanks for sharing your perspective @benoit74 ! Yes, I think we'd perhaps want more discussion before committing to supporting this yet.
Hi, we have implemented a semi-automated restart mechanism: the automated part is triggered on exit codes 1, 9, and 10, but after a configurable number of restart attempts (3-5) the crawl is presented to the user (= me), and in most cases some adaptation of the config is necessary, e.g. removing seeds that let the browser crash, removing the sitemap, or changing the scope.

However, in our domain crawl, out of ~17k crawl configs only ~15 failed in a way that was not "repaired" by the automatic restart, and in all of those cases the fix was removing seeds from the original config. (A couple of them you saw here, because they were reproducible in an isolated crawl, but some are not, mostly when they just never end in a successful crawl => Docker exit 9.) So I think a restart alone will not be very helpful in these cases, and it could even make it more difficult to get to the point of what the problem actually is.

By the way, if a regularly scheduled large crawl fails at something between 5 GB and 50 GB, I usually accept the broken crawl for the archive and do not trigger a manual restart (as that would also delete the crawl first), and give it a full try the next time it is scheduled. So from my point of view, distinguishing between Failed (no content, but that is correct) and Failed (error in getting content) would be interesting.

With browser crashes having an extra exit code I am fine; most of the other hard exits I saw were reasonable. Docker exit code 1 I only recognized when the disk gets full.
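For comparison with the backoff sketch above, the bounded-retry behavior described here would look roughly like this; the exit codes and retry limit are the ones mentioned above, and the escalation step is just a placeholder for presenting the crawl to the user:

```bash
#!/bin/bash
# Sketch of the bounded-retry variant described above: restart only on
# exit codes 1, 9, and 10, and after MAX_TRIES failed attempts stop and
# hand the crawl back to a human for config changes (removing seeds, etc.).

MAX_TRIES=3
tries=0

while true; do
  "$@"                    # run the crawl command
  code=$?

  case "$code" in
    1|9|10) ;;            # restartable failures, fall through to retry
    *) exit "$code" ;;    # success (0) or any other code ends the loop
  esac

  tries=$(( tries + 1 ))
  if [ "$tries" -ge "$MAX_TRIES" ]; then
    echo "still failing after $tries restarts, needs manual review" >&2
    exit "$code"
  fi
done
```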
Ha, sounds like I may be an outlier — I think this would be a useful improvement for me, although a small one. Our crawls at EDGI already wrap Browsertrix Crawler with similar retry code (though it does not do exponential backoff) that works fine, although it would be nice to delete it if something with more users and contributors had this built in.
Anyway, I would be happy to use/test if you build it.
> (Rate limiting from the server could be another case where we could perhaps issue an interrupt and expect to wait longer)
That would be a great, big-deal improvement. It probably doesn't need to be implemented (and might even be harder to implement) in terms of starting and stopping the whole crawler, though. Most of my crawls involve multiple hosts, so I wouldn't want to see a rate limit at host A cause host B to get paused, for example (that's why I've had complicated ideas around queuing, like this: https://github.com/webrecorder/browsertrix-crawler/issues/758#issuecomment-2727802474).
> our experience so far is that the problem is not always a browser crash, but rather something harder to detect, like all pages suddenly ending in timeouts. It has also become quite rare for the crawler to exit completely. This is my first uncertainty.
> I think 'all pages timing out continuously' should be treated the same as a crash and should result in an interrupt exit code and restart. We can probably improve detection of this overall, similar to issues with creating a new page.
This is pretty interesting and also sounds like a more important improvement than the actual retry logic. 🙂
For my part, we do see the crawler crash regularly, which is why I wrote our retry code. It has been an extremely effective improvement. BUT I suspect we encounter less crawler-blocking behavior than average, since we are almost always just archiving US government websites. We may be encountering less blocking, or different behavior in general, than @benoit74.