browsertrix-crawler
Crawl never exits
Launched a run via zimit using 0.7.0.beta.1 and the crawl process never exited.
Running browsertrix-crawler crawl: crawl --newContext page --waitUntil load,networkidle0 --depth -1 --timeout 90 --behaviors autoplay,autofetch,siteSpecific --behaviorTimeout 90 --sizeLimit 4294967296 --timeLimit 7200 --url https://www.abc.com.py/ --userAgentSuffix +Zimit [email protected] --cwd /output/.tmpzbb8lh9v --statsFilename /output/crawl.json
Time threshold reached 7272.276 > 7200, stopping
== Start: 2022-07-10 12:12:49.180
== Now: 2022-07-10 14:13:59.737 (running for 2.0 hours)
== Progress: 204 / 204 (100.00%), errors: 0 (0.00%)
== Remaining: 0.0 ms (@ 0.03 pages/second)
== Sys. load: 21.8% CPU / 22.4% memory
== Workers: 1
#0 IDLE
Saving crawl state to: /output/.tmpzbb8lh9v/collections/crawl-20220710121247300/crawls/crawl-20220710141359-7facd8c44598.yaml
Waiting to ensure pending data is written to WARCs...
It's been close to 24h now… This isn't the first time I've seen this behavior. What could cause this to happen?
It looks like it got to the 'Waiting to ensure pending data is written to WARCs...' step, so that check is probably stalling somehow. The above was the end of the output, right? The crawl state is stored in Redis, and when pywb logging is enabled, Redis is also logged to ./logs/redis.log.
Perhaps there is some info that can be gleaned from that?
Will try to take a look.
Yes, that was the end of the output; as for the other logs, I no longer have access to those. Note that I haven't seen it again since this report, but it's infrequent anyway.
In mitmproxy, for example, which also has to wait for pending requests, IPC signals are handled: upon receiving SIGINT, a shutdown event is triggered. This releases a wait call, which in turn continues execution of the Done hooks, allowing all running mitmproxy addons to write their data to disk.
Currently browsertrix-crawler spawns puppeteer but tells it not to handle IPC signals. This may be because browsertrix-crawler wants to handle SIGINT itself in a graceful manner. But graceful here means waiting for the pending requests, and while it is waiting for those requests it won't respect the timeout.
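To illustrate the mitmproxy-style pattern described above, here is a minimal Node sketch (not browsertrix-crawler's actual code; `flushPendingWarcs` and `withDeadline` are hypothetical names) where an interrupt waits for pending work, but only up to a hard deadline, so shutdown can never stall forever:

```javascript
// Hypothetical stand-in for "wait for pending WARC writes to finish".
function flushPendingWarcs() {
  return new Promise((resolve) => setTimeout(resolve, 100));
}

// Race pending work against a hard deadline; resolves to "drained"
// if the work finishes in time, otherwise "timed-out".
function withDeadline(pendingWork, deadlineMs) {
  const deadline = new Promise((resolve) =>
    setTimeout(() => resolve("timed-out"), deadlineMs)
  );
  return Promise.race([pendingWork.then(() => "drained"), deadline]);
}

process.once("SIGINT", async () => {
  // On interrupt, wait for pending writes, but at most 5 seconds.
  const result = await withDeadline(flushPendingWarcs(), 5000);
  console.log(`shutdown: ${result}`);
  process.exit(result === "drained" ? 0 : 1);
});
```

With this shape, the "graceful" path and the timeout are no longer in conflict: draining is attempted first, and the deadline guarantees exit either way.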
I have a suggestion. When the timeout is reached or a SIGINT is received:
- browsertrix-crawler kills its puppeteer instances
- browsertrix-crawler cancels or marks as done all pending requests (because whatever data was in transit at that point has already been intercepted by pywb)
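The second step could look roughly like the sketch below (hypothetical names, not browsertrix-crawler's API): keep a registry of in-flight requests, and on abrupt shutdown cancel everything still pending instead of waiting for it:

```javascript
// Hypothetical tracker for in-flight page requests. On an abrupt
// shutdown, abortAll() cancels whatever is still pending; pywb has
// already captured any bytes that made it over the wire, so dropping
// the remainder is assumed safe.
class PendingTracker {
  constructor() {
    this.pending = new Map(); // url -> cancel callback
  }
  add(url, cancelFn) {
    this.pending.set(url, cancelFn);
  }
  settle(url) {
    this.pending.delete(url); // request finished normally
  }
  abortAll() {
    const count = this.pending.size;
    for (const cancel of this.pending.values()) cancel();
    this.pending.clear();
    return count; // how many requests were cancelled
  }
}
```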
This seems a lot like what @ikreymer is describing in webrecorder/browsertrix-cloud#298
I was reading this snippet of code. It doesn't seem to be part of any method; I wonder when it's run, maybe on class instantiation?
https://github.com/webrecorder/browsertrix-crawler/blob/e22d95e2f07ef8a4cd3a4c309ee9ca0d6bab559e/crawler.js#L993-L995
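For what it's worth, in CommonJS any statement at module top level (outside a class or function body) runs once, when the module is first require()'d, not when a class from it is instantiated. A tiny illustration (generic example, not that snippet):

```javascript
// Top-level statements run exactly once, at require() time.
let moduleLoadCount = 0;
moduleLoadCount++; // executed when the module is loaded

class Crawler {
  constructor() {
    // This, by contrast, runs on each `new Crawler()`.
    this.created = true;
  }
}
```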
The puppeteer Cluster and Redis are started with double the timeout that was passed to browsertrix-crawler:
https://github.com/webrecorder/browsertrix-crawler/blob/e22d95e2f07ef8a4cd3a4c309ee9ca0d6bab559e/crawler.js#L504
https://github.com/webrecorder/browsertrix-crawler/blob/e22d95e2f07ef8a4cd3a4c309ee9ca0d6bab559e/crawler.js#L173
The 'Time threshold reached' message is emitted here:
https://github.com/webrecorder/browsertrix-crawler/blob/e22d95e2f07ef8a4cd3a4c309ee9ca0d6bab559e/crawler.js#L456
and it sets interrupt to true, which in turn leads to this.crawlState.setDrain(true). I think this might be a good place for browsertrix-crawler to cancel the pending resources and close the puppeteer Cluster.
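A minimal sketch of that drain mechanism, using hypothetical classes (not the crawler's real ones): the time-limit check flips the interrupt, puts the crawl state into drain mode so no new pages are claimed, and that same spot is where in-flight work could also be aborted:

```javascript
// Hypothetical model of the drain flag: once draining, workers stop
// claiming new pages from the queue.
class CrawlState {
  constructor() {
    this.drain = false;
    this.queue = ["page-a", "page-b"];
  }
  setDrain(value) {
    this.drain = value;
  }
  nextPage() {
    return this.drain ? null : this.queue.shift() ?? null;
  }
}

// Mirrors the time-threshold check: past the limit, request an
// interrupt and stop handing out new pages.
function checkTimeLimit(state, elapsedSecs, timeLimitSecs) {
  if (elapsedSecs > timeLimitSecs) {
    state.setDrain(true); // stop claiming new pages
    return true; // interrupt requested; could also abort in-flight work here
  }
  return false;
}
```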
@rgaudin I think this was likely fixed via 5c931275ed9077e43906be7eefb473c0b644ed2f. Previously, the crawler would wait indefinitely for pywb to finish writing WARC records, which in some cases could result in it never finishing.
@wsdookadr This is a slightly different issue, dealing with interrupts. Yes, currently the interrupt waits for the page to finish. Perhaps we could add a separate mode for more abrupt interruption. The logic there is a bit confusing and could probably use more documentation.
I'll close this for now as I suspect it's fixed, but feel free to comment if it happens again.