ArchiveBot
Failing URL list download can result in infinite loop and may stall the entire pipeline
Two weeks ago, I was experimenting with alternatives to transfer.sh, which has been very unstable lately, and launched job 5ivs1btcd49uxizyj3nll0azt. It turns out that the pipeline did not like this at all:
Exception raised in DownloadUrlFile: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:720)
I haven't looked into what causes this exactly, but I assume it's an incompatibility in the SSL/TLS protocol versions supported by the pipeline and the server. But that would be another issue...
The pipeline has no limit on the number of retries, and because this is not a temporary error, it has been looping for two weeks now. This is blocking a pipeline slot, though that isn't too bad in this case because, luckily, it's an ao-only pipeline.
But today, things got worse. The entire pipeline suddenly stalled (13:15 UTC), and with it all jobs on it (last activity of each job between 13:24 and 13:27 UTC). As it turned out, the SSL connection on that broken job got stuck (possibly related to the stall issues in wpull, pointing to a more general issue in Python's ssl module rather than a bug in wpull?). The jobs seem to have gotten stuck due to log shipping issues. Here's a GDB traceback on the pipeline process 11 hours after everything stopped responding.
Fortunately, I was able to get things running again (for now) by killing the stuck SSL connection with my kill-wpull-connections script (which isn't actually limited to wpull when using the -p option). However, a single broken job really shouldn't be able to stall the whole pipeline like this.
There are two things here that I think need to be fixed:
- Limit the number of retries on DownloadUrlFile. Fail the job if the download still didn't succeed.
- Move the download to a separate thread or even a separate process so it can't block the entire pipeline and all its wpull processes.
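
To illustrate the second point, here is a minimal sketch of running a blocking download off the main thread with a deadline. It is not ArchiveBot's actual code: `run_with_deadline`, `DownloadTimedOut`, and the `fetch` callable are hypothetical names, and I'm assuming the download can be wrapped in a plain blocking function.

```python
import queue
import threading


class DownloadTimedOut(Exception):
    """Raised when the download produced no result before the deadline."""


def run_with_deadline(fetch, url, timeout):
    """Run fetch(url) in a daemon thread and wait at most `timeout` seconds.

    `fetch` is a hypothetical blocking callable that returns the file body.
    """
    results = queue.Queue(maxsize=1)

    def worker():
        # Hand either the result or the exception back to the caller.
        try:
            results.put(("ok", fetch(url)))
        except Exception as error:
            results.put(("error", error))

    # daemon=True so a stuck worker cannot keep the interpreter alive at exit.
    threading.Thread(target=worker, daemon=True).start()

    try:
        status, value = results.get(timeout=timeout)
    except queue.Empty:
        # The worker may still be stuck; it is abandoned here, not killed.
        raise DownloadTimedOut(
            "no result for %s within %.1f s" % (url, timeout)
        ) from None

    if status == "error":
        raise value
    return value
```

A thread can only be abandoned, not killed, so a stuck SSL connection would still linger in the background. A separate process would be the more robust variant because it can be terminated outright, at the cost of passing the result back across the process boundary.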
There's a separate issue already for the infinite loop: #207
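
For the infinite loop itself, a bounded retry could look roughly like the sketch below. Again, this is only an illustration under assumptions: `download_with_retries`, `DownloadFailed`, `max_attempts`, and `base_delay` are names I made up, and the real DownloadUrlFile task would need this wired into however it currently schedules its retries.

```python
import time


class DownloadFailed(Exception):
    """Raised once every attempt has failed; the caller should fail the job."""


def download_with_retries(fetch, url, max_attempts=5, base_delay=1.0,
                          sleep=time.sleep):
    """Call fetch(url) up to `max_attempts` times, then give up.

    `fetch` is a hypothetical blocking callable; `sleep` is injectable so the
    backoff can be skipped in tests.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # real code would catch narrower types
            last_error = error
            if attempt < max_attempts:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** (attempt - 1))
    raise DownloadFailed(
        "giving up on %s after %d attempts: %s"
        % (url, max_attempts, last_error)
    )
```

With something like this, a permanent error such as the handshake failure above would surface as a failed job after a handful of attempts instead of occupying a pipeline slot indefinitely.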