Add random exponential backoff to with_retries: Toil goes rogue if with_retries fails, pipeline should exit
When an operation fails completely inside with_retries, Toil does not exit. Instead, it keeps running as a rogue pipeline:
Cannot connect to LSF. Please wait ...
ls_info: A socket operation has failed: Address already in use
ERROR:toil.batchSystems.abstractGridEngineBatchSystem:Operation <bound method Worker.customGetJobExitCode of <Worker(Thread-321, started 46980713830144)>> failed with code 255:
Exception in thread Thread-321:
Traceback (most recent call last):
File "/bin/python/2.7.11/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/bin/python/.virtualenvs/prod__toil_mutect__v1.2.1/lib/python2.7/site-packages/toil/batchSystems/abstractGridEngineBatchSystem.py", line 235, in run
activity |= self.checkOnJobs()
File "/bin/python/.virtualenvs/prod__toil_mutect__v1.2.1/lib/python2.7/site-packages/toil_container/lsf.py", line 109, in checkOnJobs
status = with_retries(self.customGetJobExitCode, batchJobID, jobID)
File "/bin/python/.virtualenvs/prod__toil_mutect__v1.2.1/lib/python2.7/site-packages/toil/batchSystems/abstractGridEngineBatchSystem.py", line 48, in with_retries
raise latest_err
CalledProcessError: Command '['bjobs', '-l', '4313569']' returned non-zero exit status 255.
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
I think it should either fail or retry checkOnJobs. It might also make sense to add a delay between retries. In this particular case, too many jobs were hitting LSF at the same time, so a delay would have helped. For example:
import logging
import random
import subprocess
import time

logger = logging.getLogger(__name__)


def with_retries(operation, *args, **kwargs):
    """Add a random sleep after each retry."""
    latest_err = Exception
    for i in [2, 3, 5, 10]:
        try:
            return operation(*args, **kwargs)
        except subprocess.CalledProcessError as err:  # pragma: no cover
            # Sleep somewhere between 0 and 2*i seconds before the next attempt.
            time.sleep(i + random.uniform(-i, i))
            latest_err = err
            logger.error(
                "Operation %s failed with code %d: %s",
                operation,
                err.returncode,
                err.output,
            )
    raise latest_err
The delay and retry make sense. I'm not sure whether making this a hard failure would affect other pipelines (@joelarmstrong?).
I agree that either an infinite retry or exiting the program entirely is a good idea. The current behavior of the batch system thread dying but everything else continuing is no good.
I am seeing similar behavior on a cluster using Slurm and Toil 3.18 (specific tool: Progressivecactus, https://github.com/ComparativeGenomicsToolkit/cactus).
The line INFO:toil.leader:Reissued any over long jobs appears in the log without any jobs actually being submitted or anything else happening, and I eventually have to stop the program manually.
Happy to test a fix if there is one.
We can add random exponential backoff to with_retries.
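A minimal sketch of what that could look like, assuming the same with_retries signature as in the snippet above; the attempt count, base delay, and cap (MAX_TRIES, BASE_DELAY, MAX_DELAY) are illustrative names and values, not settled choices:

import logging
import random
import subprocess
import time

logger = logging.getLogger(__name__)

# Illustrative tuning knobs (hypothetical, not existing Toil settings).
MAX_TRIES = 5
BASE_DELAY = 1.0   # seconds
MAX_DELAY = 60.0   # cap on any single sleep


def with_retries(operation, *args, **kwargs):
    """Retry operation with random exponential backoff (full jitter)."""
    latest_err = None
    for attempt in range(MAX_TRIES):
        try:
            return operation(*args, **kwargs)
        except subprocess.CalledProcessError as err:
            latest_err = err
            logger.error(
                "Operation %s failed with code %d: %s",
                operation, err.returncode, err.output,
            )
            # Sleep a random amount in [0, BASE_DELAY * 2**attempt], capped.
            time.sleep(random.uniform(0, min(MAX_DELAY, BASE_DELAY * 2 ** attempt)))
    raise latest_err

Drawing each sleep uniformly from [0, BASE_DELAY * 2**attempt] spreads retries from many workers apart, so they are less likely to all hit LSF at the same moment again.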
But it sounds like there's another problem here that hasn't been diagnosed, maybe in the LSF batch system, where, if an error propagates out of a with_retries call somewhere, one thread apparently stops while the others keep going.
This may also be related to #3014.