Add random exponential backoff to with_retries: Toil goes rogue if with_retries fails, pipeline should exit
When an operation fails completely inside with_retries, Toil does not exit. Instead, it keeps running as a rogue pipeline:
Cannot connect to LSF. Please wait ...
ls_info: A socket operation has failed: Address already in use
ERROR:toil.batchSystems.abstractGridEngineBatchSystem:Operation <bound method Worker.customGetJobExitCode of <Worker(Thread-321, started 46980713830144)>> failed with code 255:
Exception in thread Thread-321:
Traceback (most recent call last):
File "/bin/python/2.7.11/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/bin/python/.virtualenvs/prod__toil_mutect__v1.2.1/lib/python2.7/site-packages/toil/batchSystems/abstractGridEngineBatchSystem.py", line 235, in run
activity |= self.checkOnJobs()
File "/bin/python/.virtualenvs/prod__toil_mutect__v1.2.1/lib/python2.7/site-packages/toil_container/lsf.py", line 109, in checkOnJobs
status = with_retries(self.customGetJobExitCode, batchJobID, jobID)
File "/bin/python/.virtualenvs/prod__toil_mutect__v1.2.1/lib/python2.7/site-packages/toil/batchSystems/abstractGridEngineBatchSystem.py", line 48, in with_retries
raise latest_err
CalledProcessError: Command '['bjobs', '-l', '4313569']' returned non-zero exit status 255.
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
INFO:toil.leader:Reissued any over long jobs
I think it should either fail or retry checkOnJobs. It might also make sense to add a delay between retries. In this particular case, too many jobs were hitting LSF at the same time, so a delay would have helped. For example:
import logging
import random
import subprocess
import time

logger = logging.getLogger(__name__)


def with_retries(operation, *args, **kwargs):
    """Add a random sleep after each retry."""
    latest_err = Exception
    for i in [2, 3, 5, 10]:
        try:
            return operation(*args, **kwargs)
        except subprocess.CalledProcessError as err:  # pragma: no cover
            # Sleep somewhere between 0 and 2*i seconds before the next attempt.
            time.sleep(i + random.uniform(-i, i))
            latest_err = err
            logger.error(
                "Operation %s failed with code %d: %s",
                operation,
                err.returncode,
                err.output,
            )
    raise latest_err
The delay and retry make sense. I'm not sure whether making this a hard failure would affect other pipelines (@joelarmstrong?).
I agree that either an infinite retry or exiting the program entirely is a good idea. The current behavior of the batch system thread dying but everything else continuing is no good.
I am seeing similar behavior on a cluster using Slurm and Toil 3.18 (specific tool: Progressivecactus, https://github.com/ComparativeGenomicsToolkit/cactus).
The line INFO:toil.leader:Reissued any over long jobs appears in the log without any jobs actually being submitted or anything else happening, and I eventually have to stop the program manually.
Happy to test a fix if there is one.
We can add random exponential backoff to with_retries.
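A minimal sketch of what that could look like, assuming the same with_retries signature as in the snippet above; the attempt count, base delay, and cap (MAX_TRIES, BASE_DELAY, MAX_DELAY) are illustrative names and values, not settled choices:

import logging
import random
import subprocess
import time

logger = logging.getLogger(__name__)

# Illustrative tuning knobs (hypothetical, not existing Toil settings).
MAX_TRIES = 5
BASE_DELAY = 1.0   # seconds
MAX_DELAY = 60.0   # cap on any single sleep


def with_retries(operation, *args, **kwargs):
    """Retry operation with random exponential backoff (full jitter)."""
    latest_err = None
    for attempt in range(MAX_TRIES):
        try:
            return operation(*args, **kwargs)
        except subprocess.CalledProcessError as err:
            latest_err = err
            logger.error(
                "Operation %s failed with code %d: %s",
                operation, err.returncode, err.output,
            )
            # Sleep a random amount in [0, BASE_DELAY * 2**attempt], capped.
            time.sleep(random.uniform(0, min(MAX_DELAY, BASE_DELAY * 2 ** attempt)))
    raise latest_err

Drawing each sleep uniformly from [0, BASE_DELAY * 2**attempt] spreads retries from many workers apart, so they are less likely to all hit LSF at the same moment again.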
But it sounds like there's another problem here that hasn't been diagnosed, maybe in the LSF batch system, where, if an error propagates out of a with_retries call somewhere, one thread apparently stops while the others keep going.
This may also be related to #3014.