
Add retry mechanism for Snakemake workflows running in spot contexts

Open · ElDeveloper opened this issue 3 years ago · 1 comment

Description

Running in a spot context is very convenient; unfortunately, when an instance is terminated, an exception is raised and the workflow needs to be restarted.

In the naive case this makes workflows unusable, because you would have to restart the whole workflow from scratch. You can partially work around this by making the workflow write data directly to S3 and to /mnt/efs/snakemake/... (depending on what you are doing), so that completed outputs survive a restart.

This is related to #320, although that issue was closed by a bot.

Use Case

This feature would make long-running jobs in spot contexts usable.

Proposed Solution

Add a flag on the Snakemake command call to attempt to restart a terminated job, or expose a parameter in each rule's resources directive, for example aws_batch_retries:

rule tabulate:
    input:
        sequences="seqs.qza"

    output:
        feature_table="feature_table.qza"

    resources:
        mem_mb=65536,
        _cores=8,
        aws_batch_retries=3

    script:
        "scripts/tabulate.py"
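For context, a per-rule value like this could plausibly map onto AWS Batch's native retryStrategy, which can retry a job automatically when its host is reclaimed by matching the "Host EC2*" statusReason that spot terminations produce (the same statusReason visible in the traceback below). This is a minimal sketch of that mapping; aws_batch_retries is the hypothetical parameter proposed above, not an existing Snakemake or Batch feature:

```python
def build_retry_strategy(aws_batch_retries: int) -> dict:
    """Build an AWS Batch retryStrategy that retries only spot reclamations.

    The returned dict matches the shape Batch expects for the
    retryStrategy field of RegisterJobDefinition.
    """
    return {
        "attempts": aws_batch_retries,
        "evaluateOnExit": [
            # Retry when the host instance was terminated (spot reclamation);
            # Batch sets statusReason to "Host EC2 (instance ...) terminated."
            {"onStatusReason": "Host EC2*", "action": "RETRY"},
            # Any other failure exits immediately instead of burning retries.
            {"onReason": "*", "action": "EXIT"},
        ],
    }


# An executor could then pass this when registering each rule's job
# definition, e.g. with boto3:
#   batch.register_job_definition(..., retryStrategy=build_retry_strategy(3))
```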

Other information

Currently when a spot instance is terminated you see the following exception:

Sat, 14 May 2022 10:06:59 -0700	The log details is {'status': 'FAILED', 'jobId': 'redacted', 'logStreamName': 'redacted'} with status FAILED
Sat, 14 May 2022 10:06:59 -0700	Full Traceback (most recent call last):
Sat, 14 May 2022 10:06:59 -0700	  File "/usr/local/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 734, in _wait_thread
Sat, 14 May 2022 10:06:59 -0700	    self._wait_for_jobs()
Sat, 14 May 2022 10:06:59 -0700	  File "/usr/local/lib/python3.8/site-packages/snakemake/executors/aws_batch.py", line 396, in _wait_for_jobs
Sat, 14 May 2022 10:06:59 -0700	    status_code = self._get_job_status(j)
Sat, 14 May 2022 10:06:59 -0700	  File "/usr/local/lib/python3.8/site-packages/snakemake/executors/aws_batch.py", line 343, in _get_job_status
Sat, 14 May 2022 10:06:59 -0700	    raise WorkflowError(
Sat, 14 May 2022 10:06:59 -0700	snakemake.exceptions.WorkflowError: AWS Batch job interrupted (likely spot instance termination) with error {'jobId': 'redacted', 'statusReason': 'Host EC2 (instance XXXX) terminated.', 'logStreamName': 'redacted'}
Sat, 14 May 2022 10:06:59 -0700	WorkflowError:
Sat, 14 May 2022 10:06:59 -0700	AWS Batch job interrupted (likely spot instance termination) with error {'jobId': 'd9cd7b00-0226-45ff-a13e-970cad073f08', 'statusReason': 'Host EC2 (instance XXXX) terminated.', 'logStreamName': 'redacted'}
Sat, 14 May 2022 10:06:59 -0700	Shutting down, this might take some time.
Sat, 14 May 2022 10:06:59 -0700	shutting down
Sat, 14 May 2022 10:06:59 -0700	de-registering Batch job definition {[redacted]}
Sat, 14 May 2022 10:06:59 -0700	Exiting because a job execution failed. Look above for error message
Sat, 14 May 2022 10:07:00 -0700	Complete log: /mnt/efs/snakemake/fe6c36a2-7261-4ca8-a2a9-3e8e4539d904/.snakemake/log/2022-05-14T153619.508899.snakemake.log
Sat, 14 May 2022 10:07:00 -0700	unlocking
Sat, 14 May 2022 10:07:00 -0700	removing lock
Sat, 14 May 2022 10:07:00 -0700	removing lock
Sat, 14 May 2022 10:07:00 -0700	removed all locks
Sat, 14 May 2022 10:07:00 -0700	=== Running Cleanup ===
Sat, 14 May 2022 10:07:00 -0700	=== Bye! ===

ElDeveloper avatar May 14 '22 20:05 ElDeveloper

You could use the --retries option globally, or the retries directive for specific rules, but this will retry regardless of the type of failure.

nh13 avatar Jul 30 '22 15:07 nh13
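The workaround nh13 describes would look like this (a sketch assuming a recent Snakemake release that supports the retries directive; note it retries on any failure, not just spot termination):

```snakemake
rule tabulate:
    input:
        sequences="seqs.qza"
    output:
        feature_table="feature_table.qza"
    retries: 3  # re-run this rule up to 3 times on any failure
    script:
        "scripts/tabulate.py"
```

Or globally, for every rule in the workflow: snakemake --retries 3 ...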