amazon-genomics-cli
Add retry mechanism for Snakemake workflows running in spot contexts
Description
Running in a spot context is very convenient. Unfortunately, when an instance is terminated an exception is raised and the workflow needs to be restarted.
In the naive case this makes workflows unusable, because you would have to restart the whole workflow. You can work around this by making the workflow write data directly to S3 and to /mnt/efs/snakemake/... (depending on what you are doing).
This is related to #320, although that issue was closed by a bot.
Use Case
This feature would make long-running jobs in spot contexts usable.
Proposed Solution
Add a flag to the Snakemake command call that attempts to restart a terminated job, or define a parameter in each rule's resources directive, for example aws_batch_retries:
rule tabulate:
    input:
        sequences="seqs.qza"
    output:
        feature_table="feature_table.qza"
    resources:
        mem_mb=65536,
        _cores=8,
        aws_batch_retries=3
    script:
        "scripts/tabulate.py"
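One way the proposed resource might be mapped onto AWS Batch is through the job definition's retryStrategy, which can restrict retries to host terminations. The following is only a sketch under assumptions: build_retry_strategy is a hypothetical helper, aws_batch_retries is the proposed (not existing) resource name, and the value is assumed to mean "retries in addition to the first attempt". The "Host EC2*" pattern follows the statusReason seen in the log further down.

```python
from typing import Any, Dict

# Hypothetical resource name taken from the proposal; not an existing Snakemake option.
RETRY_RESOURCE = "aws_batch_retries"


def build_retry_strategy(resources: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a rule's resources into an AWS Batch retryStrategy dict."""
    # Assumption: the resource counts retries on top of the first attempt.
    retries = int(resources.get(RETRY_RESOURCE, 0))
    # AWS Batch accepts between 1 and 10 attempts per job.
    attempts = max(1, min(10, retries + 1))
    return {
        "attempts": attempts,
        "evaluateOnExit": [
            # Retry only when the host instance was terminated (e.g. spot reclaim).
            {"onStatusReason": "Host EC2*", "action": "RETRY"},
            # Any other failure exits without further attempts.
            {"onReason": "*", "action": "EXIT"},
        ],
    }


if __name__ == "__main__":
    print(build_retry_strategy({"mem_mb": 65536, "aws_batch_retries": 3}))
```

A dict of this shape could presumably be passed as the retryStrategy argument when the executor registers its job definition, so that Batch itself resubmits interrupted jobs and Snakemake never sees the failure.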
Other information
Currently when a spot instance is terminated you see the following exception:
Sat, 14 May 2022 10:06:59 -0700 The log details is {'status': 'FAILED', 'jobId': 'redacted', 'logStreamName': 'redacted'} with status FAILED
Sat, 14 May 2022 10:06:59 -0700 Full Traceback (most recent call last):
Sat, 14 May 2022 10:06:59 -0700 File "/usr/local/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 734, in _wait_thread
Sat, 14 May 2022 10:06:59 -0700 self._wait_for_jobs()
Sat, 14 May 2022 10:06:59 -0700 File "/usr/local/lib/python3.8/site-packages/snakemake/executors/aws_batch.py", line 396, in _wait_for_jobs
Sat, 14 May 2022 10:06:59 -0700 status_code = self._get_job_status(j)
Sat, 14 May 2022 10:06:59 -0700 File "/usr/local/lib/python3.8/site-packages/snakemake/executors/aws_batch.py", line 343, in _get_job_status
Sat, 14 May 2022 10:06:59 -0700 raise WorkflowError(
Sat, 14 May 2022 10:06:59 -0700 snakemake.exceptions.WorkflowError: AWS Batch job interrupted (likely spot instance termination) with error {'jobId': 'redacted', 'statusReason': 'Host EC2 (instance XXXX) terminated.', 'logStreamName': 'redacted'}
Sat, 14 May 2022 10:06:59 -0700 WorkflowError:
Sat, 14 May 2022 10:06:59 -0700 AWS Batch job interrupted (likely spot instance termination) with error {'jobId': 'd9cd7b00-0226-45ff-a13e-970cad073f08', 'statusReason': 'Host EC2 (instance XXXX) terminated.', 'logStreamName': 'redacted'}
Sat, 14 May 2022 10:06:59 -0700 Shutting down, this might take some time.
Sat, 14 May 2022 10:06:59 -0700 shutting down
Sat, 14 May 2022 10:06:59 -0700 de-registering Batch job definition {[redacted]}
Sat, 14 May 2022 10:06:59 -0700 Exiting because a job execution failed. Look above for error message
Sat, 14 May 2022 10:07:00 -0700 Complete log: /mnt/efs/snakemake/fe6c36a2-7261-4ca8-a2a9-3e8e4539d904/.snakemake/log/2022-05-14T153619.508899.snakemake.log
Sat, 14 May 2022 10:07:00 -0700 unlocking
Sat, 14 May 2022 10:07:00 -0700 removing lock
Sat, 14 May 2022 10:07:00 -0700 removing lock
Sat, 14 May 2022 10:07:00 -0700 removed all locks
Sat, 14 May 2022 10:07:00 -0700 === Running Cleanup ===
Sat, 14 May 2022 10:07:00 -0700 === Bye! ===
You could use the --retries option globally or the retries directive for specific rules, but this will retry regardless of the type of failure.
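To retry only on spot terminations, the executor would need to tell them apart from ordinary job failures. A minimal sketch of such a check, assuming the job detail dict carries the statusReason shown in the log above; is_spot_interruption and should_retry are hypothetical names, not part of Snakemake:

```python
from typing import Any, Dict

# Prefix of the statusReason reported in the log when the host instance goes away.
HOST_TERMINATED_PREFIX = "Host EC2"


def is_spot_interruption(job_detail: Dict[str, Any]) -> bool:
    """Return True if the failure looks like a host (spot) termination."""
    reason = job_detail.get("statusReason") or ""
    return reason.startswith(HOST_TERMINATED_PREFIX)


def should_retry(job_detail: Dict[str, Any], attempt: int, max_retries: int) -> bool:
    """Retry only spot interruptions, and only while retries remain.

    attempt is the number of retries already used (0 for the first failure).
    """
    return is_spot_interruption(job_detail) and attempt < max_retries


if __name__ == "__main__":
    failed = {
        "jobId": "redacted",
        "statusReason": "Host EC2 (instance XXXX) terminated.",
        "logStreamName": "redacted",
    }
    print(should_retry(failed, attempt=0, max_retries=3))
```

With a check like this, a job that fails because of a bug in the rule's script would still fail immediately, while a reclaimed instance would only cost one resubmission.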