Feature request: detect when spot instances are pre-empted and re-submit

Open aksarkar opened this issue 1 year ago • 3 comments

We typically run Batch workflows on AWS spot instances to take advantage of cost savings when possible.

However, when a redun task is interrupted because its host is terminated, the scheduler halts, potentially losing a lot of work.

It would be helpful for redun to detect this case and re-submit the task without halting, up to the configured maximum number of re-submits.

aksarkar avatar Jul 22 '24 20:07 aksarkar

Alternatively, it would be helpful for the scheduler to stop submitting new tasks and wait for all remaining in-flight tasks to finish before exiting.

aksarkar avatar Jul 23 '24 15:07 aksarkar

Does using the retries option work for your use case? https://insitro.github.io/redun/config.html#retries

You can also catch errors using the catch() task. This can be done to implement more dynamic retry or recovery workflows.

> Alternatively, it would be helpful for the scheduler to stop submitting new tasks and wait for all remaining in-flight tasks to finish before exiting.

For behavior like this, we have something called catch_all(). It works for the specific case of evaluating tasks in a list, accumulating errors, and at the end allowing the user to decide what to do (fail, partial retry, etc).

I have thought about whether it's possible to define a different mode for error propagation in general. The current mode is eager raising, where one task failing causes all active sibling jobs to be abandoned, causing the workflow to halt. One could imagine an opt-in mode that allows sibling tasks to finish as much as possible before terminating the workflow. If you have ideas on syntax or examples from other workflow engines, I would be interested to hear them.

mattrasmus avatar Jul 23 '24 22:07 mattrasmus

Regarding retries: I am unsure that this will do what I want, which is to re-submit a job only when the last line of the CloudWatch logs indicates that the instance was terminated.

In cases where there was an unrecoverable error (MemoryError, AssertionError, etc.) I do want the behavior where the workflow halts (eventually).
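To make the requested policy concrete, here is a minimal sketch in plain Python, independent of redun. Everything in it is an assumption for illustration: `JobFailure` is a hypothetical error type that carries the failed job's log lines, `SPOT_MARKER` is a placeholder for whatever text the logs actually contain, and `run_with_resubmit` is not a redun or AWS API.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

# Assumed marker text; the real message in the job logs may differ.
SPOT_MARKER = "instance was terminated"


class JobFailure(Exception):
    """Hypothetical error type carrying the failed job's log lines."""

    def __init__(self, message: str, log_lines: List[str]):
        super().__init__(message)
        self.log_lines = log_lines


def is_preempted(failure: JobFailure, marker: str = SPOT_MARKER) -> bool:
    """Return True when the last log line suggests the host was terminated."""
    return bool(failure.log_lines) and marker in failure.log_lines[-1]


def run_with_resubmit(submit: Callable[[], T], max_resubmits: int) -> T:
    """Call submit(); re-submit only after a pre-emption, up to max_resubmits times."""
    resubmits = 0
    while True:
        try:
            return submit()
        except JobFailure as failure:
            # Re-raise on unrecoverable failures or when the budget is spent.
            if not is_preempted(failure) or resubmits >= max_resubmits:
                raise
            resubmits += 1
```

Only `JobFailure` is caught, so errors such as MemoryError or AssertionError raised directly by the task still propagate and halt the caller.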

When using catch, am I able to easily get the CloudWatch logs for the failing job?

Regarding the alternative I mentioned, one example is GNU parallel's --halt soon,fail=1. The current behavior in redun is analogous to --halt now,fail=1.
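For readers unfamiliar with the distinction, the "halt soon" behavior can be sketched with the Python standard library. This is only an illustration of the semantics, assuming tasks are zero-argument callables run in a thread pool; `run_halt_soon` is a hypothetical helper, not part of redun or GNU parallel.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, List, Tuple


def run_halt_soon(
    tasks: List[Callable[[], object]], max_workers: int = 2
) -> Tuple[List[object], List[BaseException]]:
    """Run tasks; after the first failure, start no new ones but let running ones finish."""
    results: List[object] = []
    errors: List[BaseException] = []
    pending = list(tasks)
    in_flight = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or in_flight:
            # Top up the pool only while no task has failed.
            while pending and not errors and len(in_flight) < max_workers:
                in_flight.add(pool.submit(pending.pop(0)))
            if not in_flight:
                break  # a failure occurred and nothing is running anymore
            done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            for future in done:
                error = future.exception()
                if error is None:
                    results.append(future.result())
                else:
                    errors.append(error)
    return results, errors
```

The caller receives both the completed results and the collected errors, so it can decide whether to fail or retry only the tasks that never started. A "halt now" variant would instead abandon the in-flight tasks as soon as the first error is seen.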

I would suggest implementing it as options in the redun.ini scheduler section.

aksarkar avatar Jul 24 '24 21:07 aksarkar