Proposal: failureStrategy for DAGs and expandable templates

Open simster7 opened this issue 4 years ago • 1 comments

Summary

Support more advanced strategies for handling failures in a DAG or Steps template.

Use Cases

Currently a DAG has failFast, and recently templates also support failFast in conjunction with parallelism and with{Items,Param}, etc. However, this could be extended further.

Something like:

failureStrategy:
  when: "{{numberFailed}} > 2 || {{numberSkipped}} > 0"
  terminateRunningPods: true     # or `false` to allow them to complete
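
To make the placement concrete, here is a rough sketch of how the proposed block might sit inside a DAG template. The `failureStrategy` field, its position under `dag`, and the `{{numberFailed}}` / `{{numberSkipped}}` variables are the hypothetical parts proposed in this issue, so Argo would not accept this manifest today. The template names, task names, and image are placeholders chosen for illustration.

```yaml
# Sketch only: failureStrategy is the field proposed in this issue,
# NOT an existing Argo Workflows field.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: failure-strategy-sketch-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        # Hypothetical: assumed to live at the same level as the existing failFast flag.
        failureStrategy:
          # Fail the DAG once more than two tasks have failed or any task was skipped.
          when: "{{numberFailed}} > 2 || {{numberSkipped}} > 0"
          # Stop pods that are still running; `false` would let them complete.
          terminateRunningPods: true
        tasks:
          - name: task-a        # placeholder task names
            template: work
          - name: task-b
            template: work
          - name: task-c
            template: work
    - name: work                # placeholder template
      container:
        image: alpine:3.18
        command: [sh, -c]
        args: ["echo working"]
```

Whether the same block would also apply to Steps templates and to with{Items,Param} expansions is left open here, as in the proposal itself.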

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

simster7, Mar 15 '21 01:03

My ideal failFast behavior is: the workflow should retry tasks that have retryStrategy set, but once all retries are exhausted, the workflow should stop scheduling other tasks and fail.

I'm on Argo 3.1, and I can't get this behavior. Is it possible, or is it expected that this doesn't work?

I tried the following permutations:

  • by default without any explicit configuration, my workflows don't fail fast: if a task fails, Argo tries to run the rest of the workflow instead of failing sooner.
  • when I add failFast: true to all my dag templates*, I do get failFast behavior, but then failed tasks don't retry, and the first task failure results in the workflow failing. This is failing too fast.
  • when I add failFast: true to only 1 of the dag templates, retries work but then I don't get failFast behavior. I tried adding failFast to each dag template separately, and all had the same behavior.

*all my dag templates are these 4:

  1. the outermost dag. this is the entrypoint, and just calls exit-handler-1
  2. exit-handler-1: just contains subgraph-2
  3. subgraph-2: contains my actual workflow tasks
  4. for-loop-4: I have a for-loop for 5-fold cross validation
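
For anyone trying to reproduce this, a minimal sketch of the combination described above (failFast: true on a dag template, with retryStrategy on the task template) might look like the following. This is not my actual pipeline: the template name reuses subgraph-2 only for orientation, and the task names, image, and retry limit are made-up placeholders.

```yaml
# Minimal reproduction sketch; all names and values are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: failfast-retry-sketch-
spec:
  entrypoint: subgraph-2
  templates:
    - name: subgraph-2
      dag:
        failFast: true            # set on the dag template, as in the second bullet above
        tasks:
          - name: flaky           # always fails, so it should be retried
            template: always-fails
          - name: slow            # independent task that runs alongside the flaky one
            template: sleeps
          - name: after-slow      # ideally never scheduled once retries are exhausted
            dependencies: [slow]
            template: sleeps
    - name: always-fails
      retryStrategy:
        limit: 3                  # placeholder retry count
      container:
        image: alpine:3.18
        command: [sh, -c]
        args: ["exit 1"]
    - name: sleeps
      container:
        image: alpine:3.18
        command: [sh, -c]
        args: ["sleep 60"]
```

The behavior I'm after is that `flaky` is retried up to its limit, and only after the last retry fails does the workflow stop scheduling `after-slow` and fail.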

jli, Aug 19 '22 00:08