Proposal: failureStrategy for DAGs and expandable templates
Summary
Support more advanced strategies for handling failures in a DAG or Steps template.
Use Cases
Currently a DAG has failFast, and recently templates also support failFast in conjunction with parallelism and with{Items,Param}, etc. However, this could be extended further.
Something like:
failureStrategy:
  when: "{{numberFailed}} > 2 || {{numberSkipped}} > 0"
  terminateRunningPods: true # or `false` to allow them to complete
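To make the proposal concrete, here is a rough sketch of how the field might sit inside a DAG template. Everything about `failureStrategy` here is hypothetical: the field, the `numberFailed` and `numberSkipped` variables, and its placement next to `tasks` are assumptions based only on the snippet above, and none of them exist in Argo Workflows today. The template names and image are placeholders.

```yaml
# Hypothetical sketch: `failureStrategy` is the field proposed in this issue.
# It is NOT part of the current Argo Workflows spec.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: failure-strategy-sketch-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        # Assumed placement: where the boolean `failFast` lives today.
        failureStrategy:
          # Proposed variables; stop once more than 2 tasks failed
          # or any task was skipped.
          when: "{{numberFailed}} > 2 || {{numberSkipped}} > 0"
          terminateRunningPods: true # or `false` to allow them to complete
        tasks:
          - name: a
            template: work
          - name: b
            template: work
    - name: work # placeholder task template
      container:
        image: alpine:3.19
        command: [sh, -c, "exit 0"]
```

Under this reading, today's `failFast: true` would roughly correspond to a `when` of `{{numberFailed}} > 0`, though that mapping is an interpretation rather than part of the proposal.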
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
My ideal failFast behavior is: the workflow should retry tasks that have retryStrategy set, but once all retries are exhausted, the workflow should stop scheduling other tasks and fail.
I'm on Argo 3.1, and I can't get this behavior. Is it possible, or is it expected that this doesn't work?
I tried the following permutations:
- by default without any explicit configuration, my workflows don't fail fast: if a task fails, Argo tries to run the rest of the workflow instead of failing sooner.
- when I add `failFast: true` to all my dag templates*, I do get failFast behavior, but then failed tasks don't retry, and the first task failure results in the workflow failing. This is failing too fast.
- when I add `failFast: true` to only 1 of the dag templates, retries work but then I don't get failFast behavior. I tried adding failFast to each dag template separately, and all had the same behavior.
*all my dag templates are these 4:
- the outermost dag. this is the entrypoint, and just calls `exit-handler-1`
- `exit-handler-1`: just contains `subgraph-2`
- `subgraph-2`: contains my actual workflow tasks
- `fol-loop-4`: i have a for-loop for 5-fold cross validation
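For anyone trying to reproduce this, a minimal sketch of the combination being described might look like the following. It is a simplified stand-in, not the actual workflow: the template names, image, and commands are placeholders, and it uses a single DAG rather than the four nested ones listed above. It relies only on the existing `failFast` field on a DAG template and `retryStrategy.limit` on a task template.

```yaml
# Simplified reproduction sketch; names and commands are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: failfast-retry-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        failFast: true # stop scheduling new tasks once a task has failed
        tasks:
          - name: flaky
            template: flaky
          - name: slow
            template: slow
    - name: flaky
      retryStrategy:
        limit: 2 # retry up to 2 times after the first failure
      container:
        image: alpine:3.19
        command: [sh, -c, "exit 1"] # always fails, so retries get exhausted
    - name: slow
      container:
        image: alpine:3.19
        command: [sh, -c, "sleep 300"]
```

The behavior asked for above would be: `flaky` is retried until its limit is exhausted, and only then does the workflow stop scheduling further tasks and fail.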