Workflows that failed before upgrade to 3.5.6 fail to retry
Pre-requisites
- [X] I have double-checked my configuration
- [X] I have tested with the `:latest` image tag (i.e. `quay.io/argoproj/workflow-controller:latest`) and can confirm the issue still exists on `:latest`. If not, I have explained why, in detail, in my description below.
- [X] I have searched existing issues and could not find a match for this bug
- [ ] I'd like to contribute the fix myself (see contributing guide)
What happened/what did you expect to happen?
- We were on 3.5.5
- Some workflows failed
- Upgraded to 3.5.6
- Retried some of the workflows (roughly the command sequence sketched below)

Result:
- The failed tasks within the workflows disappeared, keeping only the successful ones
- The workflows got stuck in a Running state
- The workflows are still shown in the failed list
- Unable to retry again, because the workflow is Running
- Unable to stop
Reproducible on any workflow
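For reference, this is roughly the sequence we ran; the manifest filename and workflow names are placeholders, not our actual ones:

```bash
# Submit a workflow (any workflow with a failing task will do) and let it fail
argo submit -n argo --watch dag-diamond.yaml

# After upgrading the controller to 3.5.6, retry the failed workflow:
# the failed tasks disappear and the workflow hangs in Running
argo retry -n argo dag-diamond-xxxxx

# Further attempts are rejected because the workflow now appears to be Running
argo retry -n argo dag-diamond-xxxxx
argo stop -n argo dag-diamond-xxxxx
```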
Version
v3.5.6
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
any workflow reproduces it
Logs from the workflow controller
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
Logs from your workflow's wait container
kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
I tested this purely on 3.5.6, and it fails if you attempt to retry this deliberately broken dag diamond.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-diamond-
spec:
  entrypoint: diamond
  templates:
  - name: diamond
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        depends: "B && C"
        template: eacho
        arguments:
          parameters: [{name: message, value: D}]
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
  - name: eacho
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [eacho, "{{inputs.parameters.message}}"]
```
A link to the Slack discussion: https://cloud-native.slack.com/archives/C01QW9QSSSK/p1714641906410049
Another piece of info: after rolling back to 3.5.5, retrying runs the workflow properly, but by the end it gets stuck without the status changing to a finished state.
For me, the workflow above that reproduces the issue on 3.5.6 doesn't reproduce it on 3.5.5. It may be that retrying workflows which have been touched by 3.5.6 is part of the problem, so recreate them fresh instead.
Ignore that last comment; it doesn't go wrong for me at all in a really basic Argo Workflows installation. 3.5.6 will retry happily there. I'll try to determine what the difference is with our production 3.5.6 and why it only fails there.
Our production has `workflowDefaults` which enables a `retryStrategy` for everything.
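For context, a sketch of the sort of controller configuration we mean, assuming the defaults are applied through the workflow-controller ConfigMap's `workflowDefaults` / `templateDefaults` (our real values differ):

```yaml
# Illustrative only - not our actual config. Applies a retryStrategy to every
# template of every workflow via the controller's workflowDefaults.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  workflowDefaults: |
    spec:
      templateDefaults:
        retryStrategy:
          limit: 2
          retryPolicy: OnError
```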
The following reproduces the error - note the `retryStrategy` on the top-level template. Remove that, or place retries elsewhere, and it will work.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-diamond-
spec:
  entrypoint: diamond
  templates:
  - name: diamond
    retryStrategy:
      limit: 2
      retryPolicy: OnError
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        depends: "B && C"
        template: eacho
        arguments:
          parameters: [{name: message, value: D}]
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
  - name: eacho
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [eacho, "{{inputs.parameters.message}}"]
```
This works correctly with 3.5.5
This was broken by #12817.
I feel like this has to be related to the root cause I mentioned in https://github.com/argoproj/argo-workflows/pull/12817#pullrequestreview-1995821336, although the PR itself did not touch (automated) retry nodes. The manual retry logic needs a refactor in general.
We should also add all of these failing scenarios as test cases.
I do think the retry node needs to be skipped when checking whether the descendants have succeeded nodes, since it is virtual; a rough sketch of what I mean follows.
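To illustrate (this is only a sketch against the public node types, not the controller's actual manual-retry code, and `hasSuccessfulDescendant` is a made-up helper name):

```go
package retrysketch

import (
	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

// hasSuccessfulDescendant reports whether any descendant of the given node has
// succeeded. Retry nodes are virtual wrappers around their attempts, so they
// are looked through (skipped) rather than counted themselves.
// Illustrative sketch only - not the controller's actual implementation.
func hasSuccessfulDescendant(nodes wfv1.Nodes, nodeID string) bool {
	node, ok := nodes[nodeID]
	if !ok {
		return false
	}
	for _, childID := range node.Children {
		child, ok := nodes[childID]
		if !ok {
			continue
		}
		if child.Type == wfv1.NodeTypeRetry {
			// Virtual node: don't inspect its phase, just descend into it.
			if hasSuccessfulDescendant(nodes, childID) {
				return true
			}
			continue
		}
		if child.Phase == wfv1.NodeSucceeded || hasSuccessfulDescendant(nodes, childID) {
			return true
		}
	}
	return false
}
```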
> The following reproduces the error - note the `retryStrategy` on the top-level template.

To clarify, this will happen even when no retry was needed, correct? Or does it only occur if a retry is triggered? As in, the existence of a `retryStrategy` (with any configuration) on a template invocator (i.e. DAG or steps) causes it, not whether it was actually retried based on its `retryPolicy` or `expression`.
This requires both a `retryStrategy` and a manual retry attempt, but the `retryStrategy` does not need to have been exercised; we just need the retry virtual node to be present in the node tree. I don't believe the actual `retryStrategy` configuration matters at all.
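For illustration, a hypothetical and heavily trimmed `status.nodes` excerpt showing the kind of virtual Retry node meant here (node names, IDs, and fields are invented or omitted):

```yaml
# Hypothetical excerpt - not taken from a real workflow.
status:
  nodes:
    dag-diamond-abcde:          # virtual Retry node for the entrypoint template
      displayName: dag-diamond-abcde
      templateName: diamond
      type: Retry
      phase: Failed
      children:
        - dag-diamond-abcde-1   # the real DAG node (attempt 0)
    dag-diamond-abcde-1:
      displayName: dag-diamond-abcde(0)
      templateName: diamond
      type: DAG
      phase: Failed
      # task nodes A-D omitted
```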