fix: Only reset root node of group type. Fixes #13196
Fixes #13196

### Motivation

A single-node workflow with a failed exit handler cannot be retried correctly.

### Modifications

Only reset the root node if it is of a group type (`DAG`, `Steps`, etc.).
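The modification can be sketched as a type check before the reset (a hypothetical sketch, not the actual diff; the `NodeType` constants mirror Argo's node types, and `isGroupNodeType` is an illustrative helper name):

```go
package main

import "fmt"

// NodeType mirrors the node types used by Argo Workflows.
type NodeType string

const (
	NodeTypePod       NodeType = "Pod"
	NodeTypeSteps     NodeType = "Steps"
	NodeTypeStepGroup NodeType = "StepGroup"
	NodeTypeDAG       NodeType = "DAG"
	NodeTypeTaskGroup NodeType = "TaskGroup"
)

// isGroupNodeType reports whether a node is a "group" node (DAG, Steps,
// StepGroup, TaskGroup) that only aggregates children and runs no pod.
func isGroupNodeType(t NodeType) bool {
	switch t {
	case NodeTypeDAG, NodeTypeSteps, NodeTypeStepGroup, NodeTypeTaskGroup:
		return true
	}
	return false
}

func main() {
	// On retry, only reset the root node back to Running when it is a
	// group node; a Pod-type root keeps its completed phase.
	for _, t := range []NodeType{NodeTypePod, NodeTypeSteps, NodeTypeDAG} {
		fmt.Printf("root type %s -> reset: %v\n", t, isGroupNodeType(t))
	}
}
```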
### Verification

Local test and e2e test.
Sure.
/retest
Thanks for fixing. Would you mind taking me through what happened to help me understand? I don't know how the exit handler is related.
First, the root node got reset to "Running" as you mentioned. So, how did the node being in "Running" cause the retry to fail? (any links into the code if possible will help)
Thanks!
> First, the root node got reset to "Running" as you mentioned. So, how did the node being in "Running" cause the retry to fail? (any links into the code if possible will help)
The root node of `Pod` type got reset to `Running` from the `Succeeded` state, which is not as expected:
- Firstly, it should not re-run, because it already `Succeeded`.
- Secondly, its pod is completed, so this node cannot run correctly: a new pod will not be created because one already exists, and the node then turns to `Error` with the message `pod deleted`.
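The failure mode described above can be modeled roughly like this (an illustrative sketch, not the controller's actual code; `reconcilePodNode` and its arguments are made up):

```go
package main

import "fmt"

// reconcilePodNode is a toy model of what happens when a Pod-type node is
// wrongly reset to Running even though its pod already ran to completion.
func reconcilePodNode(nodePhase string, podExists bool) string {
	if nodePhase != "Running" {
		// Completed node: nothing to reconcile, the phase stands.
		return nodePhase
	}
	if podExists {
		// A new pod is not created because one with the same name already
		// exists; once that completed pod is cleaned up, the node ends up
		// in Error with the message "pod deleted".
		return "Error: pod deleted"
	}
	return "Running"
}

func main() {
	// The buggy retry reset the Succeeded root node to Running while its
	// completed pod still existed:
	fmt.Println(reconcilePodNode("Running", true))
	// Leaving the node in Succeeded (the fix) avoids the re-run entirely:
	fmt.Println(reconcilePodNode("Succeeded", false))
}
```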
Got it, thanks. I see the difference: a group node would be a DAG, Steps, etc., which doesn't run a pod itself.
And regarding the Description, does it matter that there was an exit handler, or could it just have easily been a single node workflow that failed, and would've had the same issue?
> And regarding the Description, does it matter that there was an exit handler, or could it just have easily been a single node workflow that failed, and would've had the same issue?
If there is only a single node and no exit handler, the workflow will be `Succeeded`, which is not retryable, so it will not cause any problems.
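The retryability rule assumed here can be sketched as follows (a minimal illustration; the names are made up rather than Argo's actual API, but `argo retry` does only apply to workflows that finished unsuccessfully):

```go
package main

import "fmt"

// WorkflowPhase mirrors Argo's workflow phases.
type WorkflowPhase string

const (
	WorkflowRunning   WorkflowPhase = "Running"
	WorkflowSucceeded WorkflowPhase = "Succeeded"
	WorkflowFailed    WorkflowPhase = "Failed"
	WorkflowError     WorkflowPhase = "Error"
)

// isRetryable reports whether `argo retry` applies: only workflows that
// finished unsuccessfully can be retried.
func isRetryable(p WorkflowPhase) bool {
	return p == WorkflowFailed || p == WorkflowError
}

func main() {
	for _, p := range []WorkflowPhase{WorkflowSucceeded, WorkflowFailed, WorkflowRunning} {
		fmt.Printf("%s retryable: %v\n", p, isRetryable(p))
	}
}
```

This is why the exit handler matters in the original report: without it, a succeeded single-node workflow is `Succeeded` and the retry path is never reached.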
I'm confused. Is this not a single-node Workflow?:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: entrypoint-fail
spec:
  entrypoint: failing-entrypoint
  templates:
    - name: failing-entrypoint
      container:
        image: alpine:3.18
        command: [sh, -c]
        args: ["exit 1"]
```
> I'm confused. Is this not a single-node Workflow?
Yes, this is a single-node workflow. But the root node is of `Pod` type, which does not need to be reset, since its pod will be deleted on manual retry anyway:
```
# argo retry entrypoint-fail
INFO[2024-08-25T00:25:22.288Z] Deleting pod    podDeleted=entrypoint-fail
INFO[2024-08-25T00:25:22.295Z] Workflow to be dehydrated    Workflow Size=902
Name:                entrypoint-fail
Namespace:           argo
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Running
Conditions:
 PodRunning          False
 Completed           False
Created:             Sun Aug 25 00:24:35 +0800 (47 seconds ago)
Started:             Sun Aug 25 00:25:22 +0800 (now)
Duration:            0 seconds
Progress:            0/1
ResourcesDuration:   0s*(1 cpu),2s*(100Mi memory)
```
So, just trying to understand - this statement isn't necessarily true in all cases, right?:
> If there is only a single node and no exit handler, the workflow will be `Succeeded`, which is not retryable, so it will not cause any problems.
since the Workflow I have above is an example of a Workflow that did not succeed, right? But you're saying that in my example, the root node is of type Pod?
And in your example what type is the Root node? I imagine there are many nodes in yours, right? Can you help me understand what the tree looks like for that?
Thank you for bearing with me. I'm guessing your change is a good one. I just want to make sure I fully understand.
Okay, I just decided to try running yours. This is the result:
```yaml
nodes:
  workflow-exit-handler-fail:
    displayName: workflow-exit-handler-fail
    finishedAt: "2024-08-25T18:42:37Z"
    hostNodeName: lima-rancher-desktop
    id: workflow-exit-handler-fail
    name: workflow-exit-handler-fail
    outputs:
      artifacts:
        - name: main-logs
          s3:
            key: workflow-exit-handler-fail/workflow-exit-handler-fail/main.log
      exitCode: "0"
    phase: Succeeded
    progress: 1/1
    resourcesDuration:
      cpu: 0
      memory: 5
    startedAt: "2024-08-25T18:42:28Z"
    templateName: echo
    templateScope: local/workflow-exit-handler-fail
    type: Pod
  workflow-exit-handler-fail-925896592:
    children:
      - workflow-exit-handler-fail-1000917866
    displayName: workflow-exit-handler-fail.onExit
    finishedAt: "2024-08-25T18:42:47Z"
    id: workflow-exit-handler-fail-925896592
    message: child 'workflow-exit-handler-fail-2100426797' failed
    name: workflow-exit-handler-fail.onExit
    nodeFlag:
      hooked: true
    outboundNodes:
      - workflow-exit-handler-fail-2100426797
    phase: Failed
    progress: 0/1
    resourcesDuration:
      cpu: 0
      memory: 2
    startedAt: "2024-08-25T18:42:40Z"
    templateName: exit-handler
    templateScope: local/workflow-exit-handler-fail
    type: Steps
  workflow-exit-handler-fail-1000917866:
    boundaryID: workflow-exit-handler-fail-925896592
    children:
      - workflow-exit-handler-fail-2100426797
    displayName: '[0]'
    finishedAt: "2024-08-25T18:42:47Z"
    id: workflow-exit-handler-fail-1000917866
    message: child 'workflow-exit-handler-fail-2100426797' failed
    name: workflow-exit-handler-fail.onExit[0]
    nodeFlag: {}
    phase: Failed
    progress: 0/1
    resourcesDuration:
      cpu: 0
      memory: 2
    startedAt: "2024-08-25T18:42:40Z"
    templateScope: local/workflow-exit-handler-fail
    type: StepGroup
  workflow-exit-handler-fail-2100426797:
    boundaryID: workflow-exit-handler-fail-925896592
    displayName: exit-handler-task
    finishedAt: "2024-08-25T18:42:44Z"
    hostNodeName: lima-rancher-desktop
    id: workflow-exit-handler-fail-2100426797
    message: Error (exit code 1)
    name: workflow-exit-handler-fail.onExit[0].exit-handler-task
    outputs:
      artifacts:
        - name: main-logs
          s3:
            key: workflow-exit-handler-fail/workflow-exit-handler-fail-fail-2100426797/main.log
      exitCode: "1"
    phase: Failed
    progress: 0/1
    resourcesDuration:
      cpu: 0
      memory: 2
    startedAt: "2024-08-25T18:42:40Z"
    templateName: fail
    templateScope: local/workflow-exit-handler-fail
    type: Pod
```
So, the root is of `Pod` type. And in addition there's a `Steps` node with a child `StepGroup`, which has a child `Pod`, representing the exit handler.
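For reference, the parent/child relations in the status dump above can be reconstructed into a tree (a small illustrative sketch; nodes are abbreviated to display name, type, and phase):

```go
package main

import "fmt"

// children maps each node to its children, taken from the `children`
// fields in the status dump above. The root Pod node has no children;
// the onExit Steps node is a hook, not a child of the root.
var children = map[string][]string{
	"workflow-exit-handler-fail (Pod, Succeeded)":       nil,
	"workflow-exit-handler-fail.onExit (Steps, Failed)": {"[0] (StepGroup, Failed)"},
	"[0] (StepGroup, Failed)":                           {"exit-handler-task (Pod, Failed)"},
	"exit-handler-task (Pod, Failed)":                   nil,
}

// tree renders a node and its descendants, two spaces per level.
func tree(node, indent string) string {
	out := indent + node + "\n"
	for _, c := range children[node] {
		out += tree(c, indent+"  ")
	}
	return out
}

func main() {
	// Two top-level nodes: the main root, and the onExit hook.
	fmt.Print(tree("workflow-exit-handler-fail (Pod, Succeeded)", ""))
	fmt.Print(tree("workflow-exit-handler-fail.onExit (Steps, Failed)", ""))
}
```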
The thing I'm not clear on is why this special logic for resetting the root node even exists at all. Do we know why it's not sufficient to just catch all errored GroupNodes here (including the root node)? Is it important to also reset a Succeeded root node, not just an Errored one?
> And in your example what type is the Root node? I imagine there are many nodes in yours, right? Can you help me understand what the tree looks like for that?
The phase of the root node and the workflow are usually the same if there is no exit handler.
The type of the root node is usually `Steps` or `DAG`, since a workflow usually has many steps or tasks.
My example is very special in that it breaks both of the above rules:
- the root node is of `Pod` type, because there is only one node;
- the root node is `Succeeded` but the workflow is `Failed`, because the exit handler failed.

The workflow is `Failed`, so I can retry it; the root node of `Pod` type is then reset to `Running` from `Succeeded`, which is unacceptable.
> The thing I'm not clear on is why this special logic for resetting the root node even exists at all. Do we know why it's not sufficient to just catch all errored GroupNodes here (including the root node)?
I didn't notice this before. I agree with you that this special logic for resetting the root node seems unnecessary now, since `Steps` belongs to the group node types.
> Is it important to also reset a Succeeded root node, not just an Errored one?
Perhaps to keep it consistent with the workflow phase?
Hey @terrytangyuan - I know this code was written a couple years back but do you have any input for this question?
It should be fine to remove that special handling.
> It should be fine to remove that special handling.
Thanks for the quick response @terrytangyuan !