fix: Only reset root node of group type. Fixes #13196

jswxstw opened this pull request 1 year ago • 8 comments

Fixes #13196

Motivation

A single-node workflow with a failed exit handler cannot be retried correctly.

Modifications

Only reset root node of group type.
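
A minimal sketch of the idea, assuming the node-type constants from pkg/apis/workflow/v1alpha1; the helper name isGroupNodeType is illustrative, not the literal diff in workflow/util/util.go:

package util

import (
    wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

// isGroupNodeType reports whether a node type is a group type (DAG, Steps,
// StepGroup, TaskGroup). Group nodes run no pod of their own, so resetting
// them back to Running on retry is safe; a Pod-type root node keeps its
// phase and is handled by the normal pod deletion/re-creation path.
func isGroupNodeType(t wfv1.NodeType) bool {
    switch t {
    case wfv1.NodeTypeDAG, wfv1.NodeTypeSteps,
        wfv1.NodeTypeStepGroup, wfv1.NodeTypeTaskGroup:
        return true
    }
    return false
}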

Verification

Local tests and e2e tests.

jswxstw avatar Jun 17 '24 07:06 jswxstw

Sure.

jswxstw avatar Jun 18 '24 04:06 jswxstw

/retest

jswxstw avatar Jun 18 '24 06:06 jswxstw

Thanks for fixing. Would you mind taking me through what happened to help me understand? I don't know how the exit handler is related.

First, the root node got reset to "Running" as you mentioned. So, how did the node being in "Running" cause the retry to fail? (any links into the code if possible will help)

Thanks!

juliev0 avatar Aug 23 '24 21:08 juliev0

First, the root node got reset to "Running" as you mentioned. So, how did the node being in "Running" cause the retry to fail? (any links into the code if possible will help)

The root node of Pod type got reset from Succeeded to Running, which is not expected:

  • Firstly, it should not re-run, because it has already Succeeded.
  • Secondly, its pod is completed, so this node cannot run correctly: a new pod will not be created because it already exists, and the node then turns to Error with the message pod deleted (see the sketch below).
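
A simplified sketch of that second point (an assumption about the controller's reconciliation behavior, not the exact code): a Pod-type node that has not finished but whose pod no longer exists is marked Error with the message pod deleted.

package controller

import (
    wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

// markDeletedPods is a simplified, hypothetical sketch of the behavior
// described above: any Pod-type node that has not finished but whose pod
// no longer exists is marked Error with the message "pod deleted".
func markDeletedPods(nodes wfv1.Nodes, livePods map[string]bool) {
    for id, node := range nodes {
        if node.Type == wfv1.NodeTypePod && !node.Fulfilled() && !livePods[id] {
            node.Phase = wfv1.NodeError
            node.Message = "pod deleted"
            nodes[id] = node
        }
    }
}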

jswxstw avatar Aug 24 '24 02:08 jswxstw

Got it, thanks. I see the difference: a Group node would be a DAG, Steps, etc., which doesn't itself run a pod.

And regarding the Description, does it matter that there was an exit handler, or could it just as easily have been a single-node workflow that failed and would've had the same issue?

juliev0 avatar Aug 24 '24 04:08 juliev0

And regarding the Description, does it matter that there was an exit handler, or could it just as easily have been a single-node workflow that failed and would've had the same issue?

If there is only a single node without an exit handler, the workflow will be Succeeded, which is not retryable, so it will not cause any problems.

jswxstw avatar Aug 24 '24 13:08 jswxstw

I'm confused. Is this not a single node Workflow?:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: entrypoint-fail
spec:
  entrypoint: failing-entrypoint
  templates:
  - name: failing-entrypoint
    container:
      image: alpine:3.18
      command: [sh, -c]
      args: ["exit 1"]

juliev0 avatar Aug 24 '24 16:08 juliev0

I'm confused. Is this not a single node Workflow?:

Yes, this is a single-node workflow. But the root node is of Pod type, which does not need to be reset, since its pod will be deleted on manual retry:

# argo retry entrypoint-fail
INFO[2024-08-25T00:25:22.288Z] Deleting pod                                  podDeleted=entrypoint-fail
INFO[2024-08-25T00:25:22.295Z] Workflow to be dehydrated                     Workflow Size=902
Name:                entrypoint-fail
Namespace:           argo
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Running
Conditions:          
 PodRunning          False
 Completed           False
Created:             Sun Aug 25 00:24:35 +0800 (47 seconds ago)
Started:             Sun Aug 25 00:25:22 +0800 (now)
Duration:            0 seconds
Progress:            0/1
ResourcesDuration:   0s*(1 cpu),2s*(100Mi memory)

jswxstw avatar Aug 24 '24 16:08 jswxstw

So, just trying to understand - this statement isn't necessarily true in all cases, right?:

If there is only a single node without an exit handler, the workflow will be Succeeded, which is not retryable, so it will not cause any problems.

since the Workflow I have above is an example of a Workflow that didn't succeed, right? But you're saying that in my example, the root node is of type Pod?

And in your example what type is the Root node? I imagine there are many nodes in yours, right? Can you help me understand what the tree looks like for that?

Thank you for bearing with me. I'm guessing your change is a good one. I just want to make sure I fully understand.

juliev0 avatar Aug 25 '24 17:08 juliev0

Okay, I just decided to try running yours. This is the result:

  nodes:
    workflow-exit-handler-fail:
      displayName: workflow-exit-handler-fail
      finishedAt: "2024-08-25T18:42:37Z"
      hostNodeName: lima-rancher-desktop
      id: workflow-exit-handler-fail
      name: workflow-exit-handler-fail
      outputs:
        artifacts:
        - name: main-logs
          s3:
            key: workflow-exit-handler-fail/workflow-exit-handler-fail/main.log
        exitCode: "0"
      phase: Succeeded
      progress: 1/1
      resourcesDuration:
        cpu: 0
        memory: 5
      startedAt: "2024-08-25T18:42:28Z"
      templateName: echo
      templateScope: local/workflow-exit-handler-fail
      type: Pod
    workflow-exit-handler-fail-925896592:
      children:
      - workflow-exit-handler-fail-1000917866
      displayName: workflow-exit-handler-fail.onExit
      finishedAt: "2024-08-25T18:42:47Z"
      id: workflow-exit-handler-fail-925896592
      message: child 'workflow-exit-handler-fail-2100426797' failed
      name: workflow-exit-handler-fail.onExit
      nodeFlag:
        hooked: true
      outboundNodes:
      - workflow-exit-handler-fail-2100426797
      phase: Failed
      progress: 0/1
      resourcesDuration:
        cpu: 0
        memory: 2
      startedAt: "2024-08-25T18:42:40Z"
      templateName: exit-handler
      templateScope: local/workflow-exit-handler-fail
      type: Steps
    workflow-exit-handler-fail-1000917866:
      boundaryID: workflow-exit-handler-fail-925896592
      children:
      - workflow-exit-handler-fail-2100426797
      displayName: '[0]'
      finishedAt: "2024-08-25T18:42:47Z"
      id: workflow-exit-handler-fail-1000917866
      message: child 'workflow-exit-handler-fail-2100426797' failed
      name: workflow-exit-handler-fail.onExit[0]
      nodeFlag: {}
      phase: Failed
      progress: 0/1
      resourcesDuration:
        cpu: 0
        memory: 2
      startedAt: "2024-08-25T18:42:40Z"
      templateScope: local/workflow-exit-handler-fail
      type: StepGroup
    workflow-exit-handler-fail-2100426797:
      boundaryID: workflow-exit-handler-fail-925896592
      displayName: exit-handler-task
      finishedAt: "2024-08-25T18:42:44Z"
      hostNodeName: lima-rancher-desktop
      id: workflow-exit-handler-fail-2100426797
      message: Error (exit code 1)
      name: workflow-exit-handler-fail.onExit[0].exit-handler-task
      outputs:
        artifacts:
        - name: main-logs
          s3:
            key: workflow-exit-handler-fail/workflow-exit-handler-fail-fail-2100426797/main.log
        exitCode: "1"
      phase: Failed
      progress: 0/1
      resourcesDuration:
        cpu: 0
        memory: 2
      startedAt: "2024-08-25T18:42:40Z"
      templateName: fail
      templateScope: local/workflow-exit-handler-fail
      type: Pod

So, the root is of Pod type. And in addition there's a Steps node with a child StepGroup with a child Pod, representing the exit handler.
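
To make that tree concrete, here is a hypothetical helper, not part of Argo itself, that renders the parent/child structure using the real wfv1.Nodes and NodeStatus fields:

package debugutil

import (
    "fmt"
    "strings"

    wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

// printTree recursively prints a node and its descendants from a
// Workflow's status.nodes map.
func printTree(nodes wfv1.Nodes, id string, indent int) {
    node, ok := nodes[id]
    if !ok {
        return
    }
    fmt.Printf("%s%s (type=%s, phase=%s)\n",
        strings.Repeat("  ", indent), node.DisplayName, node.Type, node.Phase)
    for _, child := range node.Children {
        printTree(nodes, child, indent+1)
    }
}

// For the dump above:
//
//  printTree(nodes, "workflow-exit-handler-fail", 0)           // just the Pod root
//  printTree(nodes, "workflow-exit-handler-fail-925896592", 0) // Steps -> StepGroup -> Pod
//
// The onExit Steps node is not a child of the root, so the two trees are separate.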

juliev0 avatar Aug 25 '24 18:08 juliev0

The thing I'm not clear on is why this special logic for resetting the root node even exists at all. Do we know why it's not sufficient to just catch all errored GroupNodes here (including the root node)? Is it important to also reset a Succeeded root node, not just an Errored one?

juliev0 avatar Aug 25 '24 19:08 juliev0

And in your example what type is the Root node? I imagine there are many nodes in yours, right? Can you help me understand what the tree looks like for that?

The phase of the root node and the workflow is usually the same if there is no exit handler. The type of the root node is usually Steps or DAG, since a workflow usually has many steps or tasks. My example is special in that it breaks both of these rules:

  • the root node is of Pod type, because there is only one node.
  • the root node is Succeeded, but the workflow is Failed, because the exit handler failed.

The workflow is Failed, so I can retry it; the root node of Pod type is then reset from Succeeded to Running, which is unacceptable.

jswxstw avatar Aug 26 '24 03:08 jswxstw

The thing I'm not clear on is why this special logic for resetting the root node even exists at all. Do we know why it's not sufficient to just catch all errored GroupNodes here (including the root node)?

I didn't notice this before. I agree with you that this special logic for resetting the root node seems unnecessary now, since Steps belongs to the group node types. Something like the sketch below would catch the root node too.
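
A sketch of that simplification, assuming the shape of the node loop in FormulateRetryWorkflow: resetNode is passed in as a stand-in for the real reset helper, and isGroupNodeType is the illustrative helper from the sketch under Modifications, in the same illustrative package.

package util

import (
    wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

// resetFailedGroupNodes sketches the general branch under discussion: it
// resets every failed or errored group node, which would catch a group-type
// root node too, making the special root-node case unnecessary.
func resetFailedGroupNodes(nodes wfv1.Nodes, resetNode func(wfv1.NodeStatus) wfv1.NodeStatus) {
    for id, node := range nodes {
        if isGroupNodeType(node.Type) &&
            (node.Phase == wfv1.NodeFailed || node.Phase == wfv1.NodeError) {
            nodes[id] = resetNode(node)
        }
    }
}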

Is it important to also reset a Succeeded root node, not just an Errored one?

Perhaps to keep it consistent with the workflow phase?

jswxstw avatar Aug 26 '24 03:08 jswxstw

Hey @terrytangyuan - I know this code was written a couple years back but do you have any input for this question?

juliev0 avatar Aug 26 '24 04:08 juliev0

It should be fine to remove that special handling.

terrytangyuan avatar Aug 26 '24 16:08 terrytangyuan

It should be fine to remove that special handling.

Thanks for the quick response @terrytangyuan !

juliev0 avatar Aug 26 '24 17:08 juliev0