Workflow stuck in `Running` state after child pod `Succeeded`; parent `StepGroup` not updated.

Open: seanmfr opened this issue 3 months ago • 1 comment

Pre-requisites

  • [x] I have double-checked my configuration
  • [x] I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • [x] I have searched existing issues and could not find a match for this bug
  • [ ] I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

Bug Description: A running workflow has become permanently "stuck." A child node within a StepGroup has successfully completed its execution and is in the Succeeded phase. However, its parent StepGroup node remains in the Running phase with finishedAt: null.

This inconsistency in the status.nodes tree has halted the workflow, and the controller is not advancing to the subsequent steps defined in the template. The child pod that completed was the first step in its parent Steps template.

Stuck Node Details (from status.nodes):

The parent node ...-1564949926 is stuck Running, even though its only child pod ...-3969013248 has Succeeded.

Child Pod (Completed):

my-workflow-cron-xxxx-3969013248:
  boundaryID: my-workflow-cron-xxxx-3665623116
  displayName: wait-for-phase-outcome
  finishedAt: "2025-11-11T17:29:28Z"
  hostNodeName: gke-my-cluster-node-3b1f9489-2rzx
  id: my-workflow-cron-xxxx-3969013248
  name: my-workflow-cron-xxxx[0].mainline-release-channel[12].promote.promote-core-service[4].create-phases-from-waves[1].create-phase(0:checkAlerts:false,name:wave-176-0,validationWait:30m)[6].wait-for-phase-outcome[0].wait-for-phase-outcome
  outputs:
    exitCode: "0"
    parameters:
    - name: success
      value: "false"
    - name: failure
      value: "true"
    - name: report
      value: |
        ```2025-11-11T16:18:36.480815Z mainline-candidate-20251111-57d7...:
        Criteria - wave-176-0
        Failed - 3

        fail: timed out after 70 mins + fail: 3 projects failed upgrade

        Projects still require upgrading to this version - e.g. in another phase```
  phase: Succeeded
  progress: 1/1
  startedAt: "2025-11-11T16:18:40Z"
  templateName: wait-for-phase-outcome
  templateScope: namespaced/pubsub
  type: Pod

Parent StepGroup (Stuck):

my-workflow-cron-xxxx-1564949926:
  boundaryID: my-workflow-cron-xxxx-3665623116
  children:
  - my-workflow-cron-xxxx-3969013248
  displayName: '[0]'
  finishedAt: null
  id: my-workflow-cron-xxxx-1564949926
  name: my-workflow-cron-xxxx[0].mainline-release-channel[12].promote.promote-core-service[4].create-phases-from-waves[1].create-phase(0:checkAlerts:false,name:wave-176-0,validationWait:30m)[6].wait-for-phase-outcome[0]
  nodeFlag: {}
  phase: Running
  progress: 1/1
  startedAt: "2025-11-11T16:18:40Z"
  templateScope: namespaced/pubsub
  type: StepGroup
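
The inconsistency shown in these two nodes can be checked mechanically across the whole status.nodes map. The following is a minimal sketch, not part of the original report: the function name find_stuck_step_groups and the FULFILLED phase set are assumptions made for illustration, and the input is assumed to be the status.nodes map parsed from the workflow's JSON (for example from `kubectl get workflow <name> -n argo -o json`). A StepGroup can legitimately stay Running for a short time after its last child finishes, so matches are only candidates unless they persist across reconciliations.

```python
# Phases treated as "finished". This set is an assumption for illustration;
# adjust it if your Argo Workflows version reports different node phases.
FULFILLED = {"Succeeded", "Failed", "Error", "Skipped", "Omitted"}


def find_stuck_step_groups(nodes):
    """Return sorted IDs of Running StepGroup nodes whose children have all finished.

    `nodes` is the workflow's status.nodes map: node ID -> node status dict.
    """
    stuck = []
    for node_id, node in nodes.items():
        if node.get("type") != "StepGroup" or node.get("phase") != "Running":
            continue
        children = node.get("children") or []
        # A group with no recorded children is skipped: nothing to compare against.
        if children and all(
            nodes.get(child, {}).get("phase") in FULFILLED for child in children
        ):
            stuck.append(node_id)
    return sorted(stuck)


# The two nodes from this report, reduced to the fields the check needs.
example_nodes = {
    "my-workflow-cron-xxxx-3969013248": {
        "type": "Pod",
        "phase": "Succeeded",
    },
    "my-workflow-cron-xxxx-1564949926": {
        "type": "StepGroup",
        "phase": "Running",
        "children": ["my-workflow-cron-xxxx-3969013248"],
    },
}

print(find_stuck_step_groups(example_nodes))
# prints ['my-workflow-cron-xxxx-1564949926']
```

Running the same check a few minutes apart and comparing the results would help separate a genuinely stuck group from one that is merely waiting for the next reconciliation.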

Template Context: The completed pod (...-3969013248) is the first step ([0].wait-for-phase-outcome) inside a Steps template named handle-phase-outcome, which is in turn called by another Steps template (create-phase). Subsequent steps in the handle-phase-outcome template should read the pod's outputs.parameters (e.g., success: "false"), but those steps are never executed.

Attempts to Resolve (All Failed):

  1. Graceful Controller Restart:

    • kubectl rollout restart deployment/workflow-controller -n argo
    • Result: The new controller pod started but failed to reconcile the inconsistent state. The workflow remained stuck.
  2. argo stop:

    • argo stop my-workflow-cron-xxxx -n argo
    • Result: The command executed, but the workflow object's phase remained Running.
  3. argo retry:

    • argo retry my-workflow-cron-xxxx -n argo --node-field-selector id=...-3969013248
    • Result: Command failed with an error stating that nodes of a Running workflow cannot be retried.
  4. Patch Metadata (Force Reconcile):

    • kubectl patch workflow my-workflow-cron-xxxx -n argo --type merge -p '{"metadata":{"annotations":{"unstick-trigger":"1"}}}'
    • Result: The annotation was applied, but the controller did not correct the state.

Expected Behavior: The workflow should not hang. The workflow-controller (at the latest after a restart) should have detected that pod ...-3969013248 had Succeeded, updated the parent StepGroup ...-1564949926 to Succeeded, and then proceeded to the next steps in the handle-phase-outcome template.

Actual Behavior: The workflow phase remains Running, and the parent StepGroup phase also remains Running, effectively deadlocking the workflow. The controller appears unable to resolve this status.nodes inconsistency.

Testing on Latest: The bug is difficult to reproduce because it is not deterministic: it manifests sporadically during workflow execution in production under high load. For this reason it has not been validated on the :latest tag.

Version(s)

3.7.2

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: bug-repro-stuck-step-
spec:
  entrypoint: main-steps
  templates:
  - name: main-steps
    steps:
    # This first step's StepGroup is what gets stuck in the 'Running' phase
    # even after its pod (from 'long-running-pod' template) succeeds.
    - - name: step-one
        template: long-running-pod

    # This step is never scheduled if the bug occurs, because the
    # controller believes 'step-one' is still running.
    - - name: step-two
        template: next-step-that-never-runs

  # This template simulates the child pod that runs, succeeds,
  # and produces output.
  - name: long-running-pod
    container:
      image: alpine:latest # Publicly available image
      command: ["/bin/sh", "-c"]
      args:
        - |
          echo "Starting pod, will run for 60 seconds..."
          sleep 60
          echo "Pod finished. Creating output."
          mkdir -p /tmp/outputs
          echo -n 'completed' > /tmp/outputs/status
    outputs:
      parameters:
      - name: status
        valueFrom:
          path: /tmp/outputs/status

  # This template is for the step that should run after 'long-running-pod'
  - name: next-step-that-never-runs
    container:
      image: alpine:latest # Publicly available image
      command: ["echo", "--- WORKFLOW UNSTUCK --- This step (step-two) has run."]


This is an example workflow that mimics the behaviour we see in production.

Logs from the workflow controller

Logs are sanitised for this bug report.

level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""Task-result reconciliation"" namespace=argo numObjs=111 workflow=my-workflow-xxxx
level=info msg=""Processing workflow"" Phase=Running ResourceVersion=XXXXXXXXXXXXXXXXXXX namespace=argo workflow=my-workflow-xxxx
level=info msg=""Workflow step group node my-workflow-xxxx-[node-id-...] not yet completed"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""Workflow step group node my-workflow-xxxx-[node-id-...] not yet completed"" namespace=argo workflow=my-workflow-xxxx
level=warning msg=""Deadline exceeded"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""Task-result reconciliation"" namespace=argo numObjs=111 workflow=my-workflow-xxxx
level=info msg=""Processing workflow"" Phase=Running ResourceVersion=XXXXXXXXXXXXXXXXXXX namespace=argo workflow=my-workflow-xxxx
level=info msg=""Workflow step group node my-workflow-xxxx-[node-id-...] not yet completed"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""Workflow step group node my-workflow-xxxx-[node-id-...] not yet completed"" namespace=argo workflow=my-workflow-xxxx
level=warning msg=""Deadline exceeded"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""Task-result reconciliation"" namespace=argo numObjs=111 workflow=my-workflow-xxxx
level=info msg=""Processing workflow"" Phase=Running ResourceVersion=XXXXXXXXXXXXXXXXXXX namespace=argo workflow=my-workflow-xxxx
level=info msg=""Workflow step group node my-workflow-xxxx-[node-id-...] not yet completed"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""Workflow step group node my-workflow-xxxx-[node-id-...] not yet completed"" namespace=argo workflow=my-workflow-xxxx
level=warning msg=""Deadline exceeded"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
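
The excerpt above repeats the same reconciliation cycle, which is easier to see once the lines are tallied by message type. The following is a minimal sketch, not part of the original report: the names LINE_RE, classify and summarise are hypothetical helpers, and the parser assumes the key=value layout shown in this excerpt, including the doubled quotes around msg.

```python
import re
from collections import Counter

# Matches the level and msg fields; tolerates msg="..." as well as msg=""...""
# (the doubled quotes appear in the sanitised excerpt in this report).
LINE_RE = re.compile(r'level=(?P<level>\w+) msg="+(?P<msg>.*?)"+ namespace=')


def classify(msg):
    """Collapse messages that differ only by node ID into one bucket."""
    if msg.startswith("SG Outbound nodes of"):
        return "SG Outbound nodes"
    if msg.startswith("Workflow step group node") and msg.endswith("not yet completed"):
        return "step group not yet completed"
    return msg


def summarise(lines):
    """Count log lines per (level, message bucket); unparseable lines are ignored."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match["level"], classify(match["msg"]))] += 1
    return counts


# Lines copied from the excerpt in this report.
sample = [
    'level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx',
    'level=info msg=""Task-result reconciliation"" namespace=argo numObjs=111 workflow=my-workflow-xxxx',
    'level=info msg=""Workflow step group node my-workflow-xxxx-[node-id-...] not yet completed"" namespace=argo workflow=my-workflow-xxxx',
    'level=info msg=""Workflow step group node my-workflow-xxxx-[node-id-...] not yet completed"" namespace=argo workflow=my-workflow-xxxx',
    'level=warning msg=""Deadline exceeded"" namespace=argo workflow=my-workflow-xxxx',
]

for (level, bucket), count in summarise(sample).items():
    print(f"{count:4d}  {level:8s} {bucket}")
```

Applied to a longer controller log, a steady ratio of "step group not yet completed" and "Deadline exceeded" lines per "Processing workflow" line would suggest that each reconciliation pass ends the same way without advancing the workflow.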

Logs from your workflow's wait container

n/a

seanmfr, Nov 12 '25 13:11

Seeing a similar issue, specifically with Argo-managed StepGroups created by a retryStrategy. The behavior is identical: it is completely intermittent, and restarting the controller fails to reconcile the state.

juliajohannesen, Nov 24 '25 19:11