Workflow stuck in `Running` state after child pod `Succeeded`; parent `StepGroup` not updated.
Pre-requisites
- [x] I have double-checked my configuration
- [x] I have tested with the `:latest` image tag (i.e. `quay.io/argoproj/workflow-controller:latest`) and can confirm the issue still exists on `:latest`. If not, I have explained why, in detail, in my description below.
- [x] I have searched existing issues and could not find a match for this bug
- [ ] I'd like to contribute the fix myself (see contributing guide)
What happened? What did you expect to happen?
Bug Description:
A running workflow has become permanently "stuck." A child node within a StepGroup has successfully completed its execution and is in the Succeeded phase. However, its parent StepGroup node remains in the Running phase with finishedAt: null.
This inconsistency in the status.nodes tree has halted the workflow, and the controller is not advancing to the subsequent steps defined in the template. The child pod that completed was the first step in its parent Steps template.
Stuck Node Details (from status.nodes):
The parent node ...-1564949926 is stuck Running, even though its only child pod ...-3969013248 has Succeeded.
Child Pod (Completed):
my-workflow-cron-xxxx-3969013248:
  boundaryID: my-workflow-cron-xxxx-3665623116
  displayName: wait-for-phase-outcome
  finishedAt: "2025-11-11T17:29:28Z"
  hostNodeName: gke-my-cluster-node-3b1f9489-2rzx
  id: my-workflow-cron-xxxx-3969013248
  name: my-workflow-cron-xxxx[0].mainline-release-channel[12].promote.promote-core-service[4].create-phases-from-waves[1].create-phase(0:checkAlerts:false,name:wave-176-0,validationWait:30m)[6].wait-for-phase-outcome[0].wait-for-phase-outcome
  outputs:
    exitCode: "0"
    parameters:
      - name: success
        value: "false"
      - name: failure
        value: "true"
      - name: report
        value: |
          ```2025-11-11T16:18:36.480815Z mainline-candidate-20251111-57d7...:
          Criteria - wave-176-0
          Failed - 3
          fail: timed out after 70 mins + fail: 3 projects failed upgrade
          Projects still require upgrading to this version - e.g. in another phase```
  phase: Succeeded
  progress: 1/1
  startedAt: "2025-11-11T16:18:40Z"
  templateName: wait-for-phase-outcome
  templateScope: namespaced/pubsub
  type: Pod
Parent StepGroup (Stuck):
my-workflow-cron-xxxx-1564949926:
  boundaryID: my-workflow-cron-xxxx-3665623116
  children:
    - my-workflow-cron-xxxx-3969013248
  displayName: '[0]'
  finishedAt: null
  id: my-workflow-cron-xxxx-1564949926
  name: my-workflow-cron-xxxx[0].mainline-release-channel[12].promote.promote-core-service[4].create-phases-from-waves[1].create-phase(0:checkAlerts:false,name:wave-176-0,validationWait:30m)[6].wait-for-phase-outcome[0]
  nodeFlag: {}
  phase: Running
  progress: 1/1
  startedAt: "2025-11-11T16:18:40Z"
  templateScope: namespaced/pubsub
  type: StepGroup
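To check whether other nodes in the same workflow are in the same state, something like the following rough sketch could be run against the `status.nodes` map. It is not controller code: the function name `find_stuck_step_groups` is hypothetical, the set of terminal phases reflects my understanding of Argo's node phases, and it assumes `status.nodes` is populated in the workflow object (i.e. node status is not offloaded).

```python
# Phases assumed to be terminal for a node (an assumption, not taken from controller source).
FULFILLED = {"Succeeded", "Failed", "Error", "Skipped", "Omitted"}


def find_stuck_step_groups(nodes):
    """Return IDs of Running StepGroup nodes whose children have all reached a terminal phase."""
    stuck = []
    for node_id, node in nodes.items():
        if node.get("type") != "StepGroup" or node.get("phase") != "Running":
            continue
        children = node.get("children") or []
        # A StepGroup with no children yet is still being expanded, so skip it.
        if children and all(
            nodes.get(child, {}).get("phase") in FULFILLED for child in children
        ):
            stuck.append(node_id)
    return sorted(stuck)


# Trimmed copy of the two nodes shown above.
nodes = {
    "my-workflow-cron-xxxx-3969013248": {"type": "Pod", "phase": "Succeeded"},
    "my-workflow-cron-xxxx-1564949926": {
        "type": "StepGroup",
        "phase": "Running",
        "children": ["my-workflow-cron-xxxx-3969013248"],
    },
}

print(find_stuck_step_groups(nodes))
```

For a live workflow, `nodes` would be the `status.nodes` map from `kubectl get workflow my-workflow-cron-xxxx -n argo -o json`.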
Template Context:
The completed pod (...-3969013248) is the first step ([0].wait-for-phase-outcome) inside a Steps template named handle-phase-outcome, which is in turn called by another Steps template (create-phase). The pod's outputs.parameters (e.g., success: "false") should be read by the subsequent steps in the handle-phase-outcome template, but those steps are never executed.
Attempts to Resolve (All Failed):
- Graceful Controller Restart:
  - `kubectl rollout restart deployment/workflow-controller -n argo`
  - Result: The new controller pod started but failed to reconcile the inconsistent state. The workflow remained stuck.
- argo stop:
  - `argo stop my-workflow-cron-xxxx -n argo`
  - Result: The command executed, but the workflow object's `phase` remained `Running`.
- argo retry:
  - `argo retry my-workflow-cron-xxxx -n argo --node-field-selector id=...-3969013248`
  - Result: Command failed with an error stating that nodes of a `Running` workflow cannot be retried.
- Patch Metadata (Force Reconcile):
  - `kubectl patch workflow my-workflow-cron-xxxx -n argo --type merge -p '{"metadata":{"annotations":{"unstick-trigger":"1"}}}'`
  - Result: The annotation was applied, but the controller did not correct the state.
Expected Behavior:
The workflow should not hang at all. In addition, the workflow-controller (especially after a restart) should have detected that pod ...-3969013248 had Succeeded, updated the parent StepGroup ...-1564949926 to Succeeded, and then proceeded to the next steps in the handle-phase-outcome template.
Actual Behavior:
The workflow phase remains Running, and the parent StepGroup phase also remains Running, effectively deadlocking the workflow. The controller appears unable to resolve this status.nodes inconsistency.
Testing on Latest: Reproducing the bug is difficult because it is not deterministic. It manifests sporadically during workflow execution in production under high load. For this reason, it has not been validated on the latest tag.
Version(s)
3.7.2
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: bug-repro-stuck-step-
spec:
  entrypoint: main-steps
  templates:
    - name: main-steps
      steps:
        # This first step's StepGroup is what gets stuck in the 'Running' phase
        # even after its pod (from 'long-running-pod' template) succeeds.
        - - name: step-one
            template: long-running-pod
        # This step is never scheduled if the bug occurs, because the
        # controller believes 'step-one' is still running.
        - - name: step-two
            template: next-step-that-never-runs
    # This template simulates the child pod that runs, succeeds,
    # and produces output.
    - name: long-running-pod
      container:
        image: alpine:latest # Publicly available image
        command: ["/bin/sh", "-c"]
        args:
          - |
            echo "Starting pod, will run for 60 seconds..."
            sleep 60
            echo "Pod finished. Creating output."
            mkdir -p /tmp/outputs
            echo -n 'completed' > /tmp/outputs/status
      outputs:
        parameters:
          - name: status
            valueFrom:
              path: /tmp/outputs/status
    # This template is for the step that should run after 'long-running-pod'
    - name: next-step-that-never-runs
      container:
        image: alpine:latest # Publicly available image
        command: ["echo", "--- WORKFLOW UNSTUCK --- This step (step-two) has run."]
This is an example workflow that mimics the behaviour we see in production.
Logs from the workflow controller
Logs are sanitised for this bug report.
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""Task-result reconciliation"" namespace=argo numObjs=111 workflow=my-workflow-xxxx
level=info msg=""Processing workflow"" Phase=Running ResourceVersion=XXXXXXXXXXXXXXXXXXX namespace=argo workflow=my-workflow-xxxx
level=info msg=""Workflow step group node my-workflow-xxxx-[node-id-...] not yet completed"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""Workflow step group node my-workflow-xxxx-[node-id-...] not yet completed"" namespace=argo workflow=my-workflow-xxxx
level=warning msg=""Deadline exceeded"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""Task-result reconciliation"" namespace=argo numObjs=111 workflow=my-workflow-xxxx
level=info msg=""Processing workflow"" Phase=Running ResourceVersion=XXXXXXXXXXXXXXXXXXX namespace=argo workflow=my-workflow-xxxx
level=info msg=""Workflow step group node my-workflow-xxxx-[node-id-...] not yet completed"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""Workflow step group node my-workflow-xxxx-[node-id-...] not yet completed"" namespace=argo workflow=my-workflow-xxxx
level=warning msg=""Deadline exceeded"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...] my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""Task-result reconciliation"" namespace=argo numObjs=111 workflow=my-workflow-xxxx
level=info msg=""Processing workflow"" Phase=Running ResourceVersion=XXXXXXXXXXXXXXXXXXX namespace=argo workflow=my-workflow-xxxx
level=info msg=""Workflow step group node my-workflow-xxxx-[node-id-...] not yet completed"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""Workflow step group node my-workflow-xxxx-[node-id-...] not yet completed"" namespace=argo workflow=my-workflow-xxxx
level=warning msg=""Deadline exceeded"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
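Because the excerpt above is long and repetitive, a small tally can make the loop easier to see: each reconcile pass logs "Task-result reconciliation" and "Processing workflow", then two "not yet completed" lines and a "Deadline exceeded" warning. The sketch below is only an illustration. It assumes the sanitised format shown above (doubled quotes around `msg`, node IDs replaced by `my-workflow-xxxx-[node-id-...]`), and the names `summarise` and `SAMPLE` are hypothetical.

```python
import re
from collections import Counter

# Matches the level and message; tolerates the doubled quotes in the sanitised logs.
LINE_RE = re.compile(r'level=(\w+) msg="+(.*?)"+(?= \w+=|$)')
# The placeholder used for node IDs in the sanitised logs.
NODE_RE = re.compile(r'my-workflow-xxxx-\[node-id-\.\.\.\]')
# A bracketed, space-separated list of node placeholders.
LIST_RE = re.compile(r'\[<node>(?: <node>)*\]')


def summarise(log_text):
    """Count log lines per (level, normalised message)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        level, msg = match.groups()
        msg = NODE_RE.sub("<node>", msg)
        msg = LIST_RE.sub("[<nodes>]", msg)
        counts[(level, msg)] += 1
    return counts


# A few lines copied from the excerpt above.
SAMPLE = r'''
level=info msg=""SG Outbound nodes of my-workflow-xxxx-[node-id-...] are [my-workflow-xxxx-[node-id-...]]"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""Task-result reconciliation"" namespace=argo numObjs=111 workflow=my-workflow-xxxx
level=info msg=""Processing workflow"" Phase=Running ResourceVersion=XXXXXXXXXXXXXXXXXXX namespace=argo workflow=my-workflow-xxxx
level=info msg=""Workflow step group node my-workflow-xxxx-[node-id-...] not yet completed"" namespace=argo workflow=my-workflow-xxxx
level=info msg=""Workflow step group node my-workflow-xxxx-[node-id-...] not yet completed"" namespace=argo workflow=my-workflow-xxxx
level=warning msg=""Deadline exceeded"" namespace=argo workflow=my-workflow-xxxx
'''

for (level, msg), count in summarise(SAMPLE).most_common():
    print(f"{count:3d}  {level:7s}  {msg}")
```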
Logs from your workflow's wait container
n/a
I am seeing a similar issue, specifically with Argo-managed StepGroups created by a `retryStrategy`. The behavior is identical: it is completely intermittent, and restarting the controller fails to reconcile the state.