workflow stuck in `Running` state, but its only pod is `Completed`
Pre-requisites
- [X] I have double-checked my configuration
- [X] I can confirm the issue exists when I tested with :latest
- [ ] I'd like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
The workflow has been running for more than 20 hours even though I have activeDeadlineSeconds set at 12 hours.
The workflow has just a single step, which also shows as 'Running' in the Argo UI. However, its logs show that it has completed the code I expect for that step, and the main log ends with:
time="2023-10-29T00:53:09 UTC" level=info msg="sub-process exited" argo=true error="<nil>"
The pod itself for that step is in Completed state.
There are other workflows that have completed as expected during this time, and no other workflows are running right now. Note this exact workflow has successfully run thousands of times in the past, so I know my spec/permissions are correct.
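For context, the deadline is set at the workflow level, roughly like this (a minimal sketch; the actual line is omitted from the redacted spec below):

workflowSpec:
  activeDeadlineSeconds: 43200  # 12 hours; the controller should fail the workflow once this elapses
  entrypoint: main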
Version
3.4.11
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: mywf
  namespace: auth
spec:
  schedule: "23,53 * * * *"
  timezone: "UTC"
  concurrencyPolicy: "Forbid"
  startingDeadlineSeconds: 0
  workflowSpec:
    entrypoint: main
    priority: 100
    serviceAccountName: argo-workflows-server
    onExit: exit-handler
    templates:
      - name: main
        inputs:
          artifacts:
            - name: watchdog
              path: /s.sh
              raw:
                data: |
                  #echo bla...redact
        volumes:
          - name: aws-iam-token
            projected:
              defaultMode: 420
              sources:
                - serviceAccountToken:
                    audience: sts.amazonaws.com
                    expirationSeconds: 86400
                    path: token
        metadata:
          annotations:
            cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        container:
          image: python:3.10 #redacted real one
          command: [bash, -c]
          args: ["bash /s.sh; RC=$?;if [ $RC -ne 0 ]; then exit $RC; fi;"]
          envFrom:
            - configMapRef:
                name: approle-config
          env:
            - name: AWS_WEB_IDENTITY_TOKEN_FILE
              value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
          volumeMounts:
            - name: aws-iam-token
              mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      - name: exit-handler
        steps:
          - - name: notifyError
              template: sendmail
              arguments:
                parameters:
                  - name: body
                    value: ""
                  - name: subject
                    value: "mywf Failed"
                  - name: recipients
                    value: "redact"
                  - name: workflowname
                    value: "{{workflow.name}}"
              when: "{{workflow.status}} != Succeeded"
      - name: sendmail
        inputs:
          parameters:
            - name: body
            - name: subject
            - name: recipients
            - name: workflowname
        outputs: {}
        metadata: {}
        script:
          name: ''
          image: python:3.10 #redacted real one
          command:
            - python
          env:
            - name: AWS_WEB_IDENTITY_TOKEN_FILE
              value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
          resources: {}
          volumeMounts:
            - name: aws-iam-token
              mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
          source: |
            #redact
        volumes:
          - name: aws-iam-token
            projected:
              sources:
                - serviceAccountToken:
                    audience: sts.amazonaws.com
                    expirationSeconds: 86400
                    path: token
              defaultMode: 420
Logs from the workflow controller
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
{"time":"2023-10-29T00:53:00.052668744Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:00.052Z\" level=info msg=\"Processing workflow\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:00.065094164Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:00.064Z\" level=info msg=\"Updated phase -> Running\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:00.378202815Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:00.378Z\" level=warning msg=\"Node was nil, will be initialized as type Skipped\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:00.379556102Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:00.379Z\" level=info msg=\"was unable to obtain node for , letting display name to be nodeName\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:00.379567612Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:00.379Z\" level=info msg=\"Retry node mywf-1698540780 initialized Running\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:00.380135805Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:00.380Z\" level=info msg=\"was unable to obtain node for , letting display name to be nodeName\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:00.380143135Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:00.380Z\" level=info msg=\"Pod node mywf-1698540780-3299532583 initialized Pending\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:00.380166815Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:00.380Z\" level=error msg=\"was unable to obtain node for \" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:00.426801702Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:00.426Z\" level=info msg=\"Created pod: mywf-1698540780(0) (mywf-1698540780-main-3299532583)\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:00.426863752Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:00.426Z\" level=info msg=\"TaskSet Reconciliation\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:00.426870253Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:00.426Z\" level=info msg=reconcileAgentPod namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:00.452249327Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:00.452Z\" level=info msg=\"Workflow update successful\" namespace=auth phase=Running resourceVersion=93158499 workflow=mywf-1698540780
{"time":"2023-10-29T00:53:10.053730206Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:10.053Z\" level=info msg=\"Processing workflow\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:10.056058783Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:10.055Z\" level=info msg=\"Task-result reconciliation\" namespace=auth numObjs=0 workflow=mywf-1698540780
{"time":"2023-10-29T00:53:10.056253675Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:10.056Z\" level=info msg=\"node changed\" namespace=auth new.message= new.phase=Running new.progress=0/1 nodeID=mywf-1698540780-3299532583 old.message= old.phase=Pending old.progress=0/1 workflow=mywf-1698540780
{"time":"2023-10-29T00:53:10.057704775Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:10.057Z\" level=info msg=\"node mywf-1698540780 message: retryStrategy.expression evaluated to false\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:10.057733486Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:10.057Z\" level=error msg=\"was unable to obtain node for \" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:10.057741936Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:10.057Z\" level=info msg=\"TaskSet Reconciliation\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:10.057748466Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:10.057Z\" level=info msg=reconcileAgentPod namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:10.062072958Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:10.061Z\" level=info msg=\"cleaning up pod\" action=terminateContainers key=auth/mywf-1698540780-main-3299532583/terminateContainers
{"time":"2023-10-29T00:53:10.062458361Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:10.062Z\" level=info msg=\"https://172.20.0.1:443/api/v1/namespaces/auth/pods/mywf-1698540780-main-3299532583/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=wait&stderr=true&stdout=true&tty=false\"
{"time":"2023-10-29T00:53:10.071843271Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:10.071Z\" level=info msg=\"Workflow update successful\" namespace=auth phase=Running resourceVersion=93158593 workflow=mywf-1698540780
{"time":"2023-10-29T00:53:10.357968789Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:10.357Z\" level=info msg=\"signaled container\" container=wait error=\"command terminated with exit code 137\" namespace=auth pod=mywf-1698540780-main-3299532583 stderr=\"<nil>\" stdout=\"<nil>\"
{"time":"2023-10-29T00:53:20.285845171Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.285Z\" level=info msg=\"Processing workflow\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.288676072Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.288Z\" level=info msg=\"Task-result reconciliation\" namespace=auth numObjs=1 workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.288701152Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.288Z\" level=info msg=\"task-result changed\" namespace=auth nodeID=mywf-1698540780-3299532583 workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.288863694Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.288Z\" level=info msg=\"node changed\" namespace=auth new.message= new.phase=Succeeded new.progress=0/1 nodeID=mywf-1698540780-3299532583 old.message= old.phase=Running old.progress=0/1 workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.290580037Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.290Z\" level=info msg=\"node mywf-1698540780 phase Running -> Succeeded\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.290606277Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.290Z\" level=info msg=\"node mywf-1698540780 finished: 2023-10-29 00:53:20.290503226 +0000 UTC\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.290612707Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.290Z\" level=info msg=\"TaskSet Reconciliation\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.290616927Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.290Z\" level=info msg=reconcileAgentPod namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.290656267Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.290Z\" level=info msg=\"Running OnExit handler: exit-handler\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.290670107Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.290Z\" level=warning msg=\"Node was nil, will be initialized as type Skipped\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.425590775Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.425Z\" level=info msg=\"was unable to obtain node for , letting display name to be nodeName\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.425597845Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.425Z\" level=info msg=\"Retry node mywf-1698540780-3480973695 initialized Running\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.425894928Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.425Z\" level=info msg=\"was unable to obtain node for , letting display name to be nodeName\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.425900348Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.425Z\" level=info msg=\"Steps node mywf-1698540780-4129110530 initialized Running\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.425941598Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.425Z\" level=info msg=\"StepGroup node mywf-1698540780-528111236 initialized Running\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.426123869Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.426Z\" level=info msg=\"Skipping mywf-1698540780.onExit(0)[0].notifyError: when 'Succeeded != Succeeded' evaluated false\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.42614727Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.426Z\" level=info msg=\"Skipped node mywf-1698540780-1020989075 initialized Skipped (message: when 'Succeeded != Succeeded' evaluated false)\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.42616311Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.426Z\" level=info msg=\"Step group node mywf-1698540780-528111236 successful\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.42617424Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.426Z\" level=info msg=\"node mywf-1698540780-528111236 phase Running -> Succeeded\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.42624667Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.426Z\" level=info msg=\"node mywf-1698540780-528111236 finished: 2023-10-29 00:53:20.42614426 +0000 UTC\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.426273071Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.426Z\" level=info msg=\"Outbound nodes of mywf-1698540780-1020989075 is [mywf-1698540780-1020989075]\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.426285391Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.426Z\" level=info msg=\"Outbound nodes of mywf-1698540780-4129110530 is [mywf-1698540780-1020989075]\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.426295761Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.426Z\" level=info msg=\"node mywf-1698540780-4129110530 phase Running -> Succeeded\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.426305471Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.426Z\" level=info msg=\"node mywf-1698540780-4129110530 finished: 2023-10-29 00:53:20.42624932 +0000 UTC\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.427278588Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.427Z\" level=info msg=\"node mywf-1698540780-3480973695 phase Running -> Succeeded\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.427298888Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.427Z\" level=info msg=\"node mywf-1698540780-3480973695 message: retryStrategy.expression evaluated to false\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.427305358Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.427Z\" level=info msg=\"node mywf-1698540780-3480973695 finished: 2023-10-29 00:53:20.427221048 +0000 UTC\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.427334099Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.427Z\" level=info msg=\"Updated phase Running -> Succeeded\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.427365589Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.427Z\" level=info msg=\"Marking workflow completed\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.427425319Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.427Z\" level=info msg=\"Marking workflow as pending archiving\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.43283518Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.432Z\" level=info msg=\"cleaning up pod\" action=deletePod key=auth/mywf-1698540780-1340600742-agent/deletePod
{"time":"2023-10-29T00:53:20.440336606Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.440Z\" level=warning msg=\"Error updating workflow: Unauthorized Unauthorized\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.440382846Z","stream":"stderr","_p":"F","log":"E1029 00:53:20.440195 1 event.go:267] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"mywf-1698540780.17926ddbd5f9adba\", GenerateName:\"\", Namespace:\"auth\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ZZZ_DeprecatedClusterName:\"\", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:\"Workflow\", Namespace:\"auth\", Name:\"mywf-1698540780\", UID:\"58fcb020-71bb-48ae-a170-add3f2ad283e\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"93158593\", FieldPath:\"\"}, Reason:\"WorkflowSucceeded\", Message:\"Workflow completed\", Source:v1.EventSource{Component:\"workflow-controller\", Host:\"\"}, FirstTimestamp:time.Date(2023, time.October, 29, 0, 53, 20, 427273658, time.Local), LastTimestamp:time.Date(2023, time.October, 29, 0, 53, 20, 427273658, time.Local), Count:1, Type:\"Normal\", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:\"\", Related:(*v1.ObjectReference)(nil), ReportingController:\"\", ReportingInstance:\"\"}': 'Unauthorized' (will not retry!)
{"time":"2023-10-29T00:53:20.441209382Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.441Z\" level=warning msg=\"failed to clean-up pod\" action=deletePod error=Unauthorized key=auth/mywf-1698540780-1340600742-agent/deletePod
{"time":"2023-10-29T00:53:40.359148369Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:40.358Z\" level=info msg=\"cleaning up pod\" action=killContainers key=auth/mywf-1698540780-main-3299532583/killContainers
{"time":"2023-10-29T00:56:00.886969281Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:56:00.886Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T01:16:00.887399312Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T01:16:00.887Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T01:36:00.888193298Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T01:36:00.888Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T01:56:00.888829659Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T01:56:00.888Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T02:36:00.890695997Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T02:36:00.890Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T02:56:00.89116664Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T02:56:00.890Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T03:16:00.891733664Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T03:16:00.891Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T03:36:00.892377753Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T03:36:00.892Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T03:56:00.89365189Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T03:56:00.893Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T04:16:00.894685742Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T04:16:00.894Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T04:36:00.894854677Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T04:36:00.894Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T04:56:00.89594993Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T04:56:00.895Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T05:16:00.896569173Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T05:16:00.896Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T05:36:00.897468251Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T05:36:00.897Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T05:56:00.897487112Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T05:56:00.897Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T06:16:00.897956601Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T06:16:00.897Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T06:36:00.899615191Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T06:36:00.899Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T06:56:00.900653158Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T06:56:00.900Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T07:16:00.901689127Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T07:16:00.901Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T07:36:00.901062144Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T07:36:00.900Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T07:56:00.90163483Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T07:56:00.901Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T08:16:00.902643172Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T08:16:00.902Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T08:36:00.902640524Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T08:36:00.902Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T08:56:00.902906351Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T08:56:00.902Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T09:16:00.902743721Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T09:16:00.902Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T09:36:00.903334761Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T09:36:00.903Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T09:56:00.904282091Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T09:56:00.904Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T10:16:00.904660693Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T10:16:00.904Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T10:36:00.9054561Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T10:36:00.905Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T10:56:00.906083764Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T10:56:00.905Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T11:36:00.907986689Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T11:36:00.907Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T11:56:00.908159738Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T11:56:00.907Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T12:16:00.909456184Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T12:16:00.909Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T12:36:00.909439313Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T12:36:00.909Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T12:56:00.91060522Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T12:56:00.910Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T13:16:00.911304788Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T13:16:00.911Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T13:36:00.911790482Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T13:36:00.911Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T13:56:00.912257921Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T13:56:00.912Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T14:16:00.913294595Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T14:16:00.913Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T14:36:00.914315405Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T14:36:00.914Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T14:56:00.91418568Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T14:56:00.914Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T15:16:00.915084611Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T15:16:00.914Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T15:36:00.915637205Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T15:36:00.915Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T15:56:00.915322853Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T15:56:00.915Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T16:16:00.916672011Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T16:16:00.916Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T16:36:00.916635785Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T16:36:00.916Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T16:56:00.917995237Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T16:56:00.917Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T17:16:00.918616725Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T17:16:00.918Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T17:36:00.919663016Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T17:36:00.919Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T17:56:00.920609693Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T17:56:00.920Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T18:16:00.92075436Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T18:16:00.920Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T18:36:00.921621236Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T18:36:00.921Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T18:56:00.922215871Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T18:56:00.922Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T19:16:00.922594784Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T19:16:00.922Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T19:36:00.923035111Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T19:36:00.922Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T19:56:00.924160017Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T19:56:00.923Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T20:16:00.924556526Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T20:16:00.924Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T20:36:00.925492008Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T20:36:00.925Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T20:56:00.9260603Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T20:56:00.925Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T21:16:00.926241513Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T21:16:00.926Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
{"time":"2023-10-29T21:36:00.927554494Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T21:36:00.927Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
Logs from in your workflow's wait container
kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
time="2023-10-29T00:53:10.087Z" level=info msg="No output parameters"
time="2023-10-29T00:53:10.087Z" level=info msg="No output artifacts"
time="2023-10-29T00:53:10.087Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: argo_wf_logs/2023/10/29/00/53/mywf-1698540780/mywf-1698540780-main-3299532583/main.log"
time="2023-10-29T00:53:10.087Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-10-29T00:53:10.137Z" level=info msg="Saving file to s3" bucket=redactbuc endpoint=s3.amazonaws.com key=argo_wf_logs/2023/10/29/00/53/mywf-1698540780/mywf-1698540780-main-3299532583/main.log path=/tmp/argo/outputs/logs/main.log
time="2023-10-29T00:53:10.269Z" level=info msg="Save artifact" artifactName=main-logs duration=181.217406ms error="<nil>" key=argo_wf_logs/2023/10/29/00/53/mywf-1698540780/mywf-1698540780-main-3299532583/main.log
time="2023-10-29T00:53:10.269Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2023-10-29T00:53:10.269Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2023-10-29T00:53:10.285Z" level=info msg="Alloc=9408 TotalAlloc=18061 Sys=43645 NumGC=5 Goroutines=12"
time="2023-10-29T00:53:10.286Z" level=info msg="Deadline monitor stopped"
the pod itself for that step is in Completed state.
So, to summarize, the Pod is "Completed" but the Step and the Workflow are both still showing as "Running", correct?
I'm imagining that the Controller is failing to process it (especially as it has surpassed the activeDeadlineSeconds) or the Executor isn't reporting it correctly.
Since it happens very infrequently, this sounds like a very rare race condition.
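One way to confirm the mismatch when it happens (a sketch using plain kubectl; the jsonpath assumes the usual Workflow status layout):

# node phases as recorded in the Workflow status
kubectl get wf -n auth mywf-1698540780 -o jsonpath='{range .status.nodes.*}{.displayName}{"\t"}{.phase}{"\n"}{end}'
# actual pod phase, for comparison
kubectl get pod -n auth -l workflows.argoproj.io/workflow=mywf-1698540780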
Correct. There is something in the workflow controller logs below that caught my eye and makes me think it's missing retry logic when receiving a transient error from the k8s control plane:
{"time":"2023-10-29T00:53:20.427334099Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.427Z\" level=info msg=\"Updated phase Running -> Succeeded\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.427365589Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.427Z\" level=info msg=\"Marking workflow completed\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.427425319Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.427Z\" level=info msg=\"Marking workflow as pending archiving\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.43283518Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.432Z\" level=info msg=\"cleaning up pod\" action=deletePod key=auth/mywf-1698540780-1340600742-agent/deletePod
{"time":"2023-10-29T00:53:20.440336606Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.440Z\" level=warning msg=\"Error updating workflow: Unauthorized Unauthorized\" namespace=auth workflow=mywf-1698540780
{"time":"2023-10-29T00:53:20.440382846Z","stream":"stderr","_p":"F","log":"E1029 00:53:20.440195 1 event.go:267] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"mywf-1698540780.17926ddbd5f9adba\", GenerateName:\"\", Namespace:\"auth\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ZZZ_DeprecatedClusterName:\"\", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:\"Workflow\", Namespace:\"auth\", Name:\"mywf-1698540780\", UID:\"58fcb020-71bb-48ae-a170-add3f2ad283e\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"93158593\", FieldPath:\"\"}, Reason:\"WorkflowSucceeded\", Message:\"Workflow completed\", Source:v1.EventSource{Component:\"workflow-controller\", Host:\"\"}, FirstTimestamp:time.Date(2023, time.October, 29, 0, 53, 20, 427273658, time.Local), LastTimestamp:time.Date(2023, time.October, 29, 0, 53, 20, 427273658, time.Local), Count:1, Type:\"Normal\", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:\"\", Related:(*v1.ObjectReference)(nil), ReportingController:\"\", ReportingInstance:\"\"}': 'Unauthorized' (will not retry!)
{"time":"2023-10-29T00:53:20.441209382Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:20.441Z\" level=warning msg=\"failed to clean-up pod\" action=deletePod error=Unauthorized key=auth/mywf-1698540780-1340600742-agent/deletePod
{"time":"2023-10-29T00:53:40.359148369Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:53:40.358Z\" level=info msg=\"cleaning up pod\" action=killContainers key=auth/mywf-1698540780-main-3299532583/killContainers
We had this happen on 8 jobs yesterday. We noticed that for each one of those:
- status.conditions[].PodRunning was set to False but status.conditions[].Completed was missing
- workflows.argoproj.io/completed was "false"
- finishedAt was set to null
- The Workflow's phase is marked as Succeeded
- The Workflow's progress is however showing 51/52
- taskResultsCompletionStatus of the specific task was set to False
- The WorkflowTaskResult's workflows.argoproj.io/report-outputs-completed was also "false"
- The failed node's outputs.exitCode was "0"
- Its phase was "Error"
- Its progress was "0/1"
- Its message was: OOMKilled (exit code 137)
The problem is exacerbated by the Deadline, and it snowballs on following jobs as they get stuck in Pending with the message "Workflow processing has been postponed because too many workflows are already running".
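A rough way to spot affected Workflows from the fields above (a sketch assuming jq is available and that taskResultsCompletionStatus is populated, which is not the case on older controllers):

kubectl get wf -n auth -o json \
  | jq -r '.items[]
      | select(.metadata.labels["workflows.argoproj.io/completed"] == "false")
      | select((.status.taskResultsCompletionStatus // {}) | to_entries | any(.value == false))
      | .metadata.name'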
Same problem in 3.5.2: the pod is Completed but the workflow is Running, or even doesn't display a status at all (the wait container log seems fine). Any suggestions?
@ZeidAqemia @zqhi71 do your controller logs have any errors like the ones in https://github.com/argoproj/argo-workflows/issues/12103#issuecomment-1784297997 ? What % of workflows are affected? What version are you running?
for me on v3.4.11 this affects less than 0.01% of workflows
v3.5.2 here and it's 100% of workflows where one task OOMs
Hi guys. We found that the default controller settings are not suitable for thousands of CronWorkflows. When we adjusted --cronworkflow-workers, qps, and burst, the CronWorkflows worked fine. If someone has the same problem, adjusting the settings following the scaling documentation (https://argo-workflows.readthedocs.io/en/latest/scaling/) may help.
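For anyone else tuning this, the knobs are flags on the workflow-controller Deployment; a sketch of the args section we adjusted (flag names as in the scaling docs; values are illustrative, not recommendations):

containers:
  - name: workflow-controller
    args:
      - --cronworkflow-workers=32   # more workers for thousands of CronWorkflows
      - --qps=50                    # client-side rate limit towards the API server
      - --burst=100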
This seems related to Hang on "Workflow processing has been postponed due to max parallelism limit" #11808
I'm seeing the same issue when using namespaceParallelism
Hi @tooptoop4 Were you able to make any progress on this issue, or is it still reoccurring?
It still reoccurs, but rarely: roughly 1 in 20,000 workflow runs. Did you see the log in https://github.com/argoproj/argo-workflows/issues/12103#issuecomment-1784297997 ?
Yes, but in my case we are facing the same stuck-in-Running behavior without seeing any server-issue-related message in the logs 😭
I wonder if https://github.com/argoproj/argo-workflows/pull/12233 helps @tczhao
Hi @tooptoop4, we tried https://github.com/argoproj/argo-workflows/pull/12233 but it doesn't help in my case. I highly suspect the root cause in my case is https://github.com/argoproj/argo-workflows/issues/12997; we typically observe a child node missing when a workflow is stuck in Running in my situation. I should have a PR ready in a few days with more explanation.
We also observed this in a small portion of our workflows.
The workflows all have 5 tasks defined in their spec. On the ones that are stuck running, there are suddenly 6 or 7 workflow results in the status.
We also did a log search. The 5 tasks that are reported as complete are found in the controller logs. For the 2 additional tasks that are not finished, there is not a single log line. There is also no trace of them in the UI, or in any other status field.
We are using version v3.5.6.
any idea? @jswxstw @shuangkun
After reading the background of the issue, it seems that your situation is different from the others (not similar to #12993).
I see a lot of logs like: {"time":"2023-10-29T00:56:00.886969281Z","stream":"stderr","_p":"F","log":"time=\"2023-10-29T00:56:00.886Z\" level=info msg=\"Workflow processing has been postponed due to max parallelism limit\" key=auth/mywf-1698540780
Is the workflow that is stuck in the Running state failing to be processed because of it?
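(For reference, that limit comes from the controller ConfigMap; a minimal sketch of the relevant keys, with illustrative values only:)

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  parallelism: "50"            # max workflows executing cluster-wide
  namespaceParallelism: "10"   # max workflows executing per namespace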
We're also seeing this issue, only in 3.5.x versions. I initially tried to upgrade and saw this issue on an earlier 3.5.x. It's been a month or so, so I tried again with 3.5.8, and I'm still seeing the issue. This happens with any workflow I try to run - steps, dags, containers - invoked both from workflow templates and from crons (although I doubt that matters).
@sstaley-hioscar, could you verify that the wait container logs from one of these runs confirm it is using WorkflowTaskResults, and that the controller itself has appropriate RBAC to read the WorkflowTaskResults, since you've said you have custom RBAC.
@Joibel
here are some logs from the wait container:
time="2024-06-22T17:12:38.814Z" level=info msg="Starting Workflow Executor" version=v3.4.8
time="2024-06-22T17:12:38.819Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
...
time="2024-06-22T17:12:40.821Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
It looks like it's using the wrong version. I'll look into that.
@Joibel It looks like that was the issue. I can delete some of those example templates to prevent cluttering up this thread, if you like.
It looks like I'm running into a new error though. Argo Workflows is now attempting to use the service account of the namespace the workflow is running in to patch pods:
\"system:serviceaccount:monitoring:default\" cannot patch resource \"pods\" in API group \"\" in the namespace \"monitoring\""
which wasn't what our RBAC was set up for in the previous version. Is this expected new behavior, or is there a configuration I need to set to make the controller use its own token for these API calls?
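For reference, the Role I'd expect to need in that namespace looks roughly like this (a sketch only; recent docs list just the workflowtaskresults rule as required, so the pods patch rule below reflects the error above rather than a documented requirement):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argo-executor
  namespace: monitoring
rules:
  # 3.4+ executors report outputs via WorkflowTaskResults
  - apiGroups: ["argoproj.io"]
    resources: ["workflowtaskresults"]
    verbs: ["create", "patch"]
  # only if the executor really does patch its own pod, per the error message above
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "patch"]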
It looks like it's using the wrong version. I'll look into that.
@sstaley-hioscar This is the root cause, because the wait container at version v3.4.8 does not write the LabelKeyReportOutputsCompleted label on the WorkflowTaskResult.
When upgrading from version 3.5.1 or below directly to version 3.5.5 or above, if there are running workflows in the cluster, these workflows will get stuck in Running even though their pods are Completed.
If you're upgrading from a version which does not record TaskResultCompletionStatus in the status block of the workflow, to one that does, the nodes will remain in Running despite the pods being Completed.
This is because of this choice from #12537, which means missing TaskResultCompletionStatus entries are always going to be regarded as incomplete.
This blocks the controller from making any progress, and means upgrades across this change with running workflows will always fail to complete in-flight workflows.
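A quick way to check whether in-flight workflows are affected by this (a sketch; it assumes results written by pre-3.5.5 executors simply lack the label, so the != selector matches both absent and false values):

kubectl get workflowtaskresults.argoproj.io -A \
  -l 'workflows.argoproj.io/report-outputs-completed!=true'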
I found another issue: the task pods of the workflow had been cleaned up, but the workflow is still Running.
root@10-16-10-122:/home/devops# kubectl get wf -n argo-map mapping-pipeline-1524786-1720665764
NAME STATUS AGE MESSAGE
mapping-pipeline-1524786-1720665764 Running 14h
root@10-16-10-122:/home/devops# kubectl get pods -n argo-map | grep mapping-pipeline-1524786-1720665764
root@10-16-10-122:/home/devops#
the "taskResultsCompletionStatus" of wf is:
taskResultsCompletionStatus:
mapping-pipeline-1524786-1720665764-1391268475: false
mapping-pipeline-1524786-1720665764-3212493707: false
logs of the argo controller is:
time="2024-07-12T03:31:08.133Z" level=debug msg="taskresults of workflow are incomplete or still have daemon nodes, so can't mark workflow completed" fromPhase=Running namespace=argo-map toPhase=Succeeded workflow=mapping-pipeline-1524786-1720665764
The workflow is stuck in the Running state, even though all of its pods are in Error (or UnexpectedAdmissionError) state.
root@10-16-10-122:/home/devops# kubectl get wf -n argo-map-benchmark localization-benchmark-v2-7101-1720600493
NAME STATUS AGE MESSAGE
localization-benchmark-v2-7101-1720600493 Running 2d
root@10-16-10-122:/home/devops# kubectl get pods -n argo-map-benchmark | grep localization-benchmark-v2-7101-1720600493
localization-benchmark-v2-7101-1720600493-single-1265870429 0/2 Error 0 2d
localization-benchmark-v2-7101-1720600493-single-129032789 0/2 Error 0 2d
localization-benchmark-v2-7101-1720600493-single-1871596469 0/2 Error 0 2d
localization-benchmark-v2-7101-1720600493-single-2269257704 0/2 Error 0 2d
localization-benchmark-v2-7101-1720600493-single-2366564762 0/2 UnexpectedAdmissionError 0 2d
localization-benchmark-v2-7101-1720600493-single-2375535511 0/2 Error 0 2d
localization-benchmark-v2-7101-1720600493-single-2612496173 0/2 Error 0 2d
localization-benchmark-v2-7101-1720600493-single-2626411911 0/2 Error 0 2d
localization-benchmark-v2-7101-1720600493-single-2769513904 0/2 UnexpectedAdmissionError 0 2d
localization-benchmark-v2-7101-1720600493-single-3116854153 0/2 Error 0 2d
localization-benchmark-v2-7101-1720600493-single-3308776981 0/2 Error 0 2d
localization-benchmark-v2-7101-1720600493-single-3926467371 0/2 Error 0 2d
localization-benchmark-v2-7101-1720600493-single-4172479490 0/2 Error 0 2d
localization-benchmark-v2-7101-1720600493-single-4220742789 0/2 Error 0 2d
localization-benchmark-v2-7101-1720600493-single-4221498464 0/2 UnexpectedAdmissionError 0 2d
localization-benchmark-v2-7101-1720600493-single-45705550 0/2 Error 0 2d
localization-benchmark-v2-7101-1720600493-summary-406718820 0/2 UnexpectedAdmissionError 0 47h
https://github.com/argoproj/argo-workflows/blob/5aac5a8f61f4e8273d04509dffe7d80123ff67f5/workflow/controller/taskresult.go#L67 If the status of the pod is Error, I think the task result should be treated as completed.
We're experiencing the same issue, I believe (v3.5.7). Using Terminate or Stop does not help.
What action should we take in these cases? Is simply deleting the Workflow a safe approach, if we can?
https://github.com/argoproj/argo-workflows/pull/13332 cannot fix the issue when upgrading Argo from v3.4.9 to v3.5.8: the pod status is Completed, but taskResultsCompletionStatus for the task under the workflow stays 'false'. @Joibel
Another issue: the pod is Completed, but the task status in the workflow's taskResultsCompletionStatus is 'false', and the status of the workflow is Running.
taskResultsCompletionStatus:
  prod--prod--lt-filter-1-1-95--2692e7e1-30e5-44a5-977a-d5c32dmh5-2722797653: false
root@10-16-10-122:/home/devops# kubectl get pods -n argo-data-closeloop | grep -i comple
prod--prod--lt-filter-1-1-95--2692e7e1-30e5-44a5-977a-d5c32dmh5-job-entrypoint-2722797653 0/2 Completed 0 14d
I don't agree that PR https://github.com/argoproj/argo-workflows/pull/13332 can fix it.
We're experiencing the same issue, I believe (v3.5.7). Using Terminate or Stop does not help. What action should we take in these cases? Is simply deleting the Workflow a safe approach, if we can?
I usually use the following command to recover the stuck workflow:
# find the name of the WorkflowTaskResult belonging to the completed pod
kubectl label workflowtaskresult ${taskResultName} workflows.argoproj.io/report-outputs-completed=true
@stefanondisponibile Maybe you can try it, rather than deleting the workflow.
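To find the WorkflowTaskResult names for a stuck workflow in the first place, something like this should work (assuming task results carry the standard workflow label):

kubectl get workflowtaskresults -n ${namespace} -l workflows.argoproj.io/workflow=${workflowName}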
https://github.com/argoproj/argo-workflows/pull/13332 cannot fix the issue when upgrading Argo from v3.4.9 to v3.5.8: the pod status is Completed, but taskResultsCompletionStatus for the task under the workflow stays 'false'.
@zhucan You are right. taskResultsCompletionStatus is always false if the label workflows.argoproj.io/report-outputs-completed is false or missing.
https://github.com/argoproj/argo-workflows/blob/d7495b83b519e0c39b49fe692485e95286ce6665/workflow/controller/taskresult.go#L66-L73
I agree. It won't fix an issue of taskResultsCompletionStatus being false. That's often caused by missing/incorrect RBAC.
Other thoughts: you may have set your executor to an old version?
PR #13332 only fixes the issue of taskResultsCompletionStatus being completely missing for a node.
Check the executor (wait container) logs and confirm that WorkflowTaskResults are being created by your pods.
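A sketch of both checks (pod, namespace, and workflow names are placeholders):

# which executor image/version the pod actually ran
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[?(@.name=="wait")].image}'
# whether WorkflowTaskResults were created for the workflow
kubectl get workflowtaskresults -n <namespace> -l workflows.argoproj.io/workflow=<workflow-name>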