argo-workflows
Daemon pods keep running after workflow DAG fails when using failFast: true
Pre-requisites
- [X] I have double-checked my configuration
- [X] I can confirm the issue exists when I tested with :latest
- [ ] I'd like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
I expect the daemon pod to be terminated when the workflow fails, but that's not the case. The workflow is correctly marked as failed, but the daemon pod keeps running until the workflow is deleted. I think the controller tries to delete the daemon pod but gets a 404 response (from the controller logs):
time="2023-01-05T18:17:43.909Z" level=info msg="Checking daemoned children of " namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.914Z" level=info msg="cleaning up pod" action=deletePod key=argo/daemon-nginx-7m8fc-1340600742-agent/deletePod
time="2023-01-05T18:17:43.915Z" level=info msg="Delete pods 404"
Some other notes:
- Unrelated to this issue: should daemon containers count towards the DAG parallelism? I wanted a parallelism of one here, but with that setting the workflow gets stuck running only the daemon task.
- Without failFast, the daemon pod is properly deleted.
- Even with just one item in withParam, the daemon pod is not properly deleted if that item fails.
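The variations mentioned in these notes can be sketched against the entry template of the reproduction workflow posted below. This fragment only illustrates which fields were changed, assuming the nginx-server and nginx-client templates stay exactly as posted; the outcomes in the comments are the ones reported in the notes, not additional test results.

```yaml
# Entry template only; the other two templates are assumed unchanged.
- name: daemon-nginx-example
  failFast: true    # with this line removed, the daemon pod is reportedly deleted as expected
  parallelism: 2    # with 1, the workflow reportedly gets stuck running only the daemon task
  dag:
    tasks:
      - name: nginx-server
        template: nginx-server
      - name: nginx-client
        template: nginx-client
        depends: "nginx-server"
        # A single item is reportedly enough to leave the daemon pod running.
        withParam: |
          ["one"]
        arguments:
          parameters:
            - name: server-ip
              value: "{{tasks.nginx-server.ip}}"
```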
Version
latest
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: daemon-nginx-
  namespace: argo
spec:
  entrypoint: daemon-nginx-example
  templates:
    - name: daemon-nginx-example
      failFast: true
      parallelism: 2
      dag:
        tasks:
          - name: nginx-server
            template: nginx-server
          - name: nginx-client
            template: nginx-client
            depends: "nginx-server"
            withParam: |
              ["one", "two"]
            arguments:
              parameters:
                - name: server-ip
                  value: "{{tasks.nginx-server.ip}}"
    - name: nginx-server
      daemon: true
      container:
        image: nginx:1.13
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 2
          timeoutSeconds: 1
    - name: nginx-client
      inputs:
        parameters:
          - name: server-ip
      container:
        image: appropriate/curl:latest
        command: ["/bin/sh", "-c"]
        # Fail
        args: ["aaaaaaaaaa"]
Logs from the workflow controller
time="2023-01-05T18:17:03.870Z" level=info msg="Processing workflow" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.880Z" level=info msg="Updated phase -> Running" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.880Z" level=info msg="DAG node daemon-nginx-7m8fc initialized Running" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.880Z" level=info msg="All of node daemon-nginx-7m8fc.nginx-server dependencies [] completed" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.880Z" level=info msg="Pod node daemon-nginx-7m8fc-1217350964 initialized Pending" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.886Z" level=info msg="Created pod: daemon-nginx-7m8fc.nginx-server (daemon-nginx-7m8fc-nginx-server-1217350964)" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.886Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.886Z" level=info msg=reconcileAgentPod namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.890Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=807268 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.886Z" level=info msg="Processing workflow" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.887Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.887Z" level=info msg="Node became daemoned" namespace=argo nodeId=daemon-nginx-7m8fc-1217350964 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.887Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=daemon-nginx-7m8fc-1217350964 old.message= old.phase=Pending old.progress=0/1 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.887Z" level=info msg="TaskGroup node daemon-nginx-7m8fc-3902071824 initialized Running (message: )" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.887Z" level=info msg="All of node daemon-nginx-7m8fc.nginx-client(0:one) dependencies [nginx-server] completed" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.887Z" level=info msg="Pod node daemon-nginx-7m8fc-3898481205 initialized Pending" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.890Z" level=info msg="Created pod: daemon-nginx-7m8fc.nginx-client(0:one) (daemon-nginx-7m8fc-nginx-client-3898481205)" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.890Z" level=info msg="All of node daemon-nginx-7m8fc.nginx-client(1:two) dependencies [nginx-server] completed" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.890Z" level=info msg="template (node daemon-nginx-7m8fc) active children parallelism exceeded 2" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.890Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.890Z" level=info msg=reconcileAgentPod namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.899Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=807303 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="Processing workflow" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="node changed" namespace=argo new.message="Error (exit code 127)" new.phase=Failed new.progress=0/1 nodeID=daemon-nginx-7m8fc-3898481205 old.message= old.phase=Pending old.progress=0/1 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="node unchanged" namespace=argo nodeID=daemon-nginx-7m8fc-1217350964 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="node daemon-nginx-7m8fc phase Running -> Failed" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="node daemon-nginx-7m8fc message: template has failed or errored children and failFast enabled" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="node daemon-nginx-7m8fc finished: 2023-01-05 18:17:23.891758459 +0000 UTC" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=error msg="error in entry template execution" error="Max parallelism reached" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.895Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=807339 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.900Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/daemon-nginx-7m8fc-nginx-client-3898481205/labelPodCompleted
time="2023-01-05T18:17:33.895Z" level=info msg="Processing workflow" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg="node unchanged" namespace=argo nodeID=daemon-nginx-7m8fc-1217350964 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg=reconcileAgentPod namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg="Updated phase Running -> Failed" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg="Updated message -> template has failed or errored children and failFast enabled" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg="Checking daemoned children of " namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.901Z" level=info msg="cleaning up pod" action=deletePod key=argo/daemon-nginx-7m8fc-1340600742-agent/deletePod
time="2023-01-05T18:17:33.908Z" level=info msg="Workflow update successful" namespace=argo phase=Failed resourceVersion=807359 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.909Z" level=info msg="Processing workflow" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.909Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.909Z" level=info msg="node unchanged" namespace=argo nodeID=daemon-nginx-7m8fc-1217350964 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.909Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.909Z" level=info msg=reconcileAgentPod namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.909Z" level=info msg="Checking daemoned children of " namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.914Z" level=info msg="cleaning up pod" action=deletePod key=argo/daemon-nginx-7m8fc-1340600742-agent/deletePod
Logs from your workflow's wait container
time="2023-01-05T18:17:17.185Z" level=info msg="Starting Workflow Executor" version=untagged
time="2023-01-05T18:17:17.188Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-01-05T18:17:17.188Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=argo podName=daemon-nginx-7m8fc-nginx-client-3898481205 template="{\"name\":\"nginx-client\",\"inputs\":{\"parameters\":[{\"name\":\"server-ip\",\"value\":\"10.244.0.13\"}]},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"appropriate/curl:latest\",\"command\":[\"/bin/sh\",\"-c\"],\"args\":[\"aaaaaaaaaa\"],\"resources\":{}}}" version="&Version{Version:untagged,BuildDate:2023-01-05T16:21:00Z,GitCommit:0f58387c79728b84037aa96221d1c97a974402a4,GitTag:untagged,GitTreeState:clean,GoVersion:go1.18.9,Compiler:gc,Platform:linux/amd64,}"
time="2023-01-05T18:17:17.188Z" level=info msg="Starting deadline monitor"
time="2023-01-05T18:17:20.190Z" level=info msg="Main container completed" error="<nil>"
time="2023-01-05T18:17:20.190Z" level=info msg="Deadline monitor stopped"
time="2023-01-05T18:17:20.190Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2023-01-05T18:17:20.190Z" level=info msg="No output parameters"
time="2023-01-05T18:17:20.190Z" level=info msg="No output artifacts"
time="2023-01-05T18:17:20.190Z" level=info msg="Alloc=6340 TotalAlloc=12280 Sys=19666 NumGC=4 Goroutines=5"
time="2023-01-05T18:17:06.942Z" level=info msg="Starting Workflow Executor" version=untagged
time="2023-01-05T18:17:06.944Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-01-05T18:17:06.944Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=argo podName=daemon-nginx-7m8fc-nginx-server-1217350964 template="{\"name\":\"nginx-server\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"daemon\":true,\"container\":{\"name\":\"\",\"image\":\"nginx:1.13\",\"resources\":{},\"readinessProbe\":{\"httpGet\":{\"path\":\"/\",\"port\":80},\"initialDelaySeconds\":2,\"timeoutSeconds\":1}}}" version="&Version{Version:untagged,BuildDate:2023-01-05T16:21:00Z,GitCommit:0f58387c79728b84037aa96221d1c97a974402a4,GitTag:untagged,GitTreeState:clean,GoVersion:go1.18.9,Compiler:gc,Platform:linux/amd64,}"
time="2023-01-05T18:17:06.944Z" level=info msg="Starting deadline monitor"