argo-workflows
argo-workflows copied to clipboard
GKE preemption inconsistently treated as WorkflowNodeError and WorkflowNodeFailed
Pre-requisites
- [X] I have double-checked my configuration
- [ ] I can confirm the issues exists when I tested with
:latest
- [ ] I'd like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
For some context, we run a large number of jobs with preemptible node pools in GKE. We typically don't want to retry application failures, but we always want to retry preemptions. We added a retry policy for OnError
, and successfully triggered this policy when manually pre-empting nodes (e.g. deleting a node in the node pool).
When looking at Argo workflows in the cluster, manual preemption was recorded as a WorkflowNodeError
with a pod deleted
error message and correctly triggered our retry policy. During actual pre-emption, we see WorkflowNodeFailed
- Pod was terminated due to imminent node shutdown
and does not trigger our retry policy.
After https://github.com/argoproj/argo-workflows/issues/2881, it seems all preemptions should be treated as WorkflowNodeError
. This would help isolate retries to node availability problems as opposed to application errors.
Version
v3.3.8
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.
Should be any workflow running in GKE
Logs from the workflow controller
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
time="2022-10-19T17:06:08.550Z" level=info msg="Processing workflow" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.557Z" level=info msg="Updated phase -> Running" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.557Z" level=info msg="DAG node ${workflow} initialized Running" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.557Z" level=info msg="All of node ${workflow}.exit-handler-1 dependencies [] completed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.558Z" level=info msg="DAG node ${workflow}-4115780724 initialized Running" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.558Z" level=info msg="All of node ${workflow}.exit-handler-1.chaptering dependencies [] completed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.559Z" level=info msg="Retry node ${workflow}-3870252585 initialized Running" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.559Z" level=info msg="Pod node ${workflow}-1579747344 initialized Pending" namespace=default workflow=${workflow}
time="2022-10-19T17:06:13.613Z" level=info msg="Created pod: ${workflow}.exit-handler-1.chaptering(0) (${workflow}-1579747344)" namespace=default workflow=${workflow}
time="2022-10-19T17:06:13.613Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=${workflow}
time="2022-10-19T17:06:13.613Z" level=info msg=reconcileAgentPod namespace=default workflow=${workflow}
time="2022-10-19T17:06:13.647Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=19478544 workflow=${workflow}
time="2022-10-19T17:06:23.597Z" level=info msg="Processing workflow" namespace=default workflow=${workflow}
time="2022-10-19T17:06:23.597Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=${workflow}
time="2022-10-19T17:06:23.597Z" level=info msg="node changed" new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=${workflow}-1579747344 old.message= old.phase=Pending old.progress=0/1
time="2022-10-19T17:06:23.599Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=${workflow}
time="2022-10-19T17:06:23.599Z" level=info msg=reconcileAgentPod namespace=default workflow=${workflow}
time="2022-10-19T17:06:23.621Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=19478634 workflow=${workflow}
time="2022-10-19T17:06:33.623Z" level=info msg="Processing workflow" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.623Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=${workflow}
time="2022-10-19T17:06:33.623Z" level=info msg="node changed" new.message="Pod was terminated in response to imminent node shutdown." new.phase=Failed new.progress=0/1 nodeID=${workflow}-1579747344 old.message=PodInitializing old.phase=Pending old.progress=0/1
time="2022-10-19T17:06:33.625Z" level=info msg="node has maxDuration set, setting executionDeadline to: Mon Jan 01 00:00:00 +0000 (a long while ago)" namespace=default node=${workflow}.exit-handler-1.chaptering workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Node not set to be retried after status: Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-3870252585 phase Running -> Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-3870252585 message: Pod was terminated in response to imminent node shutdown." namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-3870252585 finished: 2022-10-19 17:06:33.625741142 +0000 UTC" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Outbound nodes of ${workflow}-4115780724 set to [${workflow}-1579747344]" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-4115780724 phase Running -> Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-4115780724 finished: 2022-10-19 17:06:33.625828677 +0000 UTC" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Checking daemoned children of ${workflow}-4115780724" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Outbound nodes of ${workflow} set to [${workflow}-1579747344]" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow} phase Running -> Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow} finished: 2022-10-19 17:06:33.62593395 +0000 UTC" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Checking daemoned children of ${workflow}" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg=reconcileAgentPod namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.626Z" level=info msg="Running OnExit handler: notifyhandler" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.626Z" level=info msg="Pod node ${workflow}-1999402710 initialized Pending" namespace=default workflow=${workflow}
time="2022-10-19T17:06:38.650Z" level=info msg="Created pod: ${workflow}.onExit (${workflow}-1999402710)" namespace=default workflow=${workflow}
time="2022-10-19T17:06:38.673Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=19478796 workflow=${workflow}
time="2022-10-19T17:06:38.679Z" level=info msg="cleaning up pod" action=labelPodCompleted key=default/${workflow}-1579747344/labelPodCompleted
time="2022-10-19T17:06:48.649Z" level=info msg="Processing workflow" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.649Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=${workflow}
time="2022-10-19T17:06:48.649Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.649Z" level=info msg="node changed" new.message= new.phase=Succeeded new.progress=0/1 nodeID=${workflow}-1999402710 old.message= old.phase=Pending old.progress=0/1
time="2022-10-19T17:06:48.650Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg=reconcileAgentPod namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg="Running OnExit handler: notifyhandler" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg="Updated phase Running -> Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg="Marking workflow completed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg="Checking daemoned children of " namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.655Z" level=info msg="cleaning up pod" action=deletePod key=default/${workflow}-1340600742-agent/deletePod
time="2022-10-19T17:06:48.664Z" level=info msg="Workflow update successful" namespace=default phase=Failed resourceVersion=19478961 workflow=${workflow}
time="2022-10-19T17:06:48.673Z" level=info msg="cleaning up pod" action=labelPodCompleted key=default/${workflow}-1999402710/labelPodCompleted
Logs from in your workflow's wait container
kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
Have you tried setting TRANSIENT_ERROR_PATTERN
? https://github.com/argoproj/argo-workflows/blob/master/docs/environment-variables.md
Good point, setting TRANSIENT_ERROR_PATTERN
does solve my issue. It would be nice to restart on both OnTransientError
with this pattern and OnError
. I'd like to use conditional retries, but not sure if expression
would have access to the imminent node shutdown
message on lastRetry
- only seeing exitCode
, status
and duration
in https://argoproj.github.io/argo-workflows/retries/#conditional-retries.
@pjo256 Can you create an enhancement request for the above ask?