
GKE preemption inconsistently treated as WorkflowNodeError and WorkflowNodeFailed

Open pjo256 opened this issue 1 year ago • 1 comment

Pre-requisites

  • [X] I have double-checked my configuration
  • [ ] I can confirm the issue exists when I tested with `:latest`
  • [ ] I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

For some context, we run a large number of jobs on preemptible node pools in GKE. We typically don't want to retry application failures, but we always want to retry preemptions. We added a retry policy with `retryPolicy: OnError`, and successfully triggered this policy when manually preempting nodes (e.g. by deleting a node in the node pool).

When looking at Argo workflows in the cluster, a manual preemption was recorded as a `WorkflowNodeError` with a pod-deleted error message and correctly triggered our retry policy. During an actual GKE preemption, however, we see `WorkflowNodeFailed` with the message "Pod was terminated in response to imminent node shutdown.", which does not trigger our retry policy.

After https://github.com/argoproj/argo-workflows/issues/2881, it seems all preemptions should be treated as `WorkflowNodeError`. Treating them consistently would let us isolate retries to node-availability problems, as opposed to application errors.
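The retry policy in question looks roughly like this (an illustrative sketch, not our exact manifest):

```yaml
# Sketch of a template-level retryStrategy intended to retry only
# node-level errors (WorkflowNodeError), not application failures.
retryStrategy:
  limit: "3"
  retryPolicy: OnError   # fires on manual node deletion, but not on GKE preemption
```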

Version

v3.3.8

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Any workflow running on a preemptible GKE node pool should reproduce this.
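A minimal sketch of such a workflow (names are hypothetical; the nodeSelector assumes a GKE preemptible node pool, and the long sleep just keeps the pod alive until GKE preempts the node):

```yaml
# Hypothetical minimal reproduction: run a long sleep on a preemptible
# node and wait for GKE to preempt it.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: preemption-repro-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: "3"
        retryPolicy: OnError   # expected to fire on preemption, but does not
      nodeSelector:
        cloud.google.com/gke-preemptible: "true"
      container:
        image: busybox
        command: [sleep, "3600"]
```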

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

time="2022-10-19T17:06:08.550Z" level=info msg="Processing workflow" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.557Z" level=info msg="Updated phase  -> Running" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.557Z" level=info msg="DAG node ${workflow} initialized Running" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.557Z" level=info msg="All of node ${workflow}.exit-handler-1 dependencies [] completed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.558Z" level=info msg="DAG node ${workflow}-4115780724 initialized Running" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.558Z" level=info msg="All of node ${workflow}.exit-handler-1.chaptering dependencies [] completed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.559Z" level=info msg="Retry node ${workflow}-3870252585 initialized Running" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.559Z" level=info msg="Pod node ${workflow}-1579747344 initialized Pending" namespace=default workflow=${workflow}
time="2022-10-19T17:06:13.613Z" level=info msg="Created pod: ${workflow}.exit-handler-1.chaptering(0) (${workflow}-1579747344)" namespace=default workflow=${workflow}
time="2022-10-19T17:06:13.613Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=${workflow}
time="2022-10-19T17:06:13.613Z" level=info msg=reconcileAgentPod namespace=default workflow=${workflow}
time="2022-10-19T17:06:13.647Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=19478544 workflow=${workflow}
time="2022-10-19T17:06:23.597Z" level=info msg="Processing workflow" namespace=default workflow=${workflow}
time="2022-10-19T17:06:23.597Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=${workflow}
time="2022-10-19T17:06:23.597Z" level=info msg="node changed" new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=${workflow}-1579747344 old.message= old.phase=Pending old.progress=0/1
time="2022-10-19T17:06:23.599Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=${workflow}
time="2022-10-19T17:06:23.599Z" level=info msg=reconcileAgentPod namespace=default workflow=${workflow}
time="2022-10-19T17:06:23.621Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=19478634 workflow=${workflow}
time="2022-10-19T17:06:33.623Z" level=info msg="Processing workflow" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.623Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=${workflow}
time="2022-10-19T17:06:33.623Z" level=info msg="node changed" new.message="Pod was terminated in response to imminent node shutdown." new.phase=Failed new.progress=0/1 nodeID=${workflow}-1579747344 old.message=PodInitializing old.phase=Pending old.progress=0/1
time="2022-10-19T17:06:33.625Z" level=info msg="node has maxDuration set, setting executionDeadline to: Mon Jan 01 00:00:00 +0000 (a long while ago)" namespace=default node=${workflow}.exit-handler-1.chaptering workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Node not set to be retried after status: Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-3870252585 phase Running -> Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-3870252585 message: Pod was terminated in response to imminent node shutdown." namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-3870252585 finished: 2022-10-19 17:06:33.625741142 +0000 UTC" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Outbound nodes of ${workflow}-4115780724 set to [${workflow}-1579747344]" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-4115780724 phase Running -> Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-4115780724 finished: 2022-10-19 17:06:33.625828677 +0000 UTC" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Checking daemoned children of ${workflow}-4115780724" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Outbound nodes of ${workflow} set to [${workflow}-1579747344]" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow} phase Running -> Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow} finished: 2022-10-19 17:06:33.62593395 +0000 UTC" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Checking daemoned children of ${workflow}" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg=reconcileAgentPod namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.626Z" level=info msg="Running OnExit handler: notifyhandler" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.626Z" level=info msg="Pod node ${workflow}-1999402710 initialized Pending" namespace=default workflow=${workflow}
time="2022-10-19T17:06:38.650Z" level=info msg="Created pod: ${workflow}.onExit (${workflow}-1999402710)" namespace=default workflow=${workflow}
time="2022-10-19T17:06:38.673Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=19478796 workflow=${workflow}
time="2022-10-19T17:06:38.679Z" level=info msg="cleaning up pod" action=labelPodCompleted key=default/${workflow}-1579747344/labelPodCompleted
time="2022-10-19T17:06:48.649Z" level=info msg="Processing workflow" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.649Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=${workflow}
time="2022-10-19T17:06:48.649Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.649Z" level=info msg="node changed" new.message= new.phase=Succeeded new.progress=0/1 nodeID=${workflow}-1999402710 old.message= old.phase=Pending old.progress=0/1
time="2022-10-19T17:06:48.650Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg=reconcileAgentPod namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg="Running OnExit handler: notifyhandler" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg="Updated phase Running -> Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg="Marking workflow completed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg="Checking daemoned children of " namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.655Z" level=info msg="cleaning up pod" action=deletePod key=default/${workflow}-1340600742-agent/deletePod
time="2022-10-19T17:06:48.664Z" level=info msg="Workflow update successful" namespace=default phase=Failed resourceVersion=19478961 workflow=${workflow}
time="2022-10-19T17:06:48.673Z" level=info msg="cleaning up pod" action=labelPodCompleted key=default/${workflow}-1999402710/labelPodCompleted

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

pjo256 avatar Oct 21 '22 15:10 pjo256

Have you tried setting TRANSIENT_ERROR_PATTERN? https://github.com/argoproj/argo-workflows/blob/master/docs/environment-variables.md
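For example, the pattern could be set on the workflow-controller Deployment (a sketch; the exact regex is up to you):

```yaml
# Sketch: treat GKE node-shutdown terminations as transient errors by
# setting TRANSIENT_ERROR_PATTERN on the workflow-controller container.
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            - name: TRANSIENT_ERROR_PATTERN
              value: "imminent node shutdown"
```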

terrytangyuan avatar Oct 21 '22 16:10 terrytangyuan

Good point, setting TRANSIENT_ERROR_PATTERN does solve my issue. It would be nice to retry on both OnTransientError (with this pattern) and OnError. I'd like to use conditional retries, but I'm not sure whether `expression` has access to the imminent-node-shutdown message via `lastRetry` — I only see `exitCode`, `status`, and `duration` in https://argoproj.github.io/argo-workflows/retries/#conditional-retries.
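For comparison, the documented conditional-retry variables only cover the last attempt's exit code, status, and duration — something like:

```yaml
# Sketch using only the lastRetry fields documented for conditional retries;
# no documented field exposes the pod's termination message (e.g. the
# "imminent node shutdown" text), so it cannot be matched here.
retryStrategy:
  limit: "3"
  expression: "asInt(lastRetry.exitCode) > 1"
```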

pjo256 avatar Oct 25 '22 15:10 pjo256

@pjo256 Can you create an enhancement request for the above ask?

sarabala1979 avatar Oct 31 '22 17:10 sarabala1979