Evicted pod does not cause Argo step to fail

Open moveman opened this issue 1 year ago • 1 comments

Pre-requisites

  • [X] I have double-checked my configuration
  • [X] I can confirm the issue exists when I tested with :latest
  • [ ] I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

We run the pods for our Argo steps on spot instances. When a spot instance is evicted, the step's pod cannot finish normally, yet Argo does not treat that step as failed. In the Argo JSON log, finishedAt is set to null but the phase is still set to Succeeded.

  "f070cba2-9f28-477f-8d79-6e6869411703-3016397459": {
      "id": "f070cba2-9f28-477f-8d79-6e6869411703-3016397459",
      "name": "engine",
      "displayName": "engine",
      "type": "Pod",
      "templateName": "engine",
      "templateScope": "local/f070cba2-9f28-477f-8d79-6e6869411703",
      "phase": "Succeeded",
      "boundaryID": "f070cba2-9f28-477f-8d79-6e6869411703-1843839738",
      "startedAt": "2023-11-16T08:44:35Z",
      "finishedAt": null,
  }
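
For anyone who wants to check their own workflows for the same symptom, here is a minimal sketch (not part of the original report) that scans a workflow's JSON for nodes whose phase is Succeeded while finishedAt is null. It assumes the input is the full Workflow object as JSON, for example saved from `kubectl get workflow <name> -o json`, with the nodes stored as a map under status.nodes. The function name is hypothetical.

```python
import json
from typing import List


def find_inconsistent_nodes(workflow: dict) -> List[str]:
    """Return IDs of nodes marked Succeeded whose finishedAt is null or missing."""
    # Assumption: nodes live under status.nodes, keyed by node ID.
    nodes = (workflow.get("status") or {}).get("nodes") or {}
    return [
        node_id
        for node_id, node in nodes.items()
        if node.get("phase") == "Succeeded" and node.get("finishedAt") is None
    ]


if __name__ == "__main__":
    # Trimmed-down sample built from the node shown in this issue.
    sample = json.loads("""
    {
      "status": {
        "nodes": {
          "f070cba2-9f28-477f-8d79-6e6869411703-3016397459": {
            "name": "engine",
            "type": "Pod",
            "phase": "Succeeded",
            "startedAt": "2023-11-16T08:44:35Z",
            "finishedAt": null
          }
        }
      }
    }
    """)
    for node_id in find_inconsistent_nodes(sample):
        print(node_id)
```

A node reported by this check finished "successfully" according to its phase but never recorded a finish time, which is the inconsistency described above.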

We have been hitting this issue frequently on AKS recently.

Version

v3.4.7

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Any workflow

Logs from the workflow controller

rotated...

Logs from your workflow's wait container

rotated...

moveman, Nov 20 '23 10:11

Can you test it on latest? I have also run into insufficient spot instance inventory before, where the pod ended up Failed but with no message. In that case the inferred failure reason is nil and the pod is marked as succeeded. This may be the same cause. I fixed it in https://github.com/argoproj/argo-workflows/pull/12197
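
To make the failure mode described in this comment concrete, here is a rough illustrative model in Python. It is not Argo's controller code (which is written in Go) and not necessarily what the linked PR changes. All function names are hypothetical, and the logic only mirrors the comment: a Failed pod with no message yields no inferred reason and is therefore treated as succeeded.

```python
from typing import Optional


def infer_failure_reason(pod_message: str) -> Optional[str]:
    """Hypothetical stand-in: a reason is inferred only if the pod has a message."""
    return pod_message if pod_message else None


def node_phase_lenient(pod_phase: str, pod_message: str) -> str:
    """Models the behavior described above: no inferred reason means success."""
    if pod_phase == "Failed":
        reason = infer_failure_reason(pod_message)
        return "Failed" if reason is not None else "Succeeded"
    return pod_phase


def node_phase_strict(pod_phase: str, pod_message: str) -> str:
    """One possible stricter rule (an assumption): a Failed pod always fails the node."""
    return pod_phase


if __name__ == "__main__":
    # An evicted pod that reports Failed but carries no message.
    print(node_phase_lenient("Failed", ""))  # Succeeded
    print(node_phase_strict("Failed", ""))   # Failed
```

Under this model, the empty message is what flips the outcome, which would match a step showing as Succeeded after its pod was evicted.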

shuangkun, Dec 22 '23 08:12

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

github-actions[bot], Feb 03 '24 02:02

This issue has been closed due to inactivity and lack of information. If you still encounter this issue, please add the requested information and re-open.

github-actions[bot], Feb 17 '24 02:02

Thanks @shuangkun, will try.

moveman, Apr 23 '24 02:04