argo-workflows
Evicted pod does not cause Argo step to fail
Pre-requisites
- [X] I have double-checked my configuration
- [X] I can confirm the issue exists when I tested with :latest
- [ ] I'd like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
We are using spot instances to run the pods for Argo steps. When a spot instance is evicted, the step's pod cannot finish normally. However, Argo does not treat that step as failed. In the Argo JSON log, we can see that finishedAt is set to null but the phase is still set to Succeeded.
"f070cba2-9f28-477f-8d79-6e6869411703-3016397459": {
"id": "f070cba2-9f28-477f-8d79-6e6869411703-3016397459",
"name": "engine",
"displayName": "engine",
"type": "Pod",
"templateName": "engine",
"templateScope": "local/f070cba2-9f28-477f-8d79-6e6869411703",
"phase": "Succeeded",
"boundaryID": "f070cba2-9f28-477f-8d79-6e6869411703-1843839738",
"startedAt": "2023-11-16T08:44:35Z",
"finishedAt": null,
}
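To check how many nodes in a workflow show this symptom, a small script along these lines could scan the node map. This is only a sketch: it assumes the nodes are available as a dictionary keyed by node ID, shaped like the excerpt above, and `find_suspect_nodes` is a hypothetical helper name, not part of Argo.

```python
from typing import Any, Dict, List


def find_suspect_nodes(nodes: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return IDs of Pod nodes reported as Succeeded with no finish time.

    `nodes` is assumed to be a mapping of node ID to node status,
    shaped like the JSON excerpt in this report.
    """
    suspects = []
    for node_id, node in nodes.items():
        if (
            node.get("type") == "Pod"
            and node.get("phase") == "Succeeded"
            and node.get("finishedAt") is None  # JSON null becomes None
        ):
            suspects.append(node_id)
    return suspects
```

Loading the workflow JSON with `json.load` and passing its node map to `find_suspect_nodes` would list the steps that look like the one above.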
We encountered this issue frequently recently on AKS.
Version
v3.4.7
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Any workflow
Logs from the workflow controller
rotated...
Logs from your workflow's wait container
rotated...
Can you test it on latest? I have also run into insufficient spot instance inventory before, where the pod was Failed but had no message. In that case the inferred failure reason is nil and the pod is marked as succeeded. This may be the same cause. I have a fix for it in https://github.com/argoproj/argo-workflows/pull/12197
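The failure mode described in this comment can be modeled roughly as follows. This is a hypothetical illustration of the reasoning only, not the workflow controller's code and not the change in the linked pull request. All function names are made up, and the pod status is assumed to be a plain dictionary with the Kubernetes `phase`, `message`, and `reason` fields.

```python
from typing import Any, Dict, Optional


def naive_failure_reason(pod_status: Dict[str, Any]) -> Optional[str]:
    # Hypothetical model of the bug: a failure is only reported
    # when the pod carries a non-empty message.
    message = pod_status.get("message") or ""
    return message if message else None


def defensive_failure_reason(pod_status: Dict[str, Any]) -> Optional[str]:
    # Hypothetical alternative: a Failed pod phase always yields a
    # reason, even when no message or reason was recorded.
    if pod_status.get("phase") != "Failed":
        return None
    return (
        pod_status.get("message")
        or pod_status.get("reason")
        or "pod failed with no message"
    )


def node_phase(failure_reason: Optional[str]) -> str:
    # No failure reason means the step is treated as Succeeded.
    return "Succeeded" if failure_reason is None else "Failed"
```

With a pod status of `{"phase": "Failed", "message": ""}`, the naive variant yields no reason and the step ends up Succeeded, while the defensive variant marks it Failed.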
This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.
This issue has been closed due to inactivity and lack of information. If you still encounter this issue, please add the requested information and re-open.
Thanks @shuangkun, will try.