flyte
flyte copied to clipboard
[BUG] `Predicate NodeAffinity failed` incorrectly handled as user error on GKE
Describe the bug
On GKE v1.25.10-gke.1200
:
Interruptible task is marked as failed with error message:
[1/1] currentAttempt done. Last Error: USER::Pod Predicate NodeAffinity failed
The following kubelet logs were recovered from the node in question:
{
"timestamp": "2023-08-10T18:00:54Z",
"log": "Starting kubelet."
}
{
"timestamp": "2023-08-10T18:01:16Z",
"log": "Shutdown manager detected shutdown event"
}
{
"timestamp": "2023-08-10T18:03:22Z",
"log": "Starting kubelet."
}
{
"timestamp": "2023-08-10T18:03:24Z",
"log": "Predicate NodeAffinity failed"
}
This suggests that the node in question was originally preempted, and was subsequently recreated with the same name, making this falsely appear as a node restart. Per cluster logs, kubelet forces a delete of the unrecognized pod. This is then incorrectly handled as a user error by propeller.
Expected behavior
Predicate NodeAffinity failed
should be treated as a system failure, similar to node interruption, and the system retry budget should be respected.
Additional context to reproduce
No response
Screenshots
No response
Are you sure this issue hasn't been raised already?
- [X] Yes
Have you read the Code of Conduct?
- [X] Yes