Handle failed Kubernetes scheduling events more gracefully
First check
- [X] I added a descriptive title to this issue.
- [X] I used the GitHub search to find a similar issue and didn't find it.
- [X] I searched the Prefect documentation for this issue.
- [X] I checked that this issue is related to Prefect and not one of its dependencies.
Bug summary
When a Kubernetes cluster is unable to schedule a pod before the worker's Pod Watch Timeout elapses, the flow run is reported as Crashed but subsequently completes successfully.
Here is the log output from an example run:
Note that `Reported flow run 'c605bcd5-8048-4140-bea4-0d04f0a2af2c' as crashed: Flow run infrastructure exited with non-zero status code -1.` occurs after the pod is scheduled successfully.
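To illustrate the suspected mechanism, here is a minimal sketch, not Prefect's actual worker code, of how a pod watch with a fixed timeout can report `-1` for a pod that is merely slow to schedule; `watch_pod_until_running` and `POD_WATCH_TIMEOUT` are hypothetical names:

```python
# Minimal sketch (NOT Prefect's implementation) of a fixed-timeout pod watch.
from kubernetes import client, config, watch

POD_WATCH_TIMEOUT = 60  # seconds; stands in for the work pool's Pod Watch Timeout

def watch_pod_until_running(namespace: str, pod_name: str) -> int:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    w = watch.Watch()
    # The watch ends after POD_WATCH_TIMEOUT seconds even if the pod is
    # still Pending because no node is schedulable yet.
    for event in w.stream(
        v1.list_namespaced_pod,
        namespace=namespace,
        field_selector=f"metadata.name={pod_name}",
        timeout_seconds=POD_WATCH_TIMEOUT,
    ):
        if event["object"].status.phase == "Running":
            w.stop()
            return 0
    # Falling out of the loop means the watch timed out. Returning -1 here
    # is what would surface as "exited with non-zero status code -1", even
    # though the pod may still be scheduled and run to completion later.
    return -1
```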
Reproduction
Reproduction can be a bit complex, but the behavior should be understandable from the Kubernetes events without reproducing it; one way to simulate the scheduling delay is sketched after this list.
- Start a Kubernetes worker
- Have the worker start a flow run in a cluster with no nodes available
- After ~2 minutes, ensure a node is available on the cluster
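One way to simulate "no nodes available" on a test cluster, assuming permission to patch nodes, is to cordon every node before the flow run starts and uncordon them after the watch timeout has passed. A sketch using the official `kubernetes` Python client:

```python
# Simulate the repro: make all nodes unschedulable, then free them after
# the worker's pod watch timeout has elapsed.
import time
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def set_unschedulable(unschedulable: bool) -> None:
    # Equivalent to running `kubectl cordon` / `kubectl uncordon` on every node.
    for node in v1.list_node().items:
        v1.patch_node(node.metadata.name, {"spec": {"unschedulable": unschedulable}})

set_unschedulable(True)    # no node can accept the flow run pod now
# ... start the flow run via the Kubernetes worker here; the pod stays Pending ...
time.sleep(150)            # wait past the default Pod Watch Timeout (~2 minutes)
set_unschedulable(False)   # the pod now schedules and the flow run completes
```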
Error
No response
Versions
`2.15.0`
Additional context
The workaround is to increase the Pod Watch Timeout on the work pool to a value large enough to ensure the pod has time to schedule; see the example below.
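A sketch of applying this workaround per deployment, assuming the work pool's base job template exposes the timeout as the `pod_watch_timeout_seconds` job variable; the deployment, pool, and image names here are placeholders:

```python
from prefect import flow

@flow
def my_flow():
    ...

if __name__ == "__main__":
    my_flow.deploy(
        name="k8s-slow-scheduling",
        work_pool_name="my-k8s-pool",         # placeholder work pool name
        image="my-registry/my-image:latest",  # placeholder, pre-built image
        build=False,
        push=False,
        # Give the scheduler up to 10 minutes before the worker gives up.
        job_variables={"pod_watch_timeout_seconds": 600},
    )
```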