Handle failed Kubernetes scheduling events more gracefully

Open zangell44 opened this issue 1 year ago • 1 comments

First check

  • [X] I added a descriptive title to this issue.
  • [X] I used the GitHub search to find a similar issue and didn't find it.
  • [X] I searched the Prefect documentation for this issue.
  • [X] I checked that this issue is related to Prefect and not one of its dependencies.

Bug summary

When a Kubernetes cluster is unable to schedule a pod in time, flow runs are reported as Crashed but subsequently complete successfully.

Here is the log output from an example run:

[image: log output screenshot]

Note that the message `Reported flow run 'c605bcd5-8048-4140-bea4-0d04f0a2af2c' as crashed: Flow run infrastructure exited with non-zero status code -1.` is logged after the pod is scheduled successfully.

Reproduction

Reproducing this can be a bit involved, but the behavior should be understandable from the Kubernetes events without a full reproduction:

- Start a Kubernetes worker
- Have the worker start a flow run in a cluster with no nodes available
- After ~2 minutes, ensure a node is available on the cluster
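To make the timing in these steps concrete, here is a minimal Python sketch of how one might check whether a pod's scheduling delay outlasted the worker's watch window. It assumes the pod's events have already been collected (for example from `kubectl get events`) and simplified into `(reason, timestamp)` tuples; the function names and the tuple shape are hypothetical, and only the `FailedScheduling` and `Scheduled` reasons are taken from Kubernetes itself.

```python
from datetime import datetime, timedelta


def scheduling_delay_seconds(events):
    """Seconds between the first FailedScheduling and the first Scheduled event.

    `events` is a list of (reason, timestamp) tuples, a simplified stand-in
    for the events Kubernetes records for the flow run's pod.
    Returns None if either kind of event is missing.
    """
    failed = [ts for reason, ts in events if reason == "FailedScheduling"]
    scheduled = [ts for reason, ts in events if reason == "Scheduled"]
    if not failed or not scheduled:
        return None
    return (min(scheduled) - min(failed)).total_seconds()


def exceeds_watch_timeout(events, pod_watch_timeout_seconds):
    """True if the pod took longer to schedule than the watch timeout allows."""
    delay = scheduling_delay_seconds(events)
    return delay is not None and delay > pod_watch_timeout_seconds


# Illustrative timestamps only: a node becomes available about 2 minutes in.
start = datetime(2024, 2, 23, 14, 0, 0)
example_events = [
    ("FailedScheduling", start),
    ("Scheduled", start + timedelta(seconds=120)),
]
print(exceeds_watch_timeout(example_events, pod_watch_timeout_seconds=60))
```

If this check returns true for a run that later completed, the Crashed report most likely came from the watch timing out rather than from the flow itself failing.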

Error

No response

Versions

`2.15.0`

Additional context

No response

zangell44, Feb 23 '24 14:02

A workaround is to increase the Pod Watch Timeout on the work pool to a value large enough to give the pod time to schedule.
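As a rough illustration of the workaround, the sketch below derives a timeout from the longest scheduling delay you expect and wraps it in a job-variables override. It assumes the Pod Watch Timeout is exposed as the `pod_watch_timeout_seconds` job variable on the Kubernetes work pool; check your work pool's base job template to confirm the name. The helper function and the safety factor are hypothetical, not part of Prefect.

```python
import math


def job_variables_for_slow_scheduling(expected_delay_seconds, safety_factor=2.0):
    """Build a job-variables override with a padded pod watch timeout.

    `expected_delay_seconds` is the longest time you expect a pod to wait
    for a node; `safety_factor` pads that estimate. Both are assumptions
    you supply, not values read from the cluster.
    """
    if expected_delay_seconds < 0:
        raise ValueError("expected_delay_seconds must be non-negative")
    timeout = math.ceil(expected_delay_seconds * safety_factor)
    # Assumed job variable name for the work pool's Pod Watch Timeout.
    return {"pod_watch_timeout_seconds": timeout}


# For a cluster where nodes take roughly 2 minutes to become available:
print(job_variables_for_slow_scheduling(120))
```

The resulting dictionary could be supplied as a deployment's job variables, or the same value can be entered directly in the work pool settings.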

zangell44, May 02 '24 20:05