prefect
Add detail to k8s agent logs when pod never starts (and maybe raise FAILED state?)
First check
- [X] I added a descriptive title to this issue.
- [X] I used the GitHub search to find a similar request and didn't find it.
- [X] I searched the Prefect documentation for this feature.
Prefect Version
2.x
Describe the proposed behavior
If there is an issue with the k8s agent setup (e.g. an instance of a kubernetes-job block references an incorrect cluster service account) such that the pod is never able to start, we could surface the job events in the agent logs, as kubectl describe job tuscan-flamingojqdmg- would:
....
Events:
Type     Reason        Age                    From            Message
----     ------        ----                   ----            -------
Warning  FailedCreate  3m30s (x27 over 134m)  job-controller  Error creating: pods "tuscan-flamingojqdmg-" is forbidden: error looking up service account prefect-2/prefect-agent-1659997862: serviceaccount "prefect-agent-1659997862" not found
... and mark the corresponding flow as Failed
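To make the proposal concrete, here is a rough sketch of how the job's Warning events could be pulled and turned into log lines. It is not how Prefect does anything today. It assumes the official kubernetes Python client is installed and already configured; the function names and the log line format are made up for illustration.

```python
from types import SimpleNamespace
from typing import Iterable, List


def fetch_job_events(namespace: str, job_name: str) -> list:
    """Fetch events whose involved object is the given job.

    Assumption: the `kubernetes` client is installed and a kubeconfig or
    in-cluster config has already been loaded by the caller.
    """
    # Imported lazily so the formatting helper below works without the client.
    from kubernetes import client

    selector = f"involvedObject.kind=Job,involvedObject.name={job_name}"
    response = client.CoreV1Api().list_namespaced_event(
        namespace, field_selector=selector
    )
    return response.items


def summarize_warning_events(job_name: str, events: Iterable) -> List[str]:
    """Turn Warning events into log lines (hypothetical format)."""
    lines = []
    for event in events:
        if event.type != "Warning":
            continue  # Normal events are not interesting for a failed start
        count = event.count or 1
        lines.append(
            f"Job {job_name!r}: {event.reason} (x{count}): {event.message}"
        )
    return lines


if __name__ == "__main__":
    # Stand-in for a real event object, using the message from this issue.
    sample = [
        SimpleNamespace(
            type="Warning",
            reason="FailedCreate",
            count=27,
            message='Error creating: pods "tuscan-flamingojqdmg-" is forbidden',
        )
    ]
    for line in summarize_warning_events("tuscan-flamingojqdmg-", sample):
        print(line)
```

The agent could log these lines next to the existing "Pod never started" error, so the cause is visible without running kubectl by hand.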
Describe the current behavior
Agent logs just state that the pod never started, and the flow run stays in the Pending state:
23:04:48.021 | DEBUG | prefect.agent - Checking for flow runs...
23:04:48.147 | INFO | prefect.agent - Submitting flow run 'cd17f57d-13ce-4a8c-92ef-475d0e1067d4'
23:04:53.153 | DEBUG | prefect.agent - Checking for flow runs...
23:04:56.097 | ERROR | prefect.infrastructure.kubernetes-job - Job 'tuscan-flamingojqdmg-': Pod never started.
Example Use
No response
Additional context
Using the Helm chart for the k8s agent as found here
Related to PrefectHQ/prefect-kubernetes#90 and PrefectHQ/prefect#5489
Maybe we should use the same pattern as the ECS block, where we wait until the task starts and do not report the task as started until that occurs:
- https://github.com/PrefectHQ/prefect-aws/blob/8f704436c64858d7e43254ab5f8b6249eea984e3/prefect_aws/ecs.py#L465-L469
- https://github.com/PrefectHQ/prefect-aws/blob/8f704436c64858d7e43254ab5f8b6249eea984e3/prefect_aws/ecs.py#L375-L376
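A minimal sketch of that "wait until it actually starts" pattern, applied to a pod. This is not the ECS block's code and not Prefect's API: wait_for_pod_start and get_phase are hypothetical names, and get_phase is assumed to return the pod's phase, or None while no pod exists yet.

```python
import time
from typing import Callable, Optional

# Phases that mean the pod got past scheduling and container creation.
STARTED_PHASES = {"Running", "Succeeded", "Failed"}


def wait_for_pod_start(
    get_phase: Callable[[], Optional[str]],
    timeout: float = 60.0,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the pod leaves Pending; return False if the timeout passes.

    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        if get_phase() in STARTED_PHASES:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)


if __name__ == "__main__":
    # Simulated pod that is created on the second poll and runs on the third.
    phases = iter([None, "Pending", "Running"])
    started = wait_for_pod_start(lambda: next(phases), poll_interval=0)
    print("started" if started else "never started")
```

The idea would be that the agent only reports the run as started when this returns True, and otherwise moves the flow run to a failed state instead of leaving it in Pending.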
We experience the same issue on 2.3.2. It would be great to have the run reported as failed and to get the related logs back in the run.
I thought I'd have time to work on this but did not. This is open for contribution.