prefect Add detail to k8s agent logs when pod never starts (and maybe raise FAILED state?)

Open zzstoatzz opened this issue 2 years ago • 2 comments

First check

[X] I added a descriptive title to this issue.
[X] I used the GitHub search to find a similar request and didn't find it.
[X] I searched the Prefect documentation for this feature.

Prefect Version

2.x

Describe the proposed behavior

If there is an issue with the k8s agent setup (e.g. a kubernetes-job block references the wrong cluster service account) such that the pod can never start, the agent could surface the job events, as kubectl describe job tuscan-flamingojqdmg- would:

....
Events:
  Type     Reason        Age                    From            Message
  ----     ------        ----                   ----            -------
  Warning  FailedCreate  3m30s (x27 over 134m)  job-controller  Error creating: pods "tuscan-flamingojqdmg-" is forbidden: error looking up service account prefect-2/prefect-agent-1659997862: serviceaccount "prefect-agent-1659997862" not found

... and mark the corresponding flow run as Failed.
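A rough sketch of how the agent might collect those events, assuming it runs inside the cluster and has the kubernetes Python client installed. The helper names format_job_events and fetch_job_events are hypothetical and not part of Prefect; only the list_namespaced_event call comes from the kubernetes client.

```python
from types import SimpleNamespace


def format_job_events(events):
    """Render Warning events as log lines, roughly like `kubectl describe job`."""
    lines = []
    for event in events:
        if event.type != "Warning":
            continue  # skip Normal events; only failures are interesting here
        count = event.count or 1
        lines.append(f"{event.type} {event.reason} (x{count}): {event.message}")
    return lines


def fetch_job_events(job_name, namespace):
    """Hypothetical helper: list events whose involved object is the given Job.

    Assumes in-cluster credentials and the `kubernetes` package.
    """
    from kubernetes import client, config  # imported lazily, optional dependency

    config.load_incluster_config()
    core = client.CoreV1Api()
    selector = f"involvedObject.kind=Job,involvedObject.name={job_name}"
    return core.list_namespaced_event(namespace, field_selector=selector).items


# Stand-in events shaped like the client's event objects, for illustration only.
sample_events = [
    SimpleNamespace(
        type="Warning",
        reason="FailedCreate",
        count=27,
        message='serviceaccount "prefect-agent-1659997862" not found',
    ),
    SimpleNamespace(type="Normal", reason="Started", count=None, message="ok"),
]
for line in format_job_events(sample_events):
    print(line)
```

The agent could log these lines next to the existing "Pod never started" error, and reuse the same text as the message of a Failed state.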

Describe the current behavior

The agent logs only state that the pod never started, and the flow run stays in the Pending state:

23:04:48.021 | DEBUG   | prefect.agent - Checking for flow runs...
23:04:48.147 | INFO    | prefect.agent - Submitting flow run 'cd17f57d-13ce-4a8c-92ef-475d0e1067d4'
23:04:53.153 | DEBUG   | prefect.agent - Checking for flow runs...
23:04:56.097 | ERROR   | prefect.infrastructure.kubernetes-job - Job 'tuscan-flamingojqdmg-': Pod never started.

Example Use

No response

Additional context

Using the Helm chart for the k8s agent, as found here

Aug 11 '22 23:08 zzstoatzz

Related to PrefectHQ/prefect-kubernetes#90 and PrefectHQ/prefect#5489

Sep 18 '22 23:09 tekumara

Maybe we should use the same pattern as the ECS block, where we wait until the task starts and do not report the task as started until that occurs:

https://github.com/PrefectHQ/prefect-aws/blob/8f704436c64858d7e43254ab5f8b6249eea984e3/prefect_aws/ecs.py#L465-L469
https://github.com/PrefectHQ/prefect-aws/blob/8f704436c64858d7e43254ab5f8b6249eea984e3/prefect_aws/ecs.py#L375-L376
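A minimal sketch of what that wait-until-started pattern could look like for the Kubernetes job. This is not the ECS block's actual code: wait_for_pod_start, PodNeverStarted, and the two callables are hypothetical names, and the real pod check and event lookup are left to the caller.

```python
import time


class PodNeverStarted(RuntimeError):
    """Raised when no pod is observed as started before the timeout."""


def wait_for_pod_start(
    is_started,
    describe_failure,
    timeout=5.0,
    interval=1.0,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Block until is_started() returns True, or raise with failure details.

    is_started: callable returning True once the pod is running (assumed).
    describe_failure: callable returning a string such as formatted job events.
    """
    deadline = clock() + timeout
    while True:
        if is_started():
            return  # only now would the caller report the run as started
        if clock() >= deadline:
            break
        sleep(interval)
    raise PodNeverStarted("Pod never started. " + describe_failure())
```

The caller could catch PodNeverStarted and use its message both for the agent log and for a Failed state on the flow run, instead of leaving the run in Pending.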

Sep 19 '22 01:09 zanieb

We experience the same issue on 2.3.2. It would be great to have the run reported as failed and to get those logs back in the run.

Sep 28 '22 10:09 avishniakov

I thought I'd have time to work on this but did not. This is open for contribution.

Sep 28 '22 14:09 zanieb
