
Add detail to k8s agent logs when pod never starts (and maybe raise FAILED state?)

Open zzstoatzz opened this issue 2 years ago • 2 comments

First check

  • [X] I added a descriptive title to this issue.
  • [X] I used the GitHub search to find a similar request and didn't find it.
  • [X] I searched the Prefect documentation for this feature.

Prefect Version

2.x

Describe the proposed behavior

If there is an issue with the k8s agent setup (e.g. a kubernetes-job block instance references an incorrect cluster service account) such that the pod can never start, the agent could surface the job's events, as kubectl describe job tuscan-flamingojqdmg- would:

....
Events:
  Type     Reason        Age                    From            Message
  ----     ------        ----                   ----            -------
  Warning  FailedCreate  3m30s (x27 over 134m)  job-controller  Error creating: pods "tuscan-flamingojqdmg-" is forbidden: error looking up service account prefect-2/prefect-agent-1659997862: serviceaccount "prefect-agent-1659997862" not found

... and mark the corresponding flow run as Failed.
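As a rough illustration of the idea above, here is a minimal sketch of how the agent might collect a job's Warning events and turn them into log lines. It assumes the official Kubernetes Python client and an agent running inside the cluster; the function names format_job_events and fetch_job_events are hypothetical and are not part of Prefect.

```python
from typing import Iterable, List, Mapping


def format_job_events(events: Iterable[Mapping[str, str]]) -> List[str]:
    """Render Warning events as one log line each; other event types are skipped."""
    lines = []
    for event in events:
        if event.get("type") != "Warning":
            continue
        lines.append(f"{event.get('reason')}: {event.get('message')}")
    return lines


def fetch_job_events(job_name: str, namespace: str) -> List[dict]:
    """Fetch the events attached to a Job (hypothetical helper, not Prefect API).

    Assumes the `kubernetes` package is installed and that the caller runs
    inside the cluster with permission to list events in `namespace`.
    """
    # Imported lazily so format_job_events stays usable without the client.
    from kubernetes import client, config

    config.load_incluster_config()
    core = client.CoreV1Api()
    # Select only the events whose involved object is this Job.
    selector = f"involvedObject.kind=Job,involvedObject.name={job_name}"
    result = core.list_namespaced_event(namespace, field_selector=selector)
    return [
        {"type": e.type, "reason": e.reason, "message": e.message}
        for e in result.items
    ]
```

The agent could log format_job_events(fetch_job_events(job_name, namespace)) next to the existing "Pod never started" error. Keeping the formatting separate from the cluster call makes it easy to test without a cluster.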

Describe the current behavior

The agent logs only state that the pod never started, and the flow run stays in a Pending state:

23:04:48.021 | DEBUG   | prefect.agent - Checking for flow runs...
23:04:48.147 | INFO    | prefect.agent - Submitting flow run 'cd17f57d-13ce-4a8c-92ef-475d0e1067d4'
23:04:53.153 | DEBUG   | prefect.agent - Checking for flow runs...
23:04:56.097 | ERROR   | prefect.infrastructure.kubernetes-job - Job 'tuscan-flamingojqdmg-': Pod never started.

Example Use

No response

Additional context

Using the Helm chart for the k8s agent, as found here.

zzstoatzz avatar Aug 11 '22 23:08 zzstoatzz

Related to PrefectHQ/prefect-kubernetes#90 and PrefectHQ/prefect#5489

tekumara avatar Sep 18 '22 23:09 tekumara

Maybe we should use the same pattern as the ECS block, where we wait until the task starts and do not report the task as started until that occurs:

  • https://github.com/PrefectHQ/prefect-aws/blob/8f704436c64858d7e43254ab5f8b6249eea984e3/prefect_aws/ecs.py#L465-L469
  • https://github.com/PrefectHQ/prefect-aws/blob/8f704436c64858d7e43254ab5f8b6249eea984e3/prefect_aws/ecs.py#L375-L376
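For the Kubernetes side, the wait-until-started pattern could look roughly like the sketch below. It is not taken from prefect-aws or from Prefect: wait_for_pod_start, PodNeverStarted, and the get_phase callable are hypothetical names, and the caller is assumed to supply a function that returns the pod's current phase (returning "Pending" while no pod exists yet).

```python
import time
from typing import Callable


class PodNeverStarted(RuntimeError):
    """Raised when the pod does not reach a started phase (hypothetical)."""


def wait_for_pod_start(
    get_phase: Callable[[], str],
    timeout: float = 60.0,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll the pod phase and return only once the pod has started.

    Raises PodNeverStarted if the pod fails or the timeout elapses, so the
    caller can mark the flow run as failed instead of leaving it Pending.
    """
    deadline = clock() + timeout
    while True:
        phase = get_phase()
        if phase in ("Running", "Succeeded"):
            return phase
        if phase == "Failed":
            raise PodNeverStarted("Pod entered the Failed phase before starting.")
        if clock() >= deadline:
            raise PodNeverStarted(f"Pod still {phase!r} after {timeout} seconds.")
        sleep(poll_interval)
```

The run would only be reported as started after this function returns, and the exception gives the caller a single place to set a failed state and attach any extra detail.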

zanieb avatar Sep 19 '22 01:09 zanieb

We experience the same issue on 2.3.2. It would be good to have the run reported as failed and to get the related logs back in the run.

avishniakov avatar Sep 28 '22 10:09 avishniakov

I thought I'd have time to work on this but did not. This is open for contribution.

zanieb avatar Sep 28 '22 14:09 zanieb