agent-stack-k8s

Jobs that ran into OOM issues still appear as running in the Buildkite UI

Open wallyqs opened this issue 1 year ago • 1 comment

For example, the container-0 container of a job was OOM killed, so it has already exited:

  - containerID: containerd://868661c9da807af9428729518d1c95a52c1bb5efac68df8799cd6b24b475125c
    image: docker.io/library/golang:1.20-alpine
    imageID: docker.io/library/golang@sha256:59fc0dc542a38bb5b94cd1529e5f4663b4e7cc2f4a6c352b826dafe00d820031
    lastState: {}
    name: container-0
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://868661c9da807af9428729518d1c95a52c1bb5efac68df8799cd6b24b475125c
        exitCode: 137
        finishedAt: "2023-07-14T10:40:18Z"
        reason: OOMKilled
        startedAt: "2023-07-14T10:31:49Z"

But in the Buildkite UI the job still appears as running:

[image: the job shown as still running in the Buildkite UI]

Maybe the controller needs a way to detect OOM events in jobs so that it can clean them up?

wallyqs, Jul 14 '23 13:07

Thanks for raising this @wallyqs. We have a plan for how to proceed. It involves detecting OOM-killed containers from the controller and cancelling the corresponding jobs on Buildkite. We'll let you know when this is implemented. Let us know if there is anything else we should catch and clean up for OOM-killed jobs.
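For illustration only, here is a minimal Python sketch of the detection half of that plan. It is not the controller's actual implementation, which is not shown in this thread. It assumes the pod's `containerStatuses` have already been loaded into plain dictionaries shaped like the excerpt above; the function name `oom_killed_containers` and the second container named `agent` are hypothetical. Cancelling the job on Buildkite is left as a comment, since that call is not described here.

```python
from typing import Any

OOM_REASON = "OOMKilled"  # the reason Kubernetes reports, as in the status above


def oom_killed_containers(container_statuses: list[dict[str, Any]]) -> list[str]:
    """Return the names of containers that were terminated with reason OOMKilled.

    Both `state` and `lastState` are checked, because a restarted container
    reports its previous termination under `lastState`.
    """
    names = []
    for status in container_statuses:
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == OOM_REASON:
                names.append(status.get("name", "<unnamed>"))
                break  # count each container at most once
    return names


# Sample input mirroring the status excerpt from this issue.
# The "agent" entry is a hypothetical second container that is still running.
statuses = [
    {
        "name": "container-0",
        "restartCount": 0,
        "lastState": {},
        "state": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
    },
    {"name": "agent", "state": {"running": {"startedAt": "2023-07-14T10:31:49Z"}}},
]

killed = oom_killed_containers(statuses)
print(killed)  # ['container-0']
# A controller would cancel the corresponding Buildkite job at this point.
```

The sketch matches on `reason` rather than on `exitCode`: 137 only says the process received SIGKILL (128 + 9), which can happen for reasons other than running out of memory.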

triarius, Jul 19 '23 03:07