agent-stack-k8s
Jobs that ran into OOM issues appear as still running in buildkite UI
For example, this `container-0` container ran into an OOM, so it has already exited:
```yaml
- containerID: containerd://868661c9da807af9428729518d1c95a52c1bb5efac68df8799cd6b24b475125c
  image: docker.io/library/golang:1.20-alpine
  imageID: docker.io/library/golang@sha256:59fc0dc542a38bb5b94cd1529e5f4663b4e7cc2f4a6c352b826dafe00d820031
  lastState: {}
  name: container-0
  ready: false
  restartCount: 0
  started: false
  state:
    terminated:
      containerID: containerd://868661c9da807af9428729518d1c95a52c1bb5efac68df8799cd6b24b475125c
      exitCode: 137
      finishedAt: "2023-07-14T10:40:18Z"
      reason: OOMKilled
      startedAt: "2023-07-14T10:31:49Z"
```
But in the Buildkite UI the job still appears to be running.
Perhaps the controller needs a way to detect OOM events in the jobs so it can clean them up?
Thanks for raising this @wallyqs. We have a plan for how to proceed: detect OOM-killed containers from the controller and cancel the corresponding jobs on Buildkite. We'll let you know when this is implemented. Let us know if there is anything else we should catch and clean up for OOM-killed jobs.
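The cleanup described above could be sketched roughly as follows. Everything here is an assumption for illustration, not the actual agent-stack-k8s implementation: the `JobPod` type and the `cancel` callback (which stands in for whatever Buildkite API call the controller would make) are hypothetical.

```go
package main

import "fmt"

// Terminated and Container are illustrative stand-ins for the Kubernetes
// container status fields; JobPod pairs a pod's containers with the
// Buildkite job it runs (hypothetical shape, not the real controller types).
type Terminated struct {
	ExitCode int32
	Reason   string
}

type Container struct {
	Name       string
	Terminated *Terminated // nil while the container is still running
}

type JobPod struct {
	BuildkiteJobID string
	Containers     []Container
}

// reapOOMKilled cancels every job whose pod has an OOM-killed container.
// cancel stands in for the Buildkite API call; it returns the IDs of the
// jobs that were successfully cancelled.
func reapOOMKilled(pods []JobPod, cancel func(jobID, container string) error) []string {
	var cancelled []string
	for _, p := range pods {
		for _, c := range p.Containers {
			if c.Terminated != nil && c.Terminated.Reason == "OOMKilled" {
				if err := cancel(p.BuildkiteJobID, c.Name); err == nil {
					cancelled = append(cancelled, p.BuildkiteJobID)
				}
				break // one cancellation per job is enough
			}
		}
	}
	return cancelled
}

func main() {
	pods := []JobPod{
		{BuildkiteJobID: "job-1", Containers: []Container{
			{Name: "container-0", Terminated: &Terminated{ExitCode: 137, Reason: "OOMKilled"}},
		}},
		{BuildkiteJobID: "job-2", Containers: []Container{
			{Name: "container-0"}, // still running
		}},
	}
	got := reapOOMKilled(pods, func(id, c string) error {
		fmt.Printf("cancelling %s (container %s OOM-killed)\n", id, c)
		return nil
	})
	fmt.Println(got) // prints "[job-1]"
}
```

In a real controller this loop would run from a pod informer's update handler rather than over a static slice, so cancellations fire as soon as the kubelet reports the `OOMKilled` termination.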