fix: lingering GPU pods on cluster restart

Open CollectiveUnicorn opened this issue 2 years ago • 0 comments

Bug GPU pods linger with the status UnexpectedAdmissionError after a cluster restart, and replacement replicas are created alongside them.

Expected No pods with UnexpectedAdmissionError linger in the cluster on restart.

Context When the cluster is restarted, the existing GPU pods appear to fail with UnexpectedAdmissionError because no GPUs can be allocated to them. The suspected cause is a race condition with the nvidia-device-plugin daemonset, which is required to allocate GPUs: presumably the pods are admitted before the plugin is ready again. Replacement replicas are then created and start successfully, but the failed pods remain until they are deleted manually. The leftover pods do not affect functionality, but they do cause confusion.
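Until there is a proper fix, the manual cleanup described above could be scripted. The following is only a sketch, not a proposed fix: it assumes the lingering pods report `status.phase` as `Failed` and `status.reason` as `UnexpectedAdmissionError` in the output of `kubectl get pods -A -o json`. The helper names `find_stale_pods` and `delete_commands` and the sample pod names are hypothetical.

```python
from typing import Any

# Assumption: the kubelet records this string in status.reason for the lingering pods.
STALE_REASON = "UnexpectedAdmissionError"


def find_stale_pods(pod_list: dict[str, Any], reason: str = STALE_REASON) -> list[tuple[str, str]]:
    """Return (namespace, name) for every Failed pod whose status.reason matches."""
    stale = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Failed" and status.get("reason") == reason:
            meta = pod.get("metadata", {})
            stale.append((meta.get("namespace", "default"), meta["name"]))
    return stale


def delete_commands(stale: list[tuple[str, str]]) -> list[str]:
    """Build one `kubectl delete pod` command per stale pod (nothing is executed)."""
    return [f"kubectl delete pod {name} -n {namespace}" for namespace, name in stale]


# Hypothetical sample shaped like the JSON printed by `kubectl get pods -A -o json`.
sample = {
    "items": [
        {
            "metadata": {"namespace": "ml", "name": "gpu-a"},
            "status": {"phase": "Failed", "reason": "UnexpectedAdmissionError"},
        },
        {
            "metadata": {"namespace": "ml", "name": "gpu-b"},
            "status": {"phase": "Running"},
        },
    ]
}

for command in delete_commands(find_stale_pods(sample)):
    print(command)
```

The sketch only prints the delete commands so they can be reviewed before being run. In practice the `sample` dictionary would be replaced by the parsed output of `kubectl get pods -A -o json`.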

Reproduce Restart a cluster with running GPU pods.

Apr 03 '24 16:04 CollectiveUnicorn