k8s-device-plugin seems to think a GPU is healthy when it is not usable due to an Uncorrectable ECC Error
1. Issue or feature description
When one GPU on the host becomes unusable due to an Uncorrectable ECC Error, k8s-device-plugin still seems to consider it healthy. My pods keep being scheduled onto this GPU, and every time they fail with RuntimeError: CUDA error: out of memory.
2. Steps to reproduce the issue
An Uncorrectable ECC Error cannot be reproduced easily. But once a GPU has run into one, you can submit a pod on that node and reproduce the CUDA error.
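For illustration, reproducing it just means getting a GPU pod scheduled onto the affected node. Below is a minimal client-go sketch of that step; the node name, namespace, image, and kubeconfig path are placeholders and not taken from this report.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from a local kubeconfig (path is a placeholder).
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// A pod pinned to the node with the faulty GPU, requesting one GPU.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "ecc-repro"},
		Spec: corev1.PodSpec{
			NodeName:      "node-with-ecc-error", // placeholder node name
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:    "cuda",
				Image:   "nvidia/cuda:12.2.0-base-ubuntu22.04", // placeholder; any CUDA workload image works
				Command: []string{"sleep", "infinity"},
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						corev1.ResourceName("nvidia.com/gpu"): resource.MustParse("1"),
					},
				},
			}},
		},
	}

	created, err := clientset.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("created pod %s; running a CUDA workload in it hits the error\n", created.Name)
}
```

Any GPU workload started in that pod then fails with the out-of-memory error described above.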
3. Information to attach (optional if deemed irrelevant)
nvidia-smi output:
You can see that GPU 1 has 1 Volatile Uncorr. ECC error.
Is it possible for k8s-device-plugin to mark GPU 1 as unhealthy, so that pods will not be scheduled onto this GPU until the Uncorrectable ECC Error is fixed?
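For reference, the counter that nvidia-smi shows can also be read programmatically. Here is a minimal sketch using the github.com/NVIDIA/go-nvml bindings (not the bindings the plugin currently uses) to print the volatile uncorrected ECC count for every GPU:

```go
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("failed to initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("failed to get device count: %v", nvml.ErrorString(ret))
	}

	for i := 0; i < count; i++ {
		device, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			log.Fatalf("failed to get device %d: %v", i, nvml.ErrorString(ret))
		}
		// Volatile uncorrected ECC errors: the counter nvidia-smi shows
		// under "Volatile Uncorr. ECC" (it resets on reboot / GPU reset).
		errs, ret := device.GetTotalEccErrors(nvml.MEMORY_ERROR_TYPE_UNCORRECTED, nvml.VOLATILE_ECC)
		if ret != nvml.SUCCESS {
			// Some GPUs do not support ECC; skip them.
			continue
		}
		fmt.Printf("GPU %d: volatile uncorrected ECC errors = %d\n", i, errs)
	}
}
```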
Hi @tingweiwu. Thanks for reporting this. The device plugin marks GPUs unhealthy based on error events, and it could be that we are missing this particular one. I will have a look to see whether this is the case.
Yes, that very well could be the case. This has been a long-standing "FIXME" in the code: https://github.com/NVIDIA/k8s-device-plugin/blob/master/cmd/nvidia-device-plugin/nvidia.go#L205
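For context, the health check is essentially an NVML event loop: register each device for critical XID events and mark it unhealthy when one arrives. Below is a simplified sketch of that pattern written against the github.com/NVIDIA/go-nvml bindings rather than the gpu-monitoring-tools bindings used in the linked code; the unhealthy channel and the error handling are illustrative, not the plugin's actual implementation.

```go
package health

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// watchXidEvents marks a device unhealthy when a critical XID event arrives.
// Simplified sketch, not the plugin's actual code.
func watchXidEvents(uuids []string, unhealthy chan<- string, stop <-chan struct{}) error {
	set, ret := nvml.EventSetCreate()
	if ret != nvml.SUCCESS {
		return fmt.Errorf("failed to create event set: %v", nvml.ErrorString(ret))
	}
	defer set.Free()

	// Register every device, but only for critical XID events. This is why
	// a double-bit ECC error (Xid 48) can slip through without the device
	// ever being marked unhealthy.
	for _, uuid := range uuids {
		device, ret := nvml.DeviceGetHandleByUUID(uuid)
		if ret != nvml.SUCCESS {
			return fmt.Errorf("failed to get device %s: %v", uuid, nvml.ErrorString(ret))
		}
		if ret := device.RegisterEvents(uint64(nvml.EventTypeXidCriticalError), set); ret != nvml.SUCCESS {
			return fmt.Errorf("failed to register events for %s: %v", uuid, nvml.ErrorString(ret))
		}
	}

	for {
		select {
		case <-stop:
			return nil
		default:
		}

		event, ret := set.Wait(5000)
		if ret == nvml.ERROR_TIMEOUT {
			continue
		}
		if ret != nvml.SUCCESS {
			return fmt.Errorf("failed to wait for event: %v", nvml.ErrorString(ret))
		}
		if event.EventType != uint64(nvml.EventTypeXidCriticalError) {
			// Events the loop did not register for (or does not recognize)
			// are skipped; the "FIXME" above is about which XIDs should and
			// should not be treated as fatal at this point.
			continue
		}
		uuid, ret := event.Device.GetUUID()
		if ret != nvml.SUCCESS {
			continue
		}
		unhealthy <- uuid
	}
}
```

The consumer of the unhealthy channel would then resend the device list to the kubelet with that GPU marked unhealthy, which is what keeps new pods off it.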
@tingweiwu I have confirmed that the Xid=48 error is generated as an `nvmlEventTypeDoubleBitEccError` and not an `nvmlEventTypeXidCriticalError` (which is what the device plugin listens for).
I have created an internal ticket to track how this is handled.
Note that in the case of the V100 you could continue using the GPU if you retire the affected pages: https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html
We encountered the same problem. As @elezar says, the ECC errors are generated as `nvmlEventTypeDoubleBitEccError` and `nvmlEventTypeSingleBitEccError`, but https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/bindings/go/nvml/bindings.go#L48 only defines `XidCriticalError = C.nvmlEventTypeXidCriticalError`.
To fix the issue, gpu-monitoring-tools first needs to define `DoubleBitEccError = C.nvmlEventTypeDoubleBitEccError` and `SingleBitEccError = C.nvmlEventTypeSingleBitEccError`, and then the device plugin needs to handle the ECC events.
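In other words, the proposed fix boils down to widening the event mask and then treating ECC events as unhealthy as well. Here is a short sketch of what that could look like, using the go-nvml constant names as stand-ins for the definitions that would have to be added to gpu-monitoring-tools:

```go
package health

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// eccAwareEventMask widens the mask from critical XIDs only to critical XIDs
// plus both ECC event types. Constant names follow go-nvml; the
// gpu-monitoring-tools bindings would need the equivalent definitions added.
var eccAwareEventMask = uint64(nvml.EventTypeXidCriticalError |
	nvml.EventTypeDoubleBitEccError |
	nvml.EventTypeSingleBitEccError)

// registerHealthEvents registers one device for the widened event mask.
func registerHealthEvents(device nvml.Device, set nvml.EventSet) error {
	if ret := device.RegisterEvents(eccAwareEventMask, set); ret != nvml.SUCCESS {
		return fmt.Errorf("failed to register health events: %v", nvml.ErrorString(ret))
	}
	return nil
}

// isUnhealthyEvent reports whether an event should mark the GPU unhealthy:
// critical XIDs as today, plus single- and double-bit ECC errors.
func isUnhealthyEvent(event nvml.EventData) bool {
	switch event.EventType {
	case uint64(nvml.EventTypeXidCriticalError),
		uint64(nvml.EventTypeDoubleBitEccError),
		uint64(nvml.EventTypeSingleBitEccError):
		return true
	}
	return false
}
```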
We are quite afflicted by this "bug". We run a large number of AI jobs each week, and we see this happen roughly once a week, causing around 700 jobs to fail.
Right now we don't have a good way of mitigating the problem besides running custom health-check scripts and marking the node unhealthy. If the device plugin could handle this error, that would be really nice.
@elezar any progress on the internal ticket? :)
@sazo with the release of v0.13.0 of the device plugin we have much of the work in place to make progress on this. We also added logging around the events that are detected and skipped, so if you have logs from a v0.13.0 device plugin where you are seeing these ECC errors, those would be useful to move things forward.
Sorry to bother you here about a slightly related inquiry. Assuming that the plugin is able to detect unhealthy GPUs, what is the action taken? Is there a way to recover from these errors without evicting and re-creating the nodes (in the case of EKS)? Can you point me to some documentation on this? Thanks!
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.