k8s-device-plugin seems to think a GPU is healthy when it is not usable due to an Uncorrectable ECC Error
1. Issue or feature description
When one GPU on the host becomes unusable due to an Uncorrectable ECC Error, k8s-device-plugin still seems to consider it healthy. My pods keep being scheduled onto this GPU, and every time they fail with RuntimeError: CUDA error: out of memory.
2. Steps to reproduce the issue
An Uncorrectable ECC Error cannot be reproduced easily. But once a GPU has run into one, you can submit a pod on that node and reproduce the CUDA error.
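For illustration, reproducing it just means getting a GPU pod scheduled onto the affected node. Below is a minimal client-go sketch of that step; the node name, namespace, image, and kubeconfig path are placeholders and not taken from this report.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from a local kubeconfig (path is a placeholder).
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// A pod pinned to the node with the faulty GPU, requesting one GPU.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "ecc-repro"},
		Spec: corev1.PodSpec{
			NodeName:      "node-with-ecc-error", // placeholder node name
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:    "cuda",
				Image:   "nvidia/cuda:12.2.0-base-ubuntu22.04", // placeholder; any CUDA workload image works
				Command: []string{"sleep", "infinity"},
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						corev1.ResourceName("nvidia.com/gpu"): resource.MustParse("1"),
					},
				},
			}},
		},
	}

	created, err := clientset.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("created pod %s; running a CUDA workload in it hits the error\n", created.Name)
}
```

Any GPU workload started in that pod then fails with the out-of-memory error described above.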
3. Information to attach (optional if deemed irrelevant)
nvidia-smi output:
You can see that GPU 1 has 1 Volatile Uncorr. ECC error.
Is it possible for k8s-device-plugin to mark GPU 1 as unhealthy, so that pods will not be scheduled onto this GPU until the Uncorrectable ECC Error is fixed?
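For reference, the counter that nvidia-smi shows can also be read programmatically. Here is a minimal sketch using the github.com/NVIDIA/go-nvml bindings (not the bindings the plugin currently uses) to print the volatile uncorrected ECC count for every GPU:

```go
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("failed to initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("failed to get device count: %v", nvml.ErrorString(ret))
	}

	for i := 0; i < count; i++ {
		device, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			log.Fatalf("failed to get device %d: %v", i, nvml.ErrorString(ret))
		}
		// Volatile uncorrected ECC errors: the counter nvidia-smi shows
		// under "Volatile Uncorr. ECC" (it resets on reboot / GPU reset).
		errs, ret := device.GetTotalEccErrors(nvml.MEMORY_ERROR_TYPE_UNCORRECTED, nvml.VOLATILE_ECC)
		if ret != nvml.SUCCESS {
			// Some GPUs do not support ECC; skip them.
			continue
		}
		fmt.Printf("GPU %d: volatile uncorrected ECC errors = %d\n", i, errs)
	}
}
```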
Hi @tingweiwu. Thanks for reporting this. The device plugin marks GPUs unhealthy based on error events, and it could be that we are missing this particular one. I will have a look to see whether this is the case.
Yes, that very well could be the case. This has been a long-standing "FIXME" in the code: https://github.com/NVIDIA/k8s-device-plugin/blob/master/cmd/nvidia-device-plugin/nvidia.go#L205
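For context, the health check is essentially an NVML event loop: register each device for critical XID events and mark it unhealthy when one arrives. Below is a simplified sketch of that pattern written against the github.com/NVIDIA/go-nvml bindings rather than the gpu-monitoring-tools bindings used in the linked code; the unhealthy channel and the error handling are illustrative, not the plugin's actual implementation.

```go
package health

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// watchXidEvents marks a device unhealthy when a critical XID event arrives.
// Simplified sketch, not the plugin's actual code.
func watchXidEvents(uuids []string, unhealthy chan<- string, stop <-chan struct{}) error {
	set, ret := nvml.EventSetCreate()
	if ret != nvml.SUCCESS {
		return fmt.Errorf("failed to create event set: %v", nvml.ErrorString(ret))
	}
	defer set.Free()

	// Register every device, but only for critical XID events. This is why
	// a double-bit ECC error (Xid 48) can slip through without the device
	// ever being marked unhealthy.
	for _, uuid := range uuids {
		device, ret := nvml.DeviceGetHandleByUUID(uuid)
		if ret != nvml.SUCCESS {
			return fmt.Errorf("failed to get device %s: %v", uuid, nvml.ErrorString(ret))
		}
		if ret := device.RegisterEvents(uint64(nvml.EventTypeXidCriticalError), set); ret != nvml.SUCCESS {
			return fmt.Errorf("failed to register events for %s: %v", uuid, nvml.ErrorString(ret))
		}
	}

	for {
		select {
		case <-stop:
			return nil
		default:
		}

		event, ret := set.Wait(5000)
		if ret == nvml.ERROR_TIMEOUT {
			continue
		}
		if ret != nvml.SUCCESS {
			return fmt.Errorf("failed to wait for event: %v", nvml.ErrorString(ret))
		}
		if event.EventType != uint64(nvml.EventTypeXidCriticalError) {
			// Events the loop did not register for (or does not recognize)
			// are skipped; the "FIXME" above is about which XIDs should and
			// should not be treated as fatal at this point.
			continue
		}
		uuid, ret := event.Device.GetUUID()
		if ret != nvml.SUCCESS {
			continue
		}
		unhealthy <- uuid
	}
}
```

The consumer of the unhealthy channel would then resend the device list to the kubelet with that GPU marked unhealthy, which is what keeps new pods off it.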
@tingweiwu I have confirmed that the Xid=48 error is generated as an `nvmlEventTypeDoubleBitEccError` and not an `nvmlEventTypeXidCriticalError` (which is what the device plugin listens for).
I have created an internal ticket to track how this is handled.
Note that in the case of the V100 you could continue using the GPU if you retire the affected pages: https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html
We encountered the same problem. As @elezar says, the ECC errors are generated as `nvmlEventTypeDoubleBitEccError` and `nvmlEventTypeSingleBitEccError`, but https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/bindings/go/nvml/bindings.go#L48 only defines `XidCriticalError = C.nvmlEventTypeXidCriticalError`.
To fix the issue, gpu-monitoring-tools first needs to define `DoubleBitEccError = C.nvmlEventTypeDoubleBitEccError` and `SingleBitEccError = C.nvmlEventTypeSingleBitEccError`, and then the device plugin needs to handle the ECC events.
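In other words, the proposed fix boils down to widening the event mask and then treating ECC events as unhealthy as well. Here is a short sketch of what that could look like, using the go-nvml constant names as stand-ins for the definitions that would have to be added to gpu-monitoring-tools:

```go
package health

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// eccAwareEventMask widens the mask from critical XIDs only to critical XIDs
// plus both ECC event types. Constant names follow go-nvml; the
// gpu-monitoring-tools bindings would need the equivalent definitions added.
var eccAwareEventMask = uint64(nvml.EventTypeXidCriticalError |
	nvml.EventTypeDoubleBitEccError |
	nvml.EventTypeSingleBitEccError)

// registerHealthEvents registers one device for the widened event mask.
func registerHealthEvents(device nvml.Device, set nvml.EventSet) error {
	if ret := device.RegisterEvents(eccAwareEventMask, set); ret != nvml.SUCCESS {
		return fmt.Errorf("failed to register health events: %v", nvml.ErrorString(ret))
	}
	return nil
}

// isUnhealthyEvent reports whether an event should mark the GPU unhealthy:
// critical XIDs as today, plus single- and double-bit ECC errors.
func isUnhealthyEvent(event nvml.EventData) bool {
	switch event.EventType {
	case uint64(nvml.EventTypeXidCriticalError),
		uint64(nvml.EventTypeDoubleBitEccError),
		uint64(nvml.EventTypeSingleBitEccError):
		return true
	}
	return false
}
```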
We are quite afflicted by this "bug". We run a large number of AI jobs each week, and we see this happen roughly once a week, causing around 700 jobs to fail.
Right now we don't have a good way of mitigating the problem besides running custom health-check scripts and marking the node unhealthy. If the device plugin could handle this error, that would be really nice.
@elezar any progress on the internal ticket? :)
@sazo with the release of v0.13.0 of the device plugin we have much of the work in place to make progress on this. We also added logging around the events that are detected and skipped, so if you have logs from a v0.13.0 device plugin where you are seeing these ECC errors, those would be useful to move things forward.
Sorry to bother you here about a slightly related inquiry. Assuming that the plugin is able to detect unhealthy GPUs, what is the action taken? Is there a way to recover from these errors without evicting and re-creating the nodes (in the case of EKS)? Can you point me to some documentation on this? Thanks!
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.