[Enhancement] Support for GPU state anomaly detection

Open XbaoWu opened this issue 8 months ago • 0 comments

What is the problem you're trying to solve

Currently, the device plugin reports the GPU resources on a node only once, at startup. If a GPU on the node later becomes abnormal (for example, the card drops out or enters the draining state), the plugin does not update the amount of GPU resources it reports. For common or known exceptions, I think the device plugin should report the latest available GPU resources.

Describe the solution you'd like

When a pre-identified anomaly is detected, the device plugin could re-trigger its GPU resource report so that the reported amount reflects the GPUs that are actually available.
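
To make the proposal concrete, here is a minimal sketch of the idea in Python. It is not the plugin's actual code: `HealthReporter`, `probe`, `publish`, and the anomaly names in `KNOWN_ANOMALIES` are all hypothetical, chosen only for illustration. The `"Healthy"` / `"Unhealthy"` strings mirror the health values used by the Kubernetes device plugin API. The sketch assumes some probe can tell us whether a GPU currently shows a known anomaly.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

HEALTHY = "Healthy"
UNHEALTHY = "Unhealthy"

# Hypothetical names for the "pre-identified" anomalies from this issue.
KNOWN_ANOMALIES = frozenset({"dropped", "draining"})


@dataclass(frozen=True)
class Device:
    id: str
    health: str


def allocatable(devices: List[Device]) -> int:
    """Number of GPUs that should still be counted as available."""
    return sum(1 for d in devices if d.health == HEALTHY)


class HealthReporter:
    """Re-reports the device list whenever a known anomaly changes it."""

    def __init__(
        self,
        expected_ids: List[str],
        probe: Callable[[str], Optional[str]],    # assumed: anomaly name or None
        publish: Callable[[List[Device]], None],  # assumed: sends the new list
    ) -> None:
        self._expected_ids = expected_ids
        self._probe = probe
        self._publish = publish
        self._last: Optional[List[Device]] = None

    def _snapshot(self) -> List[Device]:
        # Only pre-identified anomalies mark a device unhealthy;
        # anything else is left as healthy in this sketch.
        return [
            Device(i, UNHEALTHY if self._probe(i) in KNOWN_ANOMALIES else HEALTHY)
            for i in self._expected_ids
        ]

    def poll_once(self) -> bool:
        """Probe all GPUs; publish and return True only if the state changed."""
        current = self._snapshot()
        if current == self._last:
            return False
        self._last = current
        self._publish(current)
        return True
```

In a real plugin, `poll_once` would presumably run on a timer or be driven by driver events, and `publish` would correspond to sending an updated device list on the plugin's `ListAndWatch` stream, so the reported GPU count drops when a card is marked unhealthy and recovers when it is healthy again.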

Is this issue worth our attention and resolution? If anyone has alternative proposals or insights, I would welcome an in-depth discussion.

Additional context

No response

Apr 27 '25 14:04 XbaoWu