[Enhancement] Support for GPU state anomaly detection
What is the problem you're trying to solve
At present, the device plugin reports the GPU resources on a node only once, at startup. If a GPU on the node later becomes abnormal (for example, a card drops out or enters the draining state), the device plugin does not update the amount of GPU resources it has reported. For common or known exceptions, I think the device plugin should report the latest available GPU resources.
Describe the solution you'd like
When the device plugin encounters one of a set of pre-identified anomalies, it could re-trigger GPU resource reporting so that the node advertises the updated GPU count.
Is this issue worth addressing? If anyone has alternative proposals or insights, I would welcome a detailed discussion.
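To make the proposal concrete, here is a minimal sketch of the re-reporting logic, written in Python only for illustration. It assumes a periodic health check that probes each GPU and re-sends the full device list whenever any device changes state, similar to how a Kubernetes device plugin can push an updated list with `Healthy`/`Unhealthy` entries over `ListAndWatch`. The names `HealthMonitor`, `probe`, and `report` are hypothetical and do not come from the current plugin code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

HEALTHY = "Healthy"
UNHEALTHY = "Unhealthy"


@dataclass(frozen=True)
class Device:
    id: str
    health: str


class HealthMonitor:
    """Hypothetical helper: tracks GPU health and re-reports on change."""

    def __init__(
        self,
        device_ids: List[str],
        probe: Callable[[str], bool],            # assumed: True if the GPU is usable
        report: Callable[[List[Device]], None],  # assumed: pushes the list to the kubelet
    ) -> None:
        self._probe = probe
        self._report = report
        # Every device starts as healthy, matching the report made at startup.
        self._state: Dict[str, str] = {d: HEALTHY for d in device_ids}

    def check_once(self) -> bool:
        """Probe every GPU; re-send the full list only if something changed."""
        changed = False
        for dev_id in self._state:
            new_health = HEALTHY if self._probe(dev_id) else UNHEALTHY
            if new_health != self._state[dev_id]:
                self._state[dev_id] = new_health
                changed = True
        if changed:
            self._report([Device(i, h) for i, h in self._state.items()])
        return changed
```

In a real plugin, `check_once` would run on a timer or be driven by driver events, and `probe` would map the known anomalies (card dropout, draining state) to an unhealthy result. Re-sending only on change avoids flooding the kubelet with identical updates.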
Additional context
No response