dcgm-exporter
Add a health status metric for every GPU card
Ask your question
I'm curious: why is there no health status metric for each GPU card?
I checked NVIDIA/go-dcgm and it has a function, HealthCheckByGpuId(gpuId uint): https://github.com/NVIDIA/go-dcgm/blob/main/pkg/dcgm/api.go#L102-L105 . Could dcgm-exporter use it to get the health status of every card and export that as a health metric?