dcgm-exporter

Add a health status metric for every GPU card

Open · lx1036 opened this issue 5 months ago · 1 comment

Ask your question

I'm curious, why aren't there any health status metrics for every GPU card?

I checked NVIDIA/go-dcgm and it has a function, HealthCheckByGpuId(gpuId uint): https://github.com/NVIDIA/go-dcgm/blob/main/pkg/dcgm/api.go#L102-L105 . Could we use it to get the health status of every card and export it as a health metric?
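To make the request concrete, here is a rough sketch of what I have in mind. It does not call go-dcgm directly: `healthChecker` is a hypothetical stand-in for a per-GPU query such as HealthCheckByGpuId, and the metric name `gpu_health_status`, the status strings, and the numeric mapping are all my own assumptions, not anything dcgm-exporter defines today.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// healthChecker is a hypothetical stand-in for a per-GPU health query
// (for example a thin wrapper around go-dcgm's HealthCheckByGpuId).
// It returns a status string for one GPU.
type healthChecker func(gpuID uint) (status string, err error)

// errUnavailable is an illustrative error used by the fake checker below.
var errUnavailable = errors.New("health check unavailable")

// metricHeader describes the hypothetical gauge in Prometheus text format.
const metricHeader = "# HELP gpu_health_status GPU health (0=healthy, 1=warning, 2=failure, -1=unknown).\n" +
	"# TYPE gpu_health_status gauge\n"

// statusValue maps a status string to a gauge value.
// The strings and numbers here are assumptions made for this sketch.
func statusValue(status string) int {
	switch status {
	case "Healthy":
		return 0
	case "Warning":
		return 1
	case "Failure":
		return 2
	default:
		return -1
	}
}

// renderHealthMetrics emits one gauge sample per GPU ID.
// A failed health check is reported as -1 instead of dropping the series.
func renderHealthMetrics(gpuIDs []uint, check healthChecker) string {
	var b strings.Builder
	b.WriteString(metricHeader)
	for _, id := range gpuIDs {
		value := -1
		if status, err := check(id); err == nil {
			value = statusValue(status)
		}
		fmt.Fprintf(&b, "gpu_health_status{gpu=\"%d\"} %d\n", id, value)
	}
	return b.String()
}

func main() {
	// Fake checker standing in for the real DCGM call.
	fake := func(id uint) (string, error) {
		switch id {
		case 0:
			return "Healthy", nil
		case 1:
			return "Failure", nil
		default:
			return "", errUnavailable
		}
	}
	fmt.Print(renderHealthMetrics([]uint{0, 1, 2}, fake))
}
```

Reporting a failed check as -1 keeps one series per card, so an alert can distinguish "unhealthy" from "could not be checked"; whether that fits the exporter's conventions is an open question.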

lx1036, Aug 30 '24 09:08