
NVIDIA DCGM Exporter Dashboard does not work in vGPU cluster

Open Levi080513 opened this issue 7 months ago • 4 comments

Currently the dashboard uses DCGM_FI_DEV_GPU_TEMP to obtain the instance/GPU list, but this metric is not collected in vGPU clusters, so the list comes back empty and the dashboard does not display properly.

https://github.com/NVIDIA/dcgm-exporter/blob/30d4ddcae9c7153c31dd35301aa4a1f3b90a2096/grafana/dcgm-exporter-dashboard.json#L784

https://github.com/NVIDIA/dcgm-exporter/blob/30d4ddcae9c7153c31dd35301aa4a1f3b90a2096/grafana/dcgm-exporter-dashboard.json#L761

Levi080513 avatar Jan 22 '24 05:01 Levi080513

Can you try to use other metrics available on your vGPU?

nvvfedorov avatar Jan 25 '24 19:01 nvvfedorov

The DCGM_FI_DEV_GPU_UTIL metric works well.

Levi080513 avatar Jan 26 '24 02:01 Levi080513

Can I submit a PR to fix it?

Levi080513 avatar Jan 26 '24 02:01 Levi080513

@Levi080513, sure, you can submit PRs; we appreciate community contributions.

nvvfedorov avatar Jan 26 '24 14:01 nvvfedorov
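
For readers who want to try the workaround discussed above before a fix lands, here is a minimal Python sketch that swaps DCGM_FI_DEV_GPU_TEMP for DCGM_FI_DEV_GPU_UTIL in a dashboard's template variable queries only, leaving panels alone. It assumes the usual Grafana dashboard JSON layout (`templating.list[*].query`, stored either as a string or as an object with a `query` field); the `swap_variable_metric` helper and the `sample` dashboard are hypothetical and are not taken from the dcgm-exporter repository.

```python
import copy
import json

OLD_METRIC = "DCGM_FI_DEV_GPU_TEMP"   # not collected on vGPU clusters (per this issue)
NEW_METRIC = "DCGM_FI_DEV_GPU_UTIL"   # reported in this thread to work on vGPU


def swap_variable_metric(dashboard, old=OLD_METRIC, new=NEW_METRIC):
    """Return a copy of `dashboard` with `old` replaced by `new` in variable queries.

    Only entries under templating.list are touched; panels keep their metrics.
    """
    patched = copy.deepcopy(dashboard)
    for var in patched.get("templating", {}).get("list", []):
        query = var.get("query")
        if isinstance(query, str):
            # Older dashboards store the variable query as a plain string.
            var["query"] = query.replace(old, new)
        elif isinstance(query, dict) and isinstance(query.get("query"), str):
            # Newer dashboards wrap the query string in an object.
            query["query"] = query["query"].replace(old, new)
        definition = var.get("definition")
        if isinstance(definition, str):
            var["definition"] = definition.replace(old, new)
    return patched


# Hypothetical, trimmed-down dashboard used only to demonstrate the helper.
sample = {
    "templating": {
        "list": [
            {"name": "instance",
             "query": "label_values(DCGM_FI_DEV_GPU_TEMP, instance)"},
            {"name": "gpu",
             "query": {"query": "label_values(DCGM_FI_DEV_GPU_TEMP, gpu)"}},
        ]
    },
    "panels": [
        {"title": "GPU Temperature",
         "targets": [{"expr": "DCGM_FI_DEV_GPU_TEMP"}]},
    ],
}

patched = swap_variable_metric(sample)
print(json.dumps(patched["templating"]["list"], indent=2))
```

To apply this to a real file, load it with `json.load`, pass the result through the helper, and write it back with `json.dump`. Panels that plot temperature directly will still show no data on vGPU, since the underlying metric is absent there.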