[dashboard] Rework dashboard (MIG support, Grafana deprecations, Hostname)

Open frittentheke opened this issue 1 year ago • 5 comments

After running into various issues with the dashboard (see #353), I started reworking the existing board. This PR combines all my cleanups and fixes. It also includes the changes from PR https://github.com/NVIDIA/dcgm-exporter/pull/240 by @Levi080513

  • Use PromQL aggregations to take MIG subdevices into account (see #353)
  • Update all panels to use Timeseries panels (instead of deprecated Graph)
  • Switch from instance to Hostname to select individual systems to avoid duplicated timeseries for Kubernetes daemonsets and their Pod names
  • Use DCGM_FI_DEV_FB_FREE instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (PR #240)
  • Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)
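
To illustrate the aggregation idea in the list above, here is a minimal Python sketch of what grouping per-MIG-instance samples by Hostname and gpu does, similar in spirit to a PromQL `avg by (Hostname, gpu) (...)` expression. The label names (Hostname, gpu, GPU_I_ID), the host names, and all sample values are assumptions made for illustration; the actual queries in the PR may aggregate differently.

```python
from collections import defaultdict

# Hypothetical samples of DCGM_FI_PROF_GR_ENGINE_ACTIVE (a 0..1 ratio).
# Label names (Hostname, gpu, GPU_I_ID) and all values are assumptions
# made for illustration, not data from a real system.
samples = [
    ({"Hostname": "node-a", "gpu": "0", "GPU_I_ID": "1"}, 0.80),
    ({"Hostname": "node-a", "gpu": "0", "GPU_I_ID": "2"}, 0.40),
    ({"Hostname": "node-a", "gpu": "1", "GPU_I_ID": "1"}, 0.10),
    ({"Hostname": "node-b", "gpu": "0", "GPU_I_ID": "1"}, 0.50),
]


def avg_by(samples, keys):
    """Average sample values grouped by the given label names,
    dropping all other labels (similar in spirit to PromQL `avg by`)."""
    groups = defaultdict(list)
    for labels, value in samples:
        groups[tuple(labels[k] for k in keys)].append(value)
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}


# One series per physical GPU per host, regardless of how many
# MIG instances the GPU is split into.
per_gpu = avg_by(samples, ["Hostname", "gpu"])
for (host, gpu), value in sorted(per_gpu.items()):
    print(f"{host} gpu={gpu} engine_active={value:.2f}")
```

Grouping on Hostname rather than on a per-Pod label means that a restarted exporter Pod with a new name would still fall into the same group, which is the motivation given for the Hostname switch.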

Fixes: #353, #236

frittentheke avatar Jul 08 '24 09:07 frittentheke

Is there any news on this PR in terms of merging?

SohamG avatar Feb 20 '25 20:02 SohamG

^^ @rohit-arora-dev @glowkey ?

frittentheke avatar Feb 21 '25 07:02 frittentheke

I just took a moment to test these changes on an 8 GPU system with MIG enabled and unfortunately the panels were empty. I'm far from a Grafana expert so it's hard for me to know what was going wrong. I did confirm that without the changes the panels displayed the expected data.

glowkey avatar Mar 05 '25 21:03 glowkey

> I just took a moment to test these changes on an 8 GPU system with MIG enabled and unfortunately the panels were empty. I'm far from a Grafana expert so it's hard for me to know what was going wrong. I did confirm that without the changes the panels displayed the expected data.

Thanks for looking at my PR, @glowkey. Some more details on which graph, and which PromQL query, does not work would be great.

frittentheke avatar Mar 06 '25 08:03 frittentheke

For GPU utilization, I think the old way may be better: the prof module is proprietary and only supported on a small number of GPUs / configurations.

The other changes seem great though

See https://github.com/NVIDIA/dcgm-exporter/issues/380: it looks like most of the Prometheus metrics are not reliable enough to be used as the sole source of labels in all the different situations the exporter is used in.

kristiangronas avatar Mar 06 '25 18:03 kristiangronas

Thanks @frittentheke! I've tried it and it is definitely a big improvement. I can see all MIG subdevices, and changing to hostnames is much more intuitive.

ermitovski avatar Jul 19 '25 15:07 ermitovski