How to configure dcgm metrics for MIG?

Open laszlocph opened this issue 1 year ago • 1 comments

Hello,

I am kinda in a rabbit hole:

  • DCGM_FI_DEV_GPU_UTIL is not supported for MIG devices https://github.com/NVIDIA/DCGM/issues/80#issuecomment-1537603016

  • DCGM_FI_PROF_SM_OCCUPANCY could be a substitute, but it is disabled by default, as seen with: kubectl exec -it nvidia-dcgm-exporter-rh46x -- cat /etc/dcgm-exporter/dcp-metrics-included.csv | less

  • To enable DCGM_FI_PROF_* I found this issue, but the referenced piece of documentation is gone: https://github.com/NVIDIA/gpu-operator/issues/275#issuecomment-1323552018

Has anybody managed to monitor MIG devices' memory utilization? Has anybody managed to configure custom metrics for dcgm-exporter?
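
To make the second question concrete, here is a minimal Python sketch of what enabling a disabled field could look like. It assumes the metrics CSV uses a three-column layout (DCGM field, Prometheus metric type, help text) and that disabled fields appear as lines commented out with '#'. The sample content and the function names enable_fields and active_fields are made up for illustration; they are not taken from the real dcp-metrics-included.csv or from any NVIDIA tooling.

```python
import re

# Hypothetical sample in the assumed layout:
# DCGM field, Prometheus metric type, help text. '#' starts a comment.
SAMPLE_CSV = """\
# Format: DCGM FIELD, Prometheus metric type, help message
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used.
# DCGM_FI_PROF_SM_OCCUPANCY, gauge, Ratio of warps resident on an SM.
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active.
"""

# Matches a commented-out metric line such as "# DCGM_FI_PROF_X, gauge, ..."
COMMENTED_FIELD = re.compile(r"^\s*#\s*(DCGM_FI_[A-Z0-9_]+)\s*,")


def enable_fields(csv_text, fields):
    """Return csv_text with the requested commented-out fields uncommented."""
    wanted = set(fields)
    out = []
    for line in csv_text.splitlines():
        match = COMMENTED_FIELD.match(line)
        if match and match.group(1) in wanted:
            # Drop the leading '#' and the whitespace around it.
            line = line.lstrip().lstrip("#").lstrip()
        out.append(line)
    return "\n".join(out) + "\n"


def active_fields(csv_text):
    """List the field names on lines that are neither blank nor comments."""
    names = []
    for line in csv_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        names.append(stripped.split(",", 1)[0].strip())
    return names


if __name__ == "__main__":
    patched = enable_fields(SAMPLE_CSV, ["DCGM_FI_PROF_SM_OCCUPANCY"])
    print(active_fields(patched))
```

This only rewrites the CSV text. How the patched file is then handed to the exporter pod is the part I have not figured out and is not covered by the sketch.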

Thank you.

laszlocph avatar Jun 28 '24 07:06 laszlocph

@laszlocph I ran into the same issues and raised an issue with the DCGM exporter: https://github.com/NVIDIA/dcgm-exporter/issues/353

frittentheke avatar Jul 05 '24 13:07 frittentheke

@laszlocph maybe you'd like to take a look at my dashboard PR: https://github.com/NVIDIA/dcgm-exporter/pull/355. Depending on the result of https://github.com/NVIDIA/dcgm-exporter/issues/353 (de-duplication of metrics due to e.g. MIG), I might be able to remove some of the aggregations again.

There is now also a dedicated issue about cleaning up the labels: https://github.com/NVIDIA/dcgm-exporter/issues/356

frittentheke avatar Jul 13 '24 08:07 frittentheke

@frittentheke Amazing. This did it for us. Now we have utilization metrics for MIG partitions 🥳

laszlocph avatar Jul 18 '24 11:07 laszlocph