dcgm-exporter collects metrics incorrectly?
Environment
● Kubernetes: 1.20.11
● OS: CentOS 7 (3.10.0-1160.15.2.el7.x86_64)
● Docker: 19.03.15
● NVIDIA Driver Version: 510.47.03
● DCGM Exporter Docker Image: nvcr.io/nvidia/k8s/dcgm-exporter:2.3.5-2.6.5-ubuntu20.04
Issue description
There is no process using the GPUs on my node. Output of nvidia-smi:
# nvidia-smi
Tue Apr 26 15:29:49 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:54:00.0 Off | 0 |
| N/A 33C P0 67W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:5A:00.0 Off | 0 |
| N/A 32C P0 65W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:6B:00.0 Off | 0 |
| N/A 32C P0 64W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:70:00.0 Off | 0 |
| N/A 34C P0 71W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... On | 00000000:BE:00.0 Off | 0 |
| N/A 33C P0 64W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... On | 00000000:C3:00.0 Off | 0 |
| N/A 30C P0 63W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... On | 00000000:DA:00.0 Off | 0 |
| N/A 32C P0 63W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... On | 00000000:E0:00.0 Off | 0 |
| N/A 33C P0 66W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
But I found that the metric DCGM_FI_DEV_FB_USED reports 850 MiB:
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-14cdc0a4-f52f-a50c-f758-eb93e013e555",device="nvidia6",gpu="6",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-4eb035e5-2709-daa2-e3a1-5d2d8da60610",device="nvidia1",gpu="1",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-6fdb3b1a-3e7f-abc1-3e6a-7378a3cb2778",device="nvidia3",gpu="3",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-8ded7a54-6143-7898-563a-eb598623d740",device="nvidia0",gpu="0",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-a4f2126d-7321-144c-aa01-cdb6e7c8022a",device="nvidia7",gpu="7",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-bfbd4faf-d197-bd20-e054-2735bdd0c49e",device="nvidia4",gpu="4",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-c4d991e3-a1a7-0966-056b-a26d078c2f67",device="nvidia5",gpu="5",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-f7b1a978-75f0-a263-4206-8d7cd53641e0",device="nvidia2",gpu="2",modelName="NVIDIA A100-SXM4-80GB"} 850
At the same time, I queried the used GPU memory through NVML and got 0, so why does dcgm-exporter report 850 MiB? I also tested M40, P100, P4, T4, V100, and A10 GPUs with driver 510.47.03, and the value of DCGM_FI_DEV_FB_USED is not 0 even when no process is using the GPUs. Is this a bug in DCGM, or a bug in NVIDIA driver 510.47.03?
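For reference, the NVML check I describe above was along these lines (a minimal sketch using the pynvml / nvidia-ml-py bindings; NVML reports values in bytes). On this node it prints 0 MiB used for every GPU:

# Minimal sketch of the NVML query described above (pynvml / nvidia-ml-py assumed).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)  # bytes
        print(f"GPU {i}: used={mem.used / 2**20:.0f} MiB, "
              f"free={mem.free / 2**20:.0f} MiB, total={mem.total / 2**20:.0f} MiB")
finally:
    pynvml.nvmlShutdown()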
@happy2048 Thanks for reporting this. I am working with the DCGM team to get this analyzed.
This is caused by a change in driver 510 that lumps the reserved memory into the used category. We are updating DCGM to handle this case and split the used and reserved into separate fields.
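For anyone who wants to confirm this on their own nodes: the v2 NVML memory query reports the reserved pool separately from used memory, whereas the v1 query on driver 510 folds reserved into used, which is where the ~850 MiB on an idle A100 comes from. A rough sketch, assuming a recent nvidia-ml-py build that exposes nvmlMemory_v2:

import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU index 0 picked arbitrarily
    # Requesting the v2 struct returns a separate `reserved` field; availability
    # depends on the bindings and driver (assumption: recent nvidia-ml-py, driver 510+).
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle, version=pynvml.nvmlMemory_v2)
    print(f"used={mem.used / 2**20:.0f} MiB, reserved={mem.reserved / 2**20:.0f} MiB, "
          f"free={mem.free / 2**20:.0f} MiB, total={mem.total / 2**20:.0f} MiB")
finally:
    pynvml.nvmlShutdown()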
@glowkey Is there any timeline for releasing that fix in DCGM?
The timeline is a DCGM 2.4-based dcgm-exporter by the end of May 2022.
@glowkey I am assuming this bug was fixed in DCGM / dcgm-exporter. Can we close this issue?
Yes, this is fixed!