
dcgm-exporter collects metrics incorrectly?


Environment

● Kubernetes: 1.20.11
● OS: CentOS 7 (3.10.0-1160.15.2.el7.x86_64)
● Docker: 19.03.15
● NVIDIA Driver Version: 510.47.03
● DCGM Exporter Docker Image: nvcr.io/nvidia/k8s/dcgm-exporter:2.3.5-2.6.5-ubuntu20.04

Issue description

No process is using the GPUs on my node; here is the output of nvidia-smi:

# nvidia-smi

Tue Apr 26 15:29:49 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:54:00.0 Off |                    0 |
| N/A   33C    P0    67W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:5A:00.0 Off |                    0 |
| N/A   32C    P0    65W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:6B:00.0 Off |                    0 |
| N/A   32C    P0    64W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:70:00.0 Off |                    0 |
| N/A   34C    P0    71W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:BE:00.0 Off |                    0 |
| N/A   33C    P0    64W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:C3:00.0 Off |                    0 |
| N/A   30C    P0    63W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:DA:00.0 Off |                    0 |
| N/A   32C    P0    63W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:E0:00.0 Off |                    0 |
| N/A   33C    P0    66W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

But I found that the metric DCGM_FI_DEV_FB_USED reports 850 MiB for every GPU:

# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-14cdc0a4-f52f-a50c-f758-eb93e013e555",device="nvidia6",gpu="6",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-4eb035e5-2709-daa2-e3a1-5d2d8da60610",device="nvidia1",gpu="1",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-6fdb3b1a-3e7f-abc1-3e6a-7378a3cb2778",device="nvidia3",gpu="3",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-8ded7a54-6143-7898-563a-eb598623d740",device="nvidia0",gpu="0",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-a4f2126d-7321-144c-aa01-cdb6e7c8022a",device="nvidia7",gpu="7",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-bfbd4faf-d197-bd20-e054-2735bdd0c49e",device="nvidia4",gpu="4",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-c4d991e3-a1a7-0966-056b-a26d078c2f67",device="nvidia5",gpu="5",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-f7b1a978-75f0-a263-4206-8d7cd53641e0",device="nvidia2",gpu="2",modelName="NVIDIA A100-SXM4-80GB"} 850
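
(For reference, lines like these can be pulled directly from the exporter's /metrics endpoint. A minimal sketch, assuming the exporter answers on localhost:9400, dcgm-exporter's default port, for example after a kubectl port-forward to the exporter pod:)

# Scrape dcgm-exporter and print only the framebuffer-used metric samples.
# Assumes the exporter is reachable on localhost:9400 (the dcgm-exporter default).
import urllib.request

with urllib.request.urlopen("http://localhost:9400/metrics") as resp:
    for line in resp.read().decode().splitlines():
        if line.startswith("DCGM_FI_DEV_FB_USED"):
            print(line)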

At the same time, I queried the used GPU memory through NVML and got 0, so why does dcgm-exporter report 850 MiB? I also tested M40, P100, P4, T4, V100, and A10 GPUs with driver 510.47.03, and the value of DCGM_FI_DEV_FB_USED is not 0 even when no process is using the GPUs. Is this a bug in DCGM, or a bug in NVIDIA driver 510.47.03?
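
(A minimal sketch of this kind of NVML cross-check, using the nvidia-ml-py (pynvml) bindings; the exact query used originally was not shared, so this is only illustrative:)

# Cross-check used GPU memory and running compute processes via NVML,
# the same library nvidia-smi uses. Requires `pip install nvidia-ml-py`.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)  # v1 struct: total / free / used
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        print(f"GPU {i}: used={mem.used // (1024 * 1024)} MiB, "
              f"compute processes={len(procs)}")
finally:
    pynvml.nvmlShutdown()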

happy2048 · May 5, 2022

@happy2048 Thanks for reporting this. I am working with the DCGM team to get this analyzed.

shivamerla · May 5, 2022

This is caused by a change in driver 510 that lumps the reserved memory into the used category. We are updating DCGM to handle this case and split the used and reserved into separate fields.

glowkey · May 5, 2022
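
(The split described above is also visible directly through NVML: driver 510 introduced a v2 memory query that reports reserved memory separately from used. A sketch, assuming nvidia-ml-py >= 11.510, whose nvmlDeviceGetMemoryInfo accepts a version argument; older bindings only return the v1 total/free/used struct:)

# Query the v2 NVML memory struct, which carries a separate `reserved`
# field alongside total/free/used (driver 510+; assumes nvidia-ml-py >= 11.510).
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle, version=pynvml.nvmlMemory_v2)
    print(f"total={mem.total} B, reserved={mem.reserved} B, "
          f"used={mem.used} B, free={mem.free} B")
finally:
    pynvml.nvmlShutdown()

(Newer DCGM releases likewise expose the reserved portion as a separate field, DCGM_FI_DEV_FB_RESERVED, so DCGM_FI_DEV_FB_USED no longer includes it.)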

> This is caused by a change in driver 510 that lumps the reserved memory into the used category. We are updating DCGM to handle this case and split the used and reserved into separate fields.

@glowkey is there any timeline to release that fix in DCGM?

wsxiaozhang · May 12, 2022

The timeline is for a DCGM 2.4-based dcgm-exporter by the end of May 2022.

glowkey · May 12, 2022

@glowkey I am assuming this bug was fixed in DCGM / dcgm-exporter. Can we close this issue?

cdesiniotis · Jan 31, 2024

Yes, this is fixed!

glowkey · Jan 31, 2024