High GPU power consumption with latest version (4.1.1-4.0.4)
What is the version?
4.1.1-4.0.4
What happened?
I observe very high static power consumption on our A100 GPUs with the latest Docker image. The exporter seems to increase power consumption by about 50W on all GPUs (MIG-enabled) compared to the previous image. Note that nothing is running on the accelerators. When MIG is disabled, the increase is only ~5W (not shown below).
user@server:~$ nvidia-smi
Wed Mar 19 17:25:02 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:06:00.0 Off | On |
| N/A 31C P0 44W / 400W | 1MiB / 40960MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-SXM4-40GB On | 00000000:46:00.0 Off | On |
| N/A 34C P0 45W / 400W | 1MiB / 40960MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100-SXM4-40GB On | 00000000:85:00.0 Off | On |
| N/A 31C P0 43W / 400W | 1MiB / 40960MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A100-SXM4-40GB On | 00000000:C7:00.0 Off | On |
| N/A 35C P0 50W / 400W | 1MiB / 40960MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| No MIG devices found |
+-----------------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
user@server:~$ # Idle power
user@server:~$ nvidia-smi --query-gpu=index,name,power.draw --format=csv
index, name, power.draw [W]
0, NVIDIA A100-SXM4-40GB, 44.17 W
1, NVIDIA A100-SXM4-40GB, 45.32 W
2, NVIDIA A100-SXM4-40GB, 43.43 W
3, NVIDIA A100-SXM4-40GB, 50.12 W
user@server:~$ # Power consumption with previous exporter
user@server:~$ docker run -d --gpus all --cap-add SYS_ADMIN --name dcgm-exporter --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:4.1.1-4.0.3-ubuntu22.04
e2b02c72702b96f517cf59676553e69a29787ca756e3594e5d3509c1f325e1f5
user@server:~$ nvidia-smi --query-gpu=index,name,power.draw --format=csv
index, name, power.draw [W]
0, NVIDIA A100-SXM4-40GB, 44.17 W
1, NVIDIA A100-SXM4-40GB, 44.99 W
2, NVIDIA A100-SXM4-40GB, 43.43 W
3, NVIDIA A100-SXM4-40GB, 49.90 W
user@server:~$ docker stop dcgm-exporter
dcgm-exporter
user@server:~$ # Power consumption with latest exporter
user@server:~$ docker run -d --gpus all --cap-add SYS_ADMIN --name dcgm-exporter --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:4.1.1-4.0.4-ubuntu22.04
5094b2dee565672a8dc0082259a056e06f44dc28382b6baa80298132c34a93ad
user@server:~$ nvidia-smi --query-gpu=index,name,power.draw --format=csv
index, name, power.draw [W]
0, NVIDIA A100-SXM4-40GB, 99.67 W
1, NVIDIA A100-SXM4-40GB, 96.19 W
2, NVIDIA A100-SXM4-40GB, 91.60 W
3, NVIDIA A100-SXM4-40GB, 105.81 W
What did you expect to happen?
I expected the same static power consumption as before (around 50W, not 100W) when nothing else is running on the accelerators.
What is the GPU model?
NVIDIA A100-SXM4-40GB
What is the environment?
Driver Version: 570.124.06
CUDA Version: 12.8
NVIDIA A100-SXM4-40GB
MIG enabled
How did you deploy the dcgm-exporter and what is the configuration?
Default command:
docker run -d --gpus all --cap-add SYS_ADMIN --name dcgm-exporter --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:4.1.1-4.0.3-ubuntu22.04
How to reproduce the issue?
docker run -d --gpus all --cap-add SYS_ADMIN --name dcgm-exporter --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:4.1.1-4.0.3-ubuntu22.04
nvidia-smi --query-gpu=index,name,power.draw --format=csv
docker stop dcgm-exporter
docker run -d --gpus all --cap-add SYS_ADMIN --name dcgm-exporter --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:4.1.1-4.0.4-ubuntu22.04
nvidia-smi --query-gpu=index,name,power.draw --format=csv
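To compare the two images with a single number, the per-GPU readings can be averaged. The following is a minimal sketch, not part of the original report: `avg_power` is a hypothetical helper name, and it assumes the query is run with `--format=csv,noheader,nounits` so that the third column contains only the number.

```shell
#!/bin/sh
# Hypothetical helper: read lines such as
#   0, NVIDIA A100-SXM4-40GB, 44.17
# on stdin and print the mean of the third column (power draw in W).
avg_power() {
    # Fields are separated by a comma followed by optional spaces.
    awk -F', *' '{ sum += $3; n++ } END { if (n) printf "%.2f\n", sum / n }'
}

# Example use on a machine with GPUs:
#   nvidia-smi --query-gpu=index,name,power.draw --format=csv,noheader,nounits | avg_power
```

Running it once per image (after letting the exporter settle for a few seconds) gives one idle figure per version to compare.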
Anything else we need to know?
N/A
This is expected. The difference between 4.0.3 and 4.0.4 is that the datacenter-gpu-manager-4-proprietary package was added to the container. On A100 and older GPUs, this package uses perfworks to gather DCP metrics, which puts a slight load on the GPU.
Thank you for your answer. Would you know why the load is more pronounced with MIG?
Gathering additional DCP metrics for all the MIG devices requires more queries.
Can a parameter disable this additional profiling?
You should be able to remove the DCP metrics from the watched metrics list to reduce the load. See https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#profiling-metrics and https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/default-counters.csv#L81
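One way to try this is sketched below, under some assumptions: that the DCP (profiling) fields are the ones whose names start with `DCGM_FI_PROF_`, that a copy of the counters file is available locally, and that the exporter is pointed at the filtered copy with its `-f` option. The helper name `strip_dcp_metrics` and the file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: copy a dcgm-exporter counters CSV while dropping
# every active profiling (DCP) field, i.e. lines starting with DCGM_FI_PROF_.
strip_dcp_metrics() {
    # $1: source counters CSV, $2: destination CSV
    # -v inverts the match; leading whitespace before the field name is tolerated.
    grep -v -E '^[[:space:]]*DCGM_FI_PROF_' "$1" > "$2"
}

# Example use (paths and file names are assumptions):
#   strip_dcp_metrics default-counters.csv counters-no-dcp.csv
#   docker run -d --gpus all --cap-add SYS_ADMIN --name dcgm-exporter --rm -p 9400:9400 \
#     -v "$PWD/counters-no-dcp.csv:/etc/dcgm-exporter/counters-no-dcp.csv" \
#     nvcr.io/nvidia/k8s/dcgm-exporter:4.1.1-4.0.4-ubuntu22.04 \
#     -f /etc/dcgm-exporter/counters-no-dcp.csv
```

Checking `nvidia-smi --query-gpu=index,name,power.draw --format=csv` afterwards should show whether the idle power returns to the earlier level on a given setup.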