
High GPU power consumption with latest version (4.1.1-4.0.4)

Open jacquetpi opened this issue 9 months ago • 5 comments

What is the version?

4.1.1-4.0.4

What happened?

I observe very high static power consumption on our A100 GPUs with the latest Docker image. The exporter seems to increase power consumption by about 50W on every GPU (MIG enabled) compared to the previous image. Note that nothing is running on the accelerators. When MIG is disabled, the increase is only ~5W (not shown below).

user@server:~$ nvidia-smi
Wed Mar 19 17:25:02 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |                                                                                                                          
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:06:00.0 Off |                   On |                                                                                                                          
| N/A   31C    P0             44W /  400W |       1MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  |   00000000:46:00.0 Off |                   On |
| N/A   34C    P0             45W /  400W |       1MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |                                                                                                                          
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  |   00000000:85:00.0 Off |                   On |                                                                                                                          
| N/A   31C    P0             43W /  400W |       1MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |                                                                                                                          
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  |   00000000:C7:00.0 Off |                   On |
| N/A   35C    P0             50W /  400W |       1MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
                                                     
+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|                                                                                                                          
|  No MIG devices found                                                                   |
+-----------------------------------------------------------------------------------------+
                                                                                                                                                                                                                     
+-----------------------------------------------------------------------------------------+    
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
user@server:~$ # Idle power
user@server:~$ nvidia-smi --query-gpu=index,name,power.draw --format=csv             
index, name, power.draw [W]
0, NVIDIA A100-SXM4-40GB, 44.17 W        
1, NVIDIA A100-SXM4-40GB, 45.32 W                                                                         
2, NVIDIA A100-SXM4-40GB, 43.43 W
3, NVIDIA A100-SXM4-40GB, 50.12 W        
user@server:~$ # Power consumption with previous exporter
user@server:~$ docker run -d --gpus all --cap-add SYS_ADMIN --name dcgm-exporter --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:4.1.1-4.0.3-ubuntu22.04
e2b02c72702b96f517cf59676553e69a29787ca756e3594e5d3509c1f325e1f5
user@server:~$ nvidia-smi --query-gpu=index,name,power.draw --format=csv             
index, name, power.draw [W]                                                                               
0, NVIDIA A100-SXM4-40GB, 44.17 W
1, NVIDIA A100-SXM4-40GB, 44.99 W        
2, NVIDIA A100-SXM4-40GB, 43.43 W
3, NVIDIA A100-SXM4-40GB, 49.90 W                                                                         
user@server:~$ docker stop dcgm-exporter                                                                                                                                                       
dcgm-exporter                    
user@server:~$ # Power consumption with latest exporter                                                                         
user@server:~$ docker run -d --gpus all --cap-add SYS_ADMIN --name dcgm-exporter --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:4.1.1-4.0.4-ubuntu22.04                                    
5094b2dee565672a8dc0082259a056e06f44dc28382b6baa80298132c34a93ad                           
user@server:~$ nvidia-smi --query-gpu=index,name,power.draw --format=csv                                                                                                                       
index, name, power.draw [W]                                                                               
0, NVIDIA A100-SXM4-40GB, 99.67 W                                                                         
1, NVIDIA A100-SXM4-40GB, 96.19 W                                                                         
2, NVIDIA A100-SXM4-40GB, 91.60 W                                                                         
3, NVIDIA A100-SXM4-40GB, 105.81 W     

What did you expect to happen?

To get the same static power consumption as before (around 50W and not 100W) when nothing else is running on the accelerators
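The size of the increase can be read off the two snapshots in the terminal log above. The following is a minimal sketch, not part of the original report: it copies the reported idle and 4.0.4 readings into two files (the names `idle.csv` and `exporter.csv` are illustrative) and subtracts them per GPU index with `awk`. On a live system, the here-documents would be replaced by saved output of the same `nvidia-smi --query-gpu=index,name,power.draw --format=csv` command.

```shell
#!/bin/sh
# Per-GPU power delta between two nvidia-smi CSV snapshots.
# Values below are copied from the report; file names are illustrative.
workdir=$(mktemp -d)

cat > "$workdir/idle.csv" <<'EOF'
index, name, power.draw [W]
0, NVIDIA A100-SXM4-40GB, 44.17 W
1, NVIDIA A100-SXM4-40GB, 45.32 W
2, NVIDIA A100-SXM4-40GB, 43.43 W
3, NVIDIA A100-SXM4-40GB, 50.12 W
EOF

cat > "$workdir/exporter.csv" <<'EOF'
index, name, power.draw [W]
0, NVIDIA A100-SXM4-40GB, 99.67 W
1, NVIDIA A100-SXM4-40GB, 96.19 W
2, NVIDIA A100-SXM4-40GB, 91.60 W
3, NVIDIA A100-SXM4-40GB, 105.81 W
EOF

# First file: remember the baseline per GPU index.
# Second file: print the difference against that baseline.
deltas=$(awk -F', ' '
    FNR == 1  { next }                      # skip the CSV header of each file
    NR == FNR { base[$1] = $3 + 0; next }   # "44.17 W" + 0 evaluates to 44.17
    { printf "GPU %s: +%.2f W\n", $1, ($3 + 0) - base[$1] }
' "$workdir/idle.csv" "$workdir/exporter.csv")

printf '%s\n' "$deltas"
```

With the reported values, this gives an increase of roughly 48 to 56 W per GPU, which matches the "around 50W" figure stated above. A single `nvidia-smi` sample can be noisy, so averaging several samples would give a steadier comparison.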

What is the GPU model?

NVIDIA A100-SXM4-40GB

What is the environment?

Driver Version: 570.124.06
CUDA Version: 12.8
NVIDIA A100-SXM4-40GB, MIG enabled

How did you deploy the dcgm-exporter and what is the configuration?

Default command:

docker run -d --gpus all --cap-add SYS_ADMIN --name dcgm-exporter --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:4.1.1-4.0.3-ubuntu22.04

How to reproduce the issue?

docker run -d --gpus all --cap-add SYS_ADMIN --name dcgm-exporter --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:4.1.1-4.0.3-ubuntu22.04
nvidia-smi --query-gpu=index,name,power.draw --format=csv
docker stop dcgm-exporter
docker run -d --gpus all --cap-add SYS_ADMIN --name dcgm-exporter --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:4.1.1-4.0.4-ubuntu22.04
nvidia-smi --query-gpu=index,name,power.draw --format=csv

Anything else we need to know?

N/A

jacquetpi avatar Mar 19 '25 16:03 jacquetpi

This is expected. The difference between 4.0.3 and 4.0.4 is that the datacenter-gpu-manager-4-proprietary package is now included in the container. On A100 and older GPUs, this package uses perfworks to gather DCP metrics, which puts a slight load on the GPU.

glowkey avatar Mar 19 '25 20:03 glowkey

Thank you for your answer. Would you know why the load is more pronounced with MIG?

jacquetpi avatar Mar 19 '25 23:03 jacquetpi

Gathering additional DCP metrics for all the MIG devices requires more queries.

glowkey avatar Mar 19 '25 23:03 glowkey

Can a parameter disable this additional profiling?

jacquetpi avatar Mar 20 '25 13:03 jacquetpi

You should be able to remove the DCP metrics from the watched metrics list to reduce the load. See https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#profiling-metrics and https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/default-counters.csv#L81

glowkey avatar Mar 20 '25 19:03 glowkey
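
The suggestion above could be sketched as follows. This is an illustration under stated assumptions, not a verified fix: the sample counters file is a hypothetical excerpt standing in for the image's default-counters.csv, the output name `no-dcp-counters.csv` is made up, and whether dropping these fields brings power draw fully back to the 4.0.3 level has not been confirmed in this thread. It relies on DCGM profiling fields being named with the `DCGM_FI_PROF_` prefix and on the exporter's `-f` option for selecting a counters file.

```shell
#!/bin/sh
# Sketch: build a counters file without the profiling (DCP) fields.
workdir=$(mktemp -d)

# Hypothetical excerpt in the same "field, type, help" layout as the
# default counters file; the real file lists many more fields.
cat > "$workdir/default-counters.csv" <<'EOF'
# Format: DCGM field, Prometheus metric type, help message
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor pipe is active.
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the memory interface is active.
EOF

# Keep every line except those that enable a DCGM_FI_PROF_* field.
grep -v '^[[:space:]]*DCGM_FI_PROF_' "$workdir/default-counters.csv" \
    > "$workdir/no-dcp-counters.csv"

cat "$workdir/no-dcp-counters.csv"

# Then mount the filtered file and point the exporter at it (not run here;
# the container-side path is an assumption):
# docker run -d --gpus all --cap-add SYS_ADMIN --name dcgm-exporter --rm -p 9400:9400 \
#   -v "$workdir/no-dcp-counters.csv:/etc/dcgm-exporter/no-dcp-counters.csv:ro" \
#   nvcr.io/nvidia/k8s/dcgm-exporter:4.1.1-4.0.4-ubuntu22.04 \
#   -f /etc/dcgm-exporter/no-dcp-counters.csv
```

After restarting the exporter this way, rerunning `nvidia-smi --query-gpu=index,name,power.draw --format=csv` would show whether idle power returns to the earlier level.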