DCGM
DCGM_FI_PROF_GR_ENGINE_ACTIVE and MIG
Hi,
I'm seeing some strange behavior of the DCGM_FI_PROF_GR_ENGINE_ACTIVE metric with MIG instances. Namely, the maximum values vary by instance type and don't seem to make sense.
Here are the maximum values we see for the various MIG instance types on A100 80GB cards:
- 1g.10gb - 100%
- 2g.20gb - 50%
- 3g.40gb - 33%
- No-MIG - 100%
How is this metric meant to work with MIG?
Could you provide the nvidia-smi output for the 2g.20gb and 3g.40gb MIG configurations? Such GR_ACTIVE utilization may happen if you create Compute Instances that do not occupy the whole MIG instance. GR_ACTIVE is normalized to the full potential of the created MIG instance (a compute instance may occupy the entire MIG instance or just one GPC).
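To put rough numbers on that (purely an illustration, assuming a single compute instance that occupies one GPC): a 2g.20gb instance has two GPCs, so a fully busy 1-GPC compute instance would report about 1/2 = 50%, and a 3g.40gb instance has three GPCs, so the same workload would report about 1/3 ≈ 33%, which matches the maxima listed above.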
Here are examples for 2g.20gb and 3g.40gb instances:
nvidia-smi
Mon Dec 4 21:04:13 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:49:00.0 Off | On |
| N/A 29C P0 85W / 400W | N/A | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 3 0 0 | 13MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Mon Dec 4 21:01:53 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:4F:00.0 Off | On |
| N/A 33C P0 84W / 400W | N/A | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 2 0 0 | 19MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
At a quick glance, those values make sense if only one GPC is being used, as @nikkon-dev said.
@neggert can you please attach the output from nvidia-smi mig -lci and nvidia-smi mig -lgi?
You can also try creating more compute instances within the larger GPU instances to saturate them. That should get you to 100%.
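For example, something along these lines (a sketch only, not verified on your system; <GI_ID> is the GPU instance ID reported by nvidia-smi mig -lgi, and the exact compute-instance profile names should be taken from the -lcip listing):
sudo nvidia-smi mig -lcip -gi <GI_ID>                                    # list the CI profiles available in that GI
sudo nvidia-smi mig -dci -gi <GI_ID>                                     # remove the existing compute instance(s)
sudo nvidia-smi mig -cci 1c.3g.40gb,1c.3g.40gb,1c.3g.40gb -gi <GI_ID>    # e.g. three 1-GPC CIs in a 3g.40gb GI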
Had to log into the node to run the mig -lgi / -lci commands, since normal Kubernetes containers don't seem to have the necessary permissions.
sudo nvidia-smi mig -lgi
+-------------------------------------------------------+
| GPU instances: |
| GPU Name Profile Instance Placement |
| ID ID Start:Size |
|=======================================================|
| 0 MIG 1g.10gb 19 9 2:1 |
+-------------------------------------------------------+
| 0 MIG 1g.10gb 19 10 3:1 |
+-------------------------------------------------------+
| 0 MIG 2g.20gb 14 3 0:2 |
+-------------------------------------------------------+
| 0 MIG 3g.40gb 9 2 4:4 |
+-------------------------------------------------------+
| 1 MIG 1g.10gb 19 13 2:1 |
+-------------------------------------------------------+
| 1 MIG 1g.10gb 19 14 3:1 |
+-------------------------------------------------------+
| 1 MIG 2g.20gb 14 5 0:2 |
+-------------------------------------------------------+
| 1 MIG 3g.40gb 9 1 4:4 |
+-------------------------------------------------------+
| 2 MIG 1g.10gb 19 9 2:1 |
+-------------------------------------------------------+
| 2 MIG 1g.10gb 19 10 3:1 |
+-------------------------------------------------------+
| 2 MIG 2g.20gb 14 3 0:2 |
+-------------------------------------------------------+
| 2 MIG 3g.40gb 9 2 4:4 |
+-------------------------------------------------------+
| 3 MIG 1g.10gb 19 9 2:1 |
+-------------------------------------------------------+
| 3 MIG 1g.10gb 19 10 3:1 |
+-------------------------------------------------------+
| 3 MIG 2g.20gb 14 3 0:2 |
+-------------------------------------------------------+
| 3 MIG 3g.40gb 9 2 4:4 |
+-------------------------------------------------------+
| 4 MIG 1g.10gb 19 9 2:1 |
+-------------------------------------------------------+
| 4 MIG 1g.10gb 19 10 3:1 |
+-------------------------------------------------------+
| 4 MIG 2g.20gb 14 3 0:2 |
+-------------------------------------------------------+
| 4 MIG 3g.40gb 9 2 4:4 |
+-------------------------------------------------------+
| 5 MIG 1g.10gb 19 9 2:1 |
+-------------------------------------------------------+
| 5 MIG 1g.10gb 19 10 3:1 |
+-------------------------------------------------------+
| 5 MIG 2g.20gb 14 3 0:2 |
+-------------------------------------------------------+
| 5 MIG 3g.40gb 9 2 4:4 |
+-------------------------------------------------------+
sudo nvidia-smi mig -lci
+--------------------------------------------------------------------+
| Compute instances: |
| GPU GPU Name Profile Instance Placement |
| Instance ID ID Start:Size |
| ID |
|====================================================================|
| 0 9 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 10 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 3 MIG 2g.20gb 1 0 0:2 |
+--------------------------------------------------------------------+
| 0 2 MIG 3g.40gb 2 0 0:3 |
+--------------------------------------------------------------------+
| 1 13 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 1 14 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 1 5 MIG 2g.20gb 1 0 0:2 |
+--------------------------------------------------------------------+
| 1 1 MIG 3g.40gb 2 0 0:3 |
+--------------------------------------------------------------------+
| 2 9 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 2 10 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 2 3 MIG 2g.20gb 1 0 0:2 |
+--------------------------------------------------------------------+
| 2 2 MIG 3g.40gb 2 0 0:3 |
+--------------------------------------------------------------------+
| 3 9 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 3 10 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 3 3 MIG 2g.20gb 1 0 0:2 |
+--------------------------------------------------------------------+
| 3 2 MIG 3g.40gb 2 0 0:3 |
+--------------------------------------------------------------------+
| 4 9 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 4 10 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 4 3 MIG 2g.20gb 1 0 0:2 |
+--------------------------------------------------------------------+
| 4 2 MIG 3g.40gb 2 0 0:3 |
+--------------------------------------------------------------------+
| 5 9 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 5 10 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 5 3 MIG 2g.20gb 1 0 0:2 |
+--------------------------------------------------------------------+
| 5 2 MIG 3g.40gb 2 0 0:3 |
+--------------------------------------------------------------------+
Based on the configuration attached, I would expect GR_ENGINE_ACTIVE to reach 100% as long as the CIs are saturated.
Would you mind testing on this same machine to confirm you're still seeing the values in the first comment?
Also, how are you generating a workload?
These values are consistent across all 2g.20gb and 3g.40gb instances in a 6-node, 48-GPU Kubernetes cluster. The numbers in the original post were derived by querying Prometheus for the maximum value of this metric across all instances over a 7-day window.
This encompasses a variety of workloads, but I know for sure that there is some large batch-size LLM inference in there. This is a workload that achieves >90% utilization on a full A100, so it should have no trouble saturating the compute on a smaller MIG instance.
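For reference, a query of roughly this shape produces that kind of per-profile maximum (a sketch; the MIG-related label name is an assumption and depends on the dcgm-exporter version and configuration):
# Per-MIG-profile maximum of GR_ENGINE_ACTIVE over the last 7 days
max by (GPU_I_PROFILE) (max_over_time(DCGM_FI_PROF_GR_ENGINE_ACTIVE[7d]))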
For what it's worth, here's the mig-parted config that we're providing via the GPU Operator.
version: v1
mig-configs:
  "a100-80gb-x8-balanced":
    - devices: [0, 1, 2, 3, 4, 5]
      mig-enabled: true
      mig-devices:
        "1g.10gb": 2
        "2g.20gb": 1
        "3g.40gb": 1
    - devices: [6, 7]
      mig-enabled: false
@nvidia-aalsudani Any idea what's going on here? Do you need more information?
Hello @neggert, I'd like to suggest running the dcgmproftester12 tool from the DCGM package to create a synthetic load on the node (not in a pod) where MIG is enabled. I also have a hunch: does your CUDA application attempt to use more than one MIG instance? It's important to note that MIG does not behave like a physical GPU, and a CUDA application can only utilize the first MIG instance it detects. To run CUDA load on all MIG instances, the dcgmproftester tool uses a "fork bomb" approach: it forks one process per MIG instance and sets CUDA_VISIBLE_DEVICES in each child to just that single MIG instance.
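As a rough sketch of that per-instance pinning (the workload binary name and the MIG UUID placeholders are illustrative; the actual MIG-... UUIDs come from nvidia-smi -L):
nvidia-smi -L                                        # lists GPUs and their MIG device UUIDs (MIG-...)
CUDA_VISIBLE_DEVICES=MIG-<uuid-1> ./my_cuda_app &    # one process per MIG instance,
CUDA_VISIBLE_DEVICES=MIG-<uuid-2> ./my_cuda_app &    # each process seeing exactly one instance
wait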
@nikkon-dev Sorry for the slow response. It took us a while to free up a machine I could access bare metal on. I get the same results when running dcgmproftester12 in the host OS.
sudo dcgmproftester12 --no-dcgm-validation -t 1001 -d 600 -i 3 --target-max-value
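While such a test runs, the field can also be watched directly on the node, roughly like this (a sketch; field 1001 corresponds to DCGM_FI_PROF_GR_ENGINE_ACTIVE, the same field the tester targets with -t 1001):
dcgmi dmon -e 1001    # stream GR_ENGINE_ACTIVE samples from DCGM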