DCGM
DCGM_FI_PROF_GR_ENGINE_ACTIVE and MIG
Hi,
I'm seeing some strange behavior of the DCGM_FI_PROF_GR_ENGINE_ACTIVE metric with MIG instances. Namely, the maximum values vary by instance type and don't seem to make sense.
Here are the maximum values we see for the various MIG instance types on A100 80GB cards:
- 1g.10gb - 100%
- 2g.20gb - 50%
- 3g.40gb - 33%
- No-MIG - 100%
How is this metric meant to work with MIG?
Could you provide the nvidia-smi output for the 2g.20gb and 3g.40gb MIG configurations? Such GR_ACTIVE utilization may happen if you create Compute Instances that do not occupy the whole MIG instance. GR_ACTIVE is normalized to the full potential of the created MIG instance (a compute instance may occupy the entire MIG instance or just one GPC).
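To put rough numbers on that (purely an illustration, assuming a single compute instance that occupies one GPC): a 2g.20gb instance has two GPCs, so a fully busy 1-GPC compute instance would report about 1/2 = 50%, and a 3g.40gb instance has three GPCs, so the same workload would report about 1/3 ≈ 33%, which matches the maxima listed above.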
Here are examples for 2g.20gb and 3g.40gb instances:
nvidia-smi
Mon Dec 4 21:04:13 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:49:00.0 Off | On |
| N/A 29C P0 85W / 400W | N/A | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 3 0 0 | 13MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Mon Dec 4 21:01:53 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:4F:00.0 Off | On |
| N/A 33C P0 84W / 400W | N/A | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 2 0 0 | 19MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
At a quick glance, those values make sense if only one GPC is being used, as @nikkon-dev said.
@neggert can you please attach the output from nvidia-smi mig -lci and nvidia-smi mig -lgi?
You can also try creating more compute instances within the larger GPU instances to saturate them. That should get you to 100%.
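For example, something along these lines (a sketch only, not verified on your system; <GI_ID> is the GPU instance ID reported by nvidia-smi mig -lgi, and the exact compute-instance profile names should be taken from the -lcip listing):
sudo nvidia-smi mig -lcip -gi <GI_ID>                                    # list the CI profiles available in that GI
sudo nvidia-smi mig -dci -gi <GI_ID>                                     # remove the existing compute instance(s)
sudo nvidia-smi mig -cci 1c.3g.40gb,1c.3g.40gb,1c.3g.40gb -gi <GI_ID>    # e.g. three 1-GPC CIs in a 3g.40gb GI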
Had to log into the node to run the mig -lgi / -lci commands, since normal Kubernetes containers don't seem to have the necessary permissions.
sudo nvidia-smi mig -lgi
+-------------------------------------------------------+
| GPU instances: |
| GPU Name Profile Instance Placement |
| ID ID Start:Size |
|=======================================================|
| 0 MIG 1g.10gb 19 9 2:1 |
+-------------------------------------------------------+
| 0 MIG 1g.10gb 19 10 3:1 |
+-------------------------------------------------------+
| 0 MIG 2g.20gb 14 3 0:2 |
+-------------------------------------------------------+
| 0 MIG 3g.40gb 9 2 4:4 |
+-------------------------------------------------------+
| 1 MIG 1g.10gb 19 13 2:1 |
+-------------------------------------------------------+
| 1 MIG 1g.10gb 19 14 3:1 |
+-------------------------------------------------------+
| 1 MIG 2g.20gb 14 5 0:2 |
+-------------------------------------------------------+
| 1 MIG 3g.40gb 9 1 4:4 |
+-------------------------------------------------------+
| 2 MIG 1g.10gb 19 9 2:1 |
+-------------------------------------------------------+
| 2 MIG 1g.10gb 19 10 3:1 |
+-------------------------------------------------------+
| 2 MIG 2g.20gb 14 3 0:2 |
+-------------------------------------------------------+
| 2 MIG 3g.40gb 9 2 4:4 |
+-------------------------------------------------------+
| 3 MIG 1g.10gb 19 9 2:1 |
+-------------------------------------------------------+
| 3 MIG 1g.10gb 19 10 3:1 |
+-------------------------------------------------------+
| 3 MIG 2g.20gb 14 3 0:2 |
+-------------------------------------------------------+
| 3 MIG 3g.40gb 9 2 4:4 |
+-------------------------------------------------------+
| 4 MIG 1g.10gb 19 9 2:1 |
+-------------------------------------------------------+
| 4 MIG 1g.10gb 19 10 3:1 |
+-------------------------------------------------------+
| 4 MIG 2g.20gb 14 3 0:2 |
+-------------------------------------------------------+
| 4 MIG 3g.40gb 9 2 4:4 |
+-------------------------------------------------------+
| 5 MIG 1g.10gb 19 9 2:1 |
+-------------------------------------------------------+
| 5 MIG 1g.10gb 19 10 3:1 |
+-------------------------------------------------------+
| 5 MIG 2g.20gb 14 3 0:2 |
+-------------------------------------------------------+
| 5 MIG 3g.40gb 9 2 4:4 |
+-------------------------------------------------------+
sudo nvidia-smi mig -lci
+--------------------------------------------------------------------+
| Compute instances: |
| GPU GPU Name Profile Instance Placement |
| Instance ID ID Start:Size |
| ID |
|====================================================================|
| 0 9 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 10 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 3 MIG 2g.20gb 1 0 0:2 |
+--------------------------------------------------------------------+
| 0 2 MIG 3g.40gb 2 0 0:3 |
+--------------------------------------------------------------------+
| 1 13 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 1 14 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 1 5 MIG 2g.20gb 1 0 0:2 |
+--------------------------------------------------------------------+
| 1 1 MIG 3g.40gb 2 0 0:3 |
+--------------------------------------------------------------------+
| 2 9 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 2 10 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 2 3 MIG 2g.20gb 1 0 0:2 |
+--------------------------------------------------------------------+
| 2 2 MIG 3g.40gb 2 0 0:3 |
+--------------------------------------------------------------------+
| 3 9 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 3 10 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 3 3 MIG 2g.20gb 1 0 0:2 |
+--------------------------------------------------------------------+
| 3 2 MIG 3g.40gb 2 0 0:3 |
+--------------------------------------------------------------------+
| 4 9 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 4 10 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 4 3 MIG 2g.20gb 1 0 0:2 |
+--------------------------------------------------------------------+
| 4 2 MIG 3g.40gb 2 0 0:3 |
+--------------------------------------------------------------------+
| 5 9 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 5 10 MIG 1g.10gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 5 3 MIG 2g.20gb 1 0 0:2 |
+--------------------------------------------------------------------+
| 5 2 MIG 3g.40gb 2 0 0:3 |
+--------------------------------------------------------------------+
Based on the configuration attached, I would expect GR_ENGINE_ACTIVE to reach 100% as long as the CIs are saturated.
Would you mind testing on this same machine to confirm you're still seeing the values in the first comment?
Also, how are you generating a workload?
These values are consistent across all 2g.20gb and 3g.40gb instances in a 6-node, 48-GPU Kubernetes cluster. The numbers in the original post were derived by querying Prometheus for the maximum value of this metric across all instances over a 7-day window.
This encompasses a variety of workloads, but I know for sure that there is some large batch-size LLM inference in there. This is a workload that achieves >90% utilization on a full A100, so it should have no trouble saturating the compute on a smaller MIG instance.
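For reference, a query of roughly this shape produces that kind of per-profile maximum (a sketch; the MIG-related label name is an assumption and depends on the dcgm-exporter version and configuration):
# Per-MIG-profile maximum of GR_ENGINE_ACTIVE over the last 7 days
max by (GPU_I_PROFILE) (max_over_time(DCGM_FI_PROF_GR_ENGINE_ACTIVE[7d]))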
For what it's worth, here's the mig-parted config that we're providing via the GPU Operator.
version: v1
mig-configs:
  "a100-80gb-x8-balanced":
    - devices: [0, 1, 2, 3, 4, 5]
      mig-enabled: true
      mig-devices:
        "1g.10gb": 2
        "2g.20gb": 1
        "3g.40gb": 1
    - devices: [6, 7]
      mig-enabled: false
@nvidia-aalsudani Any idea what's going on here? Do you need more information?
Hello @neggert, I'd like to suggest running the dcgmproftester12 tool from the DCGM package to create a synthetic load on the node (not in a pod) where MIG is enabled. I also have a hunch: does your CUDA application attempt to use more than one MIG instance? It's important to note that MIG does not behave like a physical GPU, and a CUDA application can only utilize the first MIG instance it detects. To run CUDA load on all MIG instances, the dcgmproftester tool uses a "fork bomb" approach: it forks one process per MIG instance and sets CUDA_VISIBLE_DEVICES in each child to just that single MIG instance.
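As a rough sketch of that per-instance pinning (the workload binary name and the MIG UUID placeholders are illustrative; the actual MIG-... UUIDs come from nvidia-smi -L):
nvidia-smi -L                                        # lists GPUs and their MIG device UUIDs (MIG-...)
CUDA_VISIBLE_DEVICES=MIG-<uuid-1> ./my_cuda_app &    # one process per MIG instance,
CUDA_VISIBLE_DEVICES=MIG-<uuid-2> ./my_cuda_app &    # each process seeing exactly one instance
wait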
@nikkon-dev Sorry for the slow response. It took us a while to free up a machine I could access bare metal on. I get the same results when running dcgmproftester12 in the host OS.
sudo dcgmproftester12 --no-dcgm-validation -t 1001 -d 600 -i 3 --target-max-value
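While such a test runs, the field can also be watched directly on the node, roughly like this (a sketch; field 1001 corresponds to DCGM_FI_PROF_GR_ENGINE_ACTIVE, the same field the tester targets with -t 1001):
dcgmi dmon -e 1001    # stream GR_ENGINE_ACTIVE samples from DCGM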