DCGM
`dcgmi diag -i <GPU>` not working correctly
Hello,
We have a box with multiple A100 80GB GPUs. Some of the GPUs have MIG enabled (3 x 2g.20gb instances each), while others have MIG disabled. When we run DCGM diagnostics against one of the MIG-disabled GPUs to check it for issues, `dcgmi diag` fails, complaining about the MIG configuration of the other GPUs (from here).
For example, GPU 0 has MIG disabled and I've explicitly asked to run diagnostics against it, but DCGM complains about GPU 2's MIG configuration.
% dcgmi diag -i 0 -r 1
GPU 2's MIG configuration is incompatible with the diagnostic because it prevents access to the entire GPU.
The same happens when we explicitly create a new DCGM group, add only the MIG-disabled GPU to it, and run the diagnostics against that group.
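For anyone trying to reproduce the group-based attempt, here is a minimal Python sketch of those steps. It is not the exact sequence we ran, just an illustration. It assumes `dcgmi` is on the PATH and that `dcgmi group -c` prints a line containing "group ID of N" (the wording may differ between DCGM versions). The helper names `parse_group_id`, `diag_commands`, and `run_group_diag` are hypothetical.

```python
import re
import subprocess


def parse_group_id(create_output: str) -> int:
    """Extract the numeric group ID from `dcgmi group -c` output.

    Assumption: the output contains the phrase "group ID of <N>".
    """
    match = re.search(r"group ID of (\d+)", create_output)
    if match is None:
        raise ValueError(f"no group ID found in: {create_output!r}")
    return int(match.group(1))


def diag_commands(group_id: int, gpu_ids: list[int], level: int = 1) -> list[list[str]]:
    """Build the dcgmi calls that add GPUs to a group and run diag on it."""
    gpu_list = ",".join(str(g) for g in gpu_ids)
    return [
        ["dcgmi", "group", "-g", str(group_id), "-a", gpu_list],
        ["dcgmi", "diag", "-g", str(group_id), "-r", str(level)],
    ]


def run_group_diag(name: str, gpu_ids: list[int], level: int = 1) -> str:
    """Create a group, add the given GPUs, run diag; return the diag output."""
    created = subprocess.run(
        ["dcgmi", "group", "-c", name],
        capture_output=True, text=True, check=True,
    )
    group_id = parse_group_id(created.stdout)
    output = ""
    for cmd in diag_commands(group_id, gpu_ids, level):
        # No check=True here: a failing diag still prints the message we want.
        result = subprocess.run(cmd, capture_output=True, text=True)
        output = result.stdout + result.stderr
    return output
```

Calling `run_group_diag("mig_off_only", [0])` on a box like ours would be expected to hit the same MIG error as the plain `-i 0` run, since the group only narrows which GPUs are targeted.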
We're on DCGM 3.1.3. Is there a workaround for this? We don't want to disable MIG on the other GPUs.
Thanks in advance.
@bergentruckung,
Could you provide the output of `nvidia-smi` and `dcgmi discovery -c`?
Yup, sure. Here's `nvidia-smi`:
% nvidia-smi
Thu Mar 2 01:24:02 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... On | 00000000:17:00.0 Off | 0 |
| N/A 29C P0 43W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... On | 00000000:65:00.0 Off | 0 |
| N/A 29C P0 43W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100 80G... On | 00000000:CA:00.0 Off | On |
| N/A 30C P0 42W / 300W | 39MiB / 81920MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100 80G... On | 00000000:E3:00.0 Off | On |
| N/A 62C P0 148W / 300W | 15939MiB / 81920MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 2 3 0 0 | 13MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 2 4 0 1 | 13MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 2 5 0 2 | 13MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 3 3 0 0 | 13MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 3 4 0 1 | 13MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 3 5 0 2 | 15913MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 2MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 3 5 0 3940316 C ...ython-3.10/std/bin/python 15894MiB |
+-----------------------------------------------------------------------------+
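In case it helps with scripting around this, the MIG mode shown in the table can also be read programmatically. Below is a rough sketch that splits GPU indices by MIG mode. It assumes the `index` and `mig.mode.current` fields of `nvidia-smi --query-gpu` and the usual `, `-separated CSV output. The function name `split_by_mig_mode` is made up for illustration.

```python
import subprocess

# Query used to obtain one "index, mode" line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=index,mig.mode.current", "--format=csv,noheader"]


def split_by_mig_mode(csv_text: str) -> tuple[list[int], list[int]]:
    """Return (mig_enabled, mig_disabled) GPU indices from the CSV text.

    Anything other than "Enabled" (including "[N/A]" on GPUs without MIG
    support) is treated as MIG-disabled.
    """
    enabled, disabled = [], []
    for line in csv_text.strip().splitlines():
        index, mode = (field.strip() for field in line.split(","))
        (enabled if mode == "Enabled" else disabled).append(int(index))
    return enabled, disabled


def query_mig_modes() -> tuple[list[int], list[int]]:
    """Run nvidia-smi and classify the GPUs it reports."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return split_by_mig_mode(result.stdout)
```

For a layout like the one above (GPUs 0 and 1 disabled, 2 and 3 enabled), `split_by_mig_mode` returns `([2, 3], [0, 1])`.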
Here's `dcgmi discovery -c`:
% dcgmi discovery -c
+-------------------+--------------------------------------------------------------------+
| Instance Hierarchy |
+===================+====================================================================+
| GPU 2 | GPU GPU-94dfeb51-47c8-3804-f1d8-7752e08a9ab3 (EntityID: 2) |
| -> I 2/3 | GPU Instance (EntityID: 14) |
| -> CI 2/3/0 | Compute Instance (EntityID: 14) |
| -> I 2/4 | GPU Instance (EntityID: 15) |
| -> CI 2/4/0 | Compute Instance (EntityID: 15) |
| -> I 2/5 | GPU Instance (EntityID: 16) |
| -> CI 2/5/0 | Compute Instance (EntityID: 16) |
+-------------------+--------------------------------------------------------------------+
| GPU 3 | GPU GPU-9f0ac54c-bdc1-a128-0ed2-15a38c48769b (EntityID: 3) |
| -> I 3/3 | GPU Instance (EntityID: 21) |
| -> CI 3/3/0 | Compute Instance (EntityID: 21) |
| -> I 3/4 | GPU Instance (EntityID: 22) |
| -> CI 3/4/0 | Compute Instance (EntityID: 22) |
| -> I 3/5 | GPU Instance (EntityID: 23) |
| -> CI 3/5/0 | Compute Instance (EntityID: 23) |
+-------------------+--------------------------------------------------------------------+
Running into similar issues.