
`dcgmi diag -i <GPU>` not working correctly

Open bergentruckung opened this issue 1 year ago • 3 comments

Hello,

We have a box with multiple A100 80GB GPUs. Some of the GPUs have MIG enabled (3 x 2g.20gb instances each), while others have MIG disabled. When we run the DCGM diagnostics to check for issues on a GPU that has MIG disabled, it complains about the MIG configuration of the other GPUs, which `dcgmi diag` does not accept (from here).

For example, GPU 0 has MIG disabled and I've explicitly asked for the diagnostics to run against it, but DCGM complains about GPU 2's MIG configuration.

% dcgmi diag -i 0 -r 1
GPU 2's MIG configuration is incompatible with the diagnostic because it prevents access to the entire GPU.

The same happens when we explicitly create a new DCGM group, add only the MIG-disabled GPU to it, and run the diagnostics against that group.
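The group-based attempt described above can be sketched as follows. This is a minimal sketch, not the exact commands from the report: the helper `group_diag_commands`, the group name `mig_disabled`, and the group ID `2` are hypothetical, and the real group ID has to be read from the output of the create step. The sketch only builds and prints the `dcgmi` invocations; it does not run them.

```python
import shlex


def group_diag_commands(group_name, group_id, gpu_ids, run_level=1):
    """Build the dcgmi invocations for a group-scoped diagnostic run.

    group_id is an assumption here: dcgmi reports the ID when the group
    is created, so it cannot be known before the first command has run.
    """
    gpu_list = ",".join(str(gpu) for gpu in gpu_ids)
    return [
        # 1. Create an empty group.
        ["dcgmi", "group", "-c", group_name],
        # 2. Add only the MIG-disabled GPU(s) to that group.
        ["dcgmi", "group", "-g", str(group_id), "-a", gpu_list],
        # 3. Run the diagnostic against the group instead of a GPU list.
        ["dcgmi", "diag", "-g", str(group_id), "-r", str(run_level)],
    ]


if __name__ == "__main__":
    # Hypothetical values: group name "mig_disabled", group ID 2, GPU 0.
    for cmd in group_diag_commands("mig_disabled", 2, [0]):
        print(shlex.join(cmd))
```

According to the report, the last command still fails with the same message about GPU 2, so scoping the run to a group does not avoid the MIG check on DCGM 3.1.3.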

We're on DCGM 3.1.3. Is there a workaround for this? We don't want to disable MIG on the other GPUs.

Thanks in advance.

bergentruckung, Mar 01 '23 18:03

@bergentruckung,

Could you provide the output of `nvidia-smi` and `dcgmi discovery -c`?

nikkon-dev, Mar 02 '23 01:03

Yup, sure. Here's nvidia-smi:

% nvidia-smi
Thu Mar  2 01:24:02 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:17:00.0 Off |                    0 |
| N/A   29C    P0    43W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:65:00.0 Off |                    0 |
| N/A   29C    P0    43W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  On   | 00000000:CA:00.0 Off |                   On |
| N/A   30C    P0    42W / 300W |     39MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  On   | 00000000:E3:00.0 Off |                   On |
| N/A   62C    P0   148W / 300W |  15939MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  2    3   0   0  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    4   0   1  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    5   0   2  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    3   0   0  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    4   0   1  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    5   0   2  |  15913MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      2MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    3    5    0    3940316      C   ...ython-3.10/std/bin/python    15894MiB |
+-----------------------------------------------------------------------------+

Here's dcgmi discovery -c:

% dcgmi discovery -c
+-------------------+--------------------------------------------------------------------+
| Instance Hierarchy                                                                     |
+===================+====================================================================+
| GPU 2             | GPU GPU-94dfeb51-47c8-3804-f1d8-7752e08a9ab3 (EntityID: 2)         |
| -> I 2/3          | GPU Instance (EntityID: 14)                                        |
|    -> CI 2/3/0    | Compute Instance (EntityID: 14)                                    |
| -> I 2/4          | GPU Instance (EntityID: 15)                                        |
|    -> CI 2/4/0    | Compute Instance (EntityID: 15)                                    |
| -> I 2/5          | GPU Instance (EntityID: 16)                                        |
|    -> CI 2/5/0    | Compute Instance (EntityID: 16)                                    |
+-------------------+--------------------------------------------------------------------+
| GPU 3             | GPU GPU-9f0ac54c-bdc1-a128-0ed2-15a38c48769b (EntityID: 3)         |
| -> I 3/3          | GPU Instance (EntityID: 21)                                        |
|    -> CI 3/3/0    | Compute Instance (EntityID: 21)                                    |
| -> I 3/4          | GPU Instance (EntityID: 22)                                        |
|    -> CI 3/4/0    | Compute Instance (EntityID: 22)                                    |
| -> I 3/5          | GPU Instance (EntityID: 23)                                        |
|    -> CI 3/5/0    | Compute Instance (EntityID: 23)                                    |
+-------------------+--------------------------------------------------------------------+

bergentruckung, Mar 02 '23 07:03

Running into similar issues.

yasirjamal87, Aug 05 '23 05:08