
Wrong node capacity and allocatable when using MIG

Open · xhejtman opened this issue 2 years ago · 7 comments

1. Quick Debug Information

  • OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): Ubuntu 22.04
  • Kernel Version: 6.2.0-37-generic
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd 1.7.7
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): Rancher/RKE2 1.27.8
  • GPU Operator Version: 23.9.1

2. Issue or feature description

When MIG is enabled, both the MIG resources and the nvidia.com/gpu resource are reported as allocatable:

Allocatable:
  cerit.io/gpu-count:      2
  cerit.io/gpu-mem:        0
  cpu:                     64
  ephemeral-storage:       7104643354787
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  519659388Ki
  nvidia.com/gpu:          2
  nvidia.com/mig-1g.10gb:  6
  nvidia.com/mig-2g.20gb:  4
  nvidia.com/mig-3g.40gb:  0
  pods:                    160

which means that both nvidia.com/gpu and nvidia.com/mig-1g.10gb requests can land on the node; however, the nvidia.com/gpu request fails to inject a GPU into the container.
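
To illustrate, a minimal pod sketch (image and names are placeholders): requesting the MIG resource behaves as expected, while the same pod with nvidia.com/gpu: 1 also gets scheduled on this node but ends up with no GPU injected.

  apiVersion: v1
  kind: Pod
  metadata:
    name: mig-test
  spec:
    restartPolicy: Never
    containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04  # placeholder image
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1  # swap for nvidia.com/gpu: 1 to hit the reported failure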

3. Steps to reproduce the issue

Enable MIG on an A100 GPU.
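
One way to do this with the gpu-operator is via the MIG manager, by labelling the node with the desired MIG profile; a sketch, where the profile name is illustrative rather than the exact layout used here:

  kubectl label node kub-as6 nvidia.com/mig.config=all-1g.10gb --overwrite
  # wait for the MIG manager to finish reconfiguring the node
  kubectl get node kub-as6 -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'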

This may just be a bug in Kubernetes rather than in the gpu-operator itself.

xhejtman avatar Dec 17 '23 12:12 xhejtman

@xhejtman this is controlled by the mig.strategy: mixed parameter. When the mixed strategy is used, the device plugin will:

  • Expose any GPUs not in MIG mode using the traditional nvidia.com/gpu resource type
  • Expose individual MIG devices with a new resource type following the schema nvidia.com/mig-<slice_count>g.<memory_size>gb

So in your case, you do seem to have some GPUs with MIG disabled and others with it enabled. Is that correct? Otherwise this would be a bug.
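
For reference, a minimal sketch of where this strategy is configured, assuming the default Helm-managed ClusterPolicy named cluster-policy; setting it to single instead of mixed would expose the MIG devices under the plain nvidia.com/gpu resource name instead:

  kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    --type merge \
    -p '{"spec": {"mig": {"strategy": "mixed"}}}'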

shivamerla avatar Dec 21 '23 00:12 shivamerla

I have both GPUs set to a MIG configuration:

Thu Dec 21 00:55:18 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:27:00.0 Off |                   On |
| N/A   50C    P0              83W / 300W |     38MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  | 00000000:A3:00.0 Off |                   On |
| N/A   48C    P0              81W / 300W |     38MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    3   0   0  |              10MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    5   0   1  |              10MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    9   0   2  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   10   0   3  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   13   0   4  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    3   0   0  |              10MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    5   0   1  |              10MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    9   0   2  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1   10   0   3  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1   13   0   4  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

xhejtman avatar Dec 21 '23 00:12 xhejtman

Ah, this seems to be a bug then. Will look into this. cc @elezar @klueska

shivamerla avatar Dec 21 '23 01:12 shivamerla

@xhejtman could you provide the logs from the device plugin?

elezar avatar Jan 08 '24 11:01 elezar

2.log

In the meantime, I checked that Kubernetes 1.27.8 is not the problem; I have a different cluster with the 23.6.1 operator and it works fine there.

xhejtman avatar Jan 08 '24 12:01 xhejtman

Looking at the logs, we're only starting two gRPC servers:

2023-12-18T12:40:20.600590354+01:00 stderr F I1218 11:40:20.600444       1 server.go:165] Starting GRPC server for 'nvidia.com/mig-1g.10gb'
2023-12-18T12:40:20.601080041+01:00 stderr F I1218 11:40:20.600967       1 server.go:117] Starting to serve 'nvidia.com/mig-1g.10gb' on /var/lib/kubelet/device-plugins/nvidia-mig-1g.10gb.sock
2023-12-18T12:40:20.633441912+01:00 stderr F I1218 11:40:20.632289       1 server.go:125] Registered device plugin for 'nvidia.com/mig-1g.10gb' with Kubelet
2023-12-18T12:40:20.633473571+01:00 stderr F I1218 11:40:20.632494       1 server.go:165] Starting GRPC server for 'nvidia.com/mig-2g.20gb'
2023-12-18T12:40:20.633492757+01:00 stderr F I1218 11:40:20.632946       1 server.go:117] Starting to serve 'nvidia.com/mig-2g.20gb' on /var/lib/kubelet/device-plugins/nvidia-mig-2g.20gb.sock
2023-12-18T12:40:20.649231279+01:00 stderr F I1218 11:40:20.644793       1 server.go:125] Registered device plugin for 'nvidia.com/mig-2g.20gb' with Kubelet

meaning that the running instance of the plugin should only be exposing these as allocatable resources.

Could you confirm that /var/lib/kubelet/device-plugins/ only references these two resource types? It could be that, when the MIG config update was applied, the other socket was not removed.
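
A quick way to double-check what the kubelet still has registered, in case a stale socket or checkpoint entry survived the MIG config update; a sketch assuming jq is available on the node (the checkpoint layout may differ between kubelet versions):

  ls -1 /var/lib/kubelet/device-plugins/
  # the device-manager checkpoint also records the registered resource names
  jq '.Data.RegisteredDevices | keys' /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint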

elezar avatar Jan 08 '24 12:01 elezar

root@kub-as6:/var/lib/kubelet/device-plugins# ls -1
kubelet.sock
kubelet_internal_checkpoint
nvidia-mig-1g.10gb.sock
nvidia-mig-2g.20gb.sock
root@kub-as6:/var/lib/kubelet/device-plugins#

xhejtman avatar Jan 08 '24 12:01 xhejtman