k8s-device-plugin

How can I hide a subset of GPUs?

nfsp3k opened this issue on Aug 22, 2023 · 5 comments

1. Issue or feature description

I'd like to allocate only a subset of the GPUs on a node to containers, but I can't find a way to do it. The environment variable NVIDIA_VISIBLE_DEVICES does not seem to work properly, as mentioned in #197 and #236. Even though I set NVIDIA_VISIBLE_DEVICES to a subset of GPU IDs such as 0,1, all GPUs are still being scheduled to containers.

Please correct me if I'm missing something. Thank you!

2. Steps to reproduce the issue

  1. Component Versions

    • k8s: v1.21
    • gpu-operator: v22.9.2
  2. You can see that NVIDIA_VISIBLE_DEVICES is set properly in values.yaml:

$ vi values.yaml
...
devicePlugin:
  enabled: true
  repository: nvcr.io/nvidia
  image: k8s-device-plugin
  version: v0.13.0-ubi8
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  args: []
  env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY
      value: envvar
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: "0,1"
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
  resources: {}
...
  3. Install gpu-operator by executing the following:
$ helm install -n gpu-operator gpu-operator ./ -f ./values.yaml --set psp.enabled=true
W0822 17:43:42.630182 3461180 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0822 17:43:42.683118 3461180 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: gpu-operator
LAST DEPLOYED: Tue Aug 22 17:43:42 2023
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
$ 
  4. You can see that all the corresponding pods are running:
$ k -n gpu-operator get all
NAME                                                              READY   STATUS      RESTARTS   AGE
pod/gpu-feature-discovery-bf8jn                                   1/1     Running     0          6m36s
pod/gpu-operator-59db9d5cfb-bssq7                                 1/1     Running     0          7m21s
pod/gpu-operator-node-feature-discovery-master-59b4b67f4f-qbjqj   1/1     Running     0          7m21s
pod/gpu-operator-node-feature-discovery-worker-92n9t              1/1     Running     0          7m21s
pod/gpu-operator-node-feature-discovery-worker-9t7ft              1/1     Running     0          7m21s
pod/gpu-operator-node-feature-discovery-worker-hbmtb              1/1     Running     0          7m21s
pod/gpu-operator-node-feature-discovery-worker-r972t              1/1     Running     0          7m21s
pod/gpu-operator-node-feature-discovery-worker-s8sv9              1/1     Running     0          7m21s
pod/gpu-operator-node-feature-discovery-worker-ttwrg              1/1     Running     0          7m21s
pod/gpu-operator-node-feature-discovery-worker-xbgpn              1/1     Running     0          7m21s
pod/nvidia-container-toolkit-daemonset-c4hn6                      1/1     Running     0          6m36s
pod/nvidia-cuda-validator-zskzg                                   0/1     Completed   0          6m12s
pod/nvidia-dcgm-exporter-52pcn                                    1/1     Running     0          6m36s
pod/nvidia-device-plugin-daemonset-qtjkb                          1/1     Running     0          6m36s
pod/nvidia-device-plugin-validator-8b4ps                          0/1     Completed   0          5m56s
pod/nvidia-mig-manager-c2s22                                      1/1     Running     0          6m36s
pod/nvidia-operator-validator-bltcz                               1/1     Running     0          6m36s

NAME                                                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
service/dcgm-exporter                                NodePort    10.233.3.232    <none>        9400:30081/TCP   173d
service/gpu-operator                                 ClusterIP   10.233.32.26    <none>        8080/TCP         6m36s
service/gpu-operator-node-feature-discovery-master   ClusterIP   10.233.59.128   <none>        8080/TCP         7m21s
service/nvidia-dcgm-exporter                         ClusterIP   10.233.33.249   <none>        9400/TCP         6m36s

NAME                                                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
daemonset.apps/gpu-feature-discovery                        1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true   6m36s
daemonset.apps/gpu-operator-node-feature-discovery-worker   7         7         7       7            7           <none>                                             7m21s
daemonset.apps/nvidia-container-toolkit-daemonset           1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true       6m36s
daemonset.apps/nvidia-dcgm-exporter                         1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true           6m36s
daemonset.apps/nvidia-device-plugin-daemonset               1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true           6m36s
daemonset.apps/nvidia-mig-manager                           1         1         1       1            1           nvidia.com/gpu.deploy.mig-manager=true             6m36s
daemonset.apps/nvidia-operator-validator                    1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true      6m36s

NAME                                                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/gpu-operator                                 1/1     1            1           7m21s
deployment.apps/gpu-operator-node-feature-discovery-master   1/1     1            1           7m21s

NAME                                                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/gpu-operator-59db9d5cfb                                 1         1         1       7m21s
replicaset.apps/gpu-operator-node-feature-discovery-master-59b4b67f4f   1         1         1       7m21s
  5. You can also see that the environment variable NVIDIA_VISIBLE_DEVICES is set correctly in the device-plugin daemonset definition:
$ k -n gpu-operator describe ds nvidia-device-plugin-daemonset
...
    Environment:
      PASS_DEVICE_SPECS:           true
      FAIL_ON_INIT_ERROR:          true
      DEVICE_LIST_STRATEGY:        envvar
      DEVICE_ID_STRATEGY:          uuid
      NVIDIA_VISIBLE_DEVICES:      0,1
...
  6. But the number of allocatable GPUs on the node is still 8 (a quick jsonpath check is sketched after the nvidia-smi output below):
$ k describe no gpu-node-01
Capacity:
...
  nvidia.com/gpu:         8
Allocatable:
...
  nvidia.com/gpu:         8
  7. And a pod can still be allocated more than one GPU:
$ k describe po mychat-57f6d88d96-strp5
...
    Limits:
      nvidia.com/gpu:  4
    Requests:
      nvidia.com/gpu:  4
...

$ k exec -ti mychat-57f6d88d96-strp5 -- bash
root@mychat-57f6d88d96-strp5:/opt# nvidia-smi
Tue Aug 22 08:56:23 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   34C    P0    68W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   35C    P0    71W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:C8:00.0 Off |                    0 |
| N/A   35C    P0    67W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:CB:00.0 Off |                    0 |
| N/A   35C    P0    67W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
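
As referenced in step 6, a quick way to re-read the advertised GPU count without the full describe output is to query the node's allocatable map directly. A minimal sketch, using the node name from above:

$ kubectl get node gpu-node-01 -o jsonpath='{.status.allocatable}'
# prints the node's allocatable resource map; nvidia.com/gpu still reports 8 here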

nfsp3k · Aug 22 '23 08:08

The primary issue here is that the device plugin is started in privileged mode to have access to the device nodes for enumeration. This means that the NVIDIA_VISIBLE_DEVICES environment variable has no effect -- except that it ensures that the required libraries for enumerating the devices are mounted into the container.

Allowing a set of devices to be selected for use by the device plugin is something we have discussed internally, but we don't have a timeline or a concrete implementation plan.
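
For context, the operator-managed device-plugin daemonset runs the plugin container privileged, roughly like the following fragment of the pod spec (a sketch; the container name is an assumption and the exact fields vary between operator versions):

containers:
  - name: nvidia-device-plugin      # name assumed for illustration
    securityContext:
      privileged: true              # grants access to all /dev/nvidia* device nodes for enumeration,
                                    # which is why NVIDIA_VISIBLE_DEVICES is effectively bypassed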

elezar · Aug 22 '23 09:08

If you remove the NVIDIA_MIG_MONITOR_DEVICES environment variable from nvidia-device-plugin-daemonset, you can switch it to privileged: false, and NVIDIA_VISIBLE_DEVICES then works for me. I don't have cards with MIG support.

I'm able to expose an arbitrary subset of the GPUs on my node: kubectl describe node reports only 2 of my GPUs, and pods are scheduled only on those GPUs.

# gpu-operator-v23.9.1
driver:
  enabled: false
migManager:
  enabled: false
mig:
  strategy: single
toolkit:
  enabled: true
nfd:
  enabled: true
devicePlugin:
  env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY
      value: envvar
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: GPU-XXX,GPU-XXX
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all

Maybe I should do the same thing to the gpu-feature-discovery daemonset.
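
For reference, the manual change described above looks roughly like this (a sketch; the operator may reconcile manual edits back, so expressing the change through the operator's values is preferable where it is supported):

$ kubectl -n gpu-operator edit ds nvidia-device-plugin-daemonset

Then, in the device-plugin container spec, delete the NVIDIA_MIG_MONITOR_DEVICES entry from env and set:

    securityContext:
      privileged: false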

Baenimyr · Feb 06 '24 15:02

@Baenimyr Is this configuration for the NVIDIA device plugin? How do I make just one node hide a subset of its GPUs?

jeffreyyjp · Apr 24 '24 13:04

@Baenimyr Is this configuration for the NVIDIA device plugin? How do I make just one node hide a subset of its GPUs?

With NVIDIA_VISIBLE_DEVICES and privileged: false, the configuration is the same for all nodes because nvidia-device-plugin is a daemonset. NVIDIA GPU UUIDs are unique, so NVIDIA_VISIBLE_DEVICES is effectively the list of all visible cards across the whole cluster. It is a whitelist, not a blacklist. I assume you don't have two nodes sharing the same machine and trying to allocate a distinct set of GPUs to each of them; running two nodes on the same machine would be a bad idea anyway.
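
To build that whitelist, the per-GPU UUIDs can be listed on each node with nvidia-smi (the UUIDs below are placeholders):

$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)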

Baenimyr · Apr 24 '24 14:04