
nvidia/gpu-operator exposes all GPUs in Kubeflow


The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • [ ] Are you running on an Ubuntu 18.04 node?
  • [x] Are you running Kubernetes v1.13+?
  • [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • [ ] Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • [ ] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?

1. Issue or feature description

Hi, I'm deploying Kubeflow Manifests with nvidia/gpu-operator and running experiments in Jupyter notebooks. However, even though I don't specify a Number of GPUs, which I expect to mean no GPUs are exposed, all GPUs are still visible in the notebooks. I don't know what causes the problem or whether there is a workaround. Do you have any idea?

2. Steps to reproduce the issue

helm install nvidia/gpu-operator \
  --version=v22.9.0 \
  --generate-name \
  --create-namespace \
  --namespace=gpu-operator-resources \
  --set driver.enabled=false

hoangtnm avatar Oct 16 '22 05:10 hoangtnm

@hoangtnm This will happen when NVIDIA_VISIBLE_DEVICES=all environment variable is set in the image you are using (which is true for most of the cuda images). Please refer to this document on ways to prevent using this env without privileged mode.
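
For example, you can confirm that the notebook image sets this variable by checking the container environment (pod and namespace names below are placeholders):

kubectl exec -n <notebook-namespace> <notebook-pod> -- env | grep NVIDIA_VISIBLE_DEVICES
# CUDA base images typically set NVIDIA_VISIBLE_DEVICES=all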

shivamerla avatar Oct 17 '22 05:10 shivamerla

@shivamerla Thank you for your prompt reply. As I understand the document, the following modifications should be made in /etc/nvidia-container-runtime/config.toml:

accept-nvidia-visible-devices-envvar-when-unprivileged = true
accept-nvidia-visible-devices-as-volume-mounts = false

However, when I read another document on page 3 by @klueska, the modifications should be:

accept-nvidia-visible-devices-envvar-when-unprivileged = false
accept-nvidia-visible-devices-as-volume-mounts = true

Which one is the right way to solve this problem?

hoangtnm avatar Oct 17 '22 10:10 hoangtnm

@hoangtnm the document that was linked by @shivamerla describes the options and their behaviour. If you want to prevent the use of the NVIDIA_VISIBLE_DEVICES environment variable in unprivileged containers, you will have to set:

accept-nvidia-visible-devices-envvar-when-unprivileged = false

If you then want the device plugin to use volume mounts (and not the envvar) to specify devices you additionally need to set:

accept-nvidia-visible-devices-as-volume-mounts = true

And ensure that the device plugin is started with the device-list-strategy=volume-mounts option set.
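
So for your use case the relevant section of /etc/nvidia-container-runtime/config.toml ends up with both options set:

accept-nvidia-visible-devices-envvar-when-unprivileged = false
accept-nvidia-visible-devices-as-volume-mounts = true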

elezar avatar Oct 17 '22 11:10 elezar

To make this clear, with the operator the following params need to be set during helm install.

devicePlugin:
  env:
    ...
    - name: DEVICE_LIST_STRATEGY
      value: volume-mounts
toolkit:
  env:
    ...
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"

shivamerla avatar Oct 17 '22 17:10 shivamerla

Hi @elezar @shivamerla, thank you for your prompt replies, they helped me a lot :D However, I just want to confirm the steps, along with how to apply those params during the gpu-operator helm install:

helm repo add nvidia https://nvidia.github.io/gpu-operator \
  && helm repo update \
  && helm install nvidia/gpu-operator \
  --version=v22.9.0 \
  --generate-name \
  --create-namespace \
  --namespace=gpu-operator-resources \
  --set driver.enabled=false \
  --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
  --set devicePlugin.env[0].value="volume-mounts" \
  --set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
  --set toolkit.env[0].value="false" \
  --wait

Is this correct, or am I missing something to make it work? Besides, as I understand it, with this helm install I no longer need to install nvidia-container-runtime and configure ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED on the host; the toolkit daemonset will handle ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED inside the container without any problem, right?

hoangtnm avatar Oct 18 '22 02:10 hoangtnm

I think your understanding is correct. Using the toolkit daemonset should configure the toolkit as required. Note that you would also need to set ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS="true" for the toolkit, to trigger it to consider volume mounts.
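
As a quick sanity check after the reinstall, a pod that does not request any nvidia.com resources should no longer see the GPUs, for example (the image here is just an example, and the exact failure mode depends on the image contents):

kubectl run no-gpu-check --rm -it --restart=Never --image=nvidia/cuda:11.0-base -- nvidia-smi
# expected: nvidia-smi is either not available or reports no devices, instead of listing all GPUs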

elezar avatar Oct 18 '22 08:10 elezar

@elezar For me, it solved the issue with the environment variable; however, when I do a small test with volume mounts enabled (configuration), it runs into errors:

cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "example.com/nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/mig-1g.6gb: 1
EOF
pod/cuda-vectoradd created

Pod logs:

Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]

Without volumes, it worked!

Alwinator avatar Oct 18 '22 08:10 Alwinator

@Alwinator could you please confirm your GPU operator version and values that you have applied to your helm deployment?

elezar avatar Oct 18 '22 08:10 elezar

OpenShift 4.10.35 with Kubernetes 1.23
OS on nodes: Red Hat Enterprise Linux CoreOS 410.84.202209231843-0
CRI-O: 1.23.3-17.rhaos4.10.git016b1ca.el8
GPU Operator version: 22.9.0
Helm version: v3.6.3

Alwinator avatar Oct 18 '22 09:10 Alwinator


Thanks. And the values used to install the operator?

elezar avatar Oct 18 '22 09:10 elezar

The NVIDIA GPU Operator 22.9.0 is installed in the nvidia-gpu-operator namespace, with automatic update approval and the following ClusterPolicy:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: nvidia-dcgm-exporter-custom-config
    enabled: true
    serviceMonitor:
      enabled: true
  driver:
    certConfig:
      name: ''
    enabled: true
    kernelModuleConfig:
      name: ''
    licensingConfig:
      configMapName: ''
      nlsEnabled: false
    repoConfig:
      configMapName: ''
    rollingUpdate:
      maxUnavailable: '1'
    virtualTopology:
      config: ''
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
    env:
      - name: DEVICE_LIST_STRATEGY
        value: volume-mounts
  mig:
    strategy: mixed
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'true'
  nodeStatusExporter:
    enabled: true
  daemonsets: {}
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    env:
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
        value: 'false'
    installDir: /usr/local/nvidia
status:
  namespace: nvidia-gpu-operator
  state: ready

Alwinator avatar Oct 18 '22 10:10 Alwinator

@Alwinator as mentioned in https://github.com/NVIDIA/gpu-operator/issues/421#issuecomment-1281983276 could you ALSO please set the ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS="true" in the toolkit.env and see if this addresses your issue?

What is happening now is that the device plugin is specifying devices as volume mounts, but the NVIDIA Container Runtime Hook is not configured to interpret these as device requests and most likely does not make any modifications to the container.
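
Concretely, the toolkit section of your ClusterPolicy would then look something like:

toolkit:
  enabled: true
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: 'false'
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: 'true'
  installDir: /usr/local/nvidia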

elezar avatar Oct 18 '22 10:10 elezar

@elezar Thank you for your support. I am sorry, I missed this one. With the new variable set, it works seamlessly. Also, thank you for the explanation that makes a lot of sense. :)

Alwinator avatar Oct 18 '22 11:10 Alwinator

Hi @elezar @shivamerla, I tried to install gpu-operator with the following params but it doesn't work. Do you have any idea?

➜  manifests git:(f038f81) helm repo add nvidia https://nvidia.github.io/gpu-operator \
  && helm repo update \
  && helm install nvidia/gpu-operator \
  --version=v22.9.0 \
  --generate-name \
  --create-namespace \
  --namespace=gpu-operator-resources \
  --set driver.enabled=false \
  --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
  --set devicePlugin.env[0].value="volume-mounts" \
  --set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
  --set toolkit.env[0].value="false" \
  --wait
"nvidia" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
zsh: no matches found: devicePlugin.env[0].name=DEVICE_LIST_STRATEGY

hoangtnm avatar Oct 19 '22 04:10 hoangtnm

@hoangtnm Can you try the command below?

helm install nvidia/gpu-operator \
  --version=v22.9.0 \
  --generate-name \
  --create-namespace \
  --namespace=gpu-operator-resources \
  --set driver.enabled=false \
  --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
  --set devicePlugin.env[0].value="volume-mounts" \
  --set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
  --set-string toolkit.env[0].value="false" \
  --set toolkit.env[1].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS \
  --set-string toolkit.env[1].value="true" \
  --wait

shivamerla avatar Oct 19 '22 04:10 shivamerla

Hi @shivamerla, I fixed the problem by using bash instead of zsh. It seems that zsh treats the square brackets in the --set arguments as glob patterns, so it refuses to pass them through as-is.
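
Alternatively, quoting the --set arguments keeps zsh from expanding the square brackets as glob patterns, e.g.:

  --set 'devicePlugin.env[0].name=DEVICE_LIST_STRATEGY' \
  --set 'devicePlugin.env[0].value=volume-mounts' \
  ...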

hoangtnm avatar Oct 19 '22 04:10 hoangtnm

@shivamerla @elezar I have a question about the params for k8s-device-plugin, which is bundled inside gpu-operator. As I read in the Preventing unprivileged access to GPUs in Kubernetes document on page 4:

helm install \
    --version=0.7.0-rc.7 \
    --generate-name \
    --set securityContext.privileged=true \
    --set deviceListStrategy=volume-mounts \
    nvdp/nvidia-device-plugin

It sets securityContext.privileged=true along with deviceListStrategy=volume-mounts for the k8s-device-plugin v0.7.0-rc.7. Therefore, I wonder whether the securityContext setting is still necessary with the latest gpu-operator, or whether it is configured as privileged by default?

hoangtnm avatar Oct 19 '22 15:10 hoangtnm

@hoangtnm yes, with gpu-operator, device-plugin is always deployed with privileged mode.
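
You can verify this on the cluster by checking the security context of the plugin daemonset (the daemonset name below is the default used by the operator; it may differ across versions):

kubectl get ds -n gpu-operator-resources nvidia-device-plugin-daemonset -o jsonpath='{.spec.template.spec.containers[0].securityContext}'
# expected to include "privileged":true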

shivamerla avatar Oct 19 '22 15:10 shivamerla

@shivamerla thank you for your prompt clarification :D

hoangtnm avatar Oct 19 '22 15:10 hoangtnm

Closing this as original issue has been resolved.

cdesiniotis avatar Nov 29 '22 01:11 cdesiniotis

@shivamerla hi, I meet exactly the same problem: a pod limited by "--limits=nvidia.com/mig-1g.5gb=1" can get all GPUs on that node. Then:

  1. I modified /etc/nvidia-container-runtime/config.toml (screenshot omitted)
  2. installed with helm using these params:
     helm install nvidia/gpu-operator \
       --generate-name \
       --create-namespace \
       --namespace=gpu-operator-resources \
       --set driver.enabled=false \
       --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
       --set devicePlugin.env[0].value="volume-mounts" \
       --set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
       --set-string toolkit.env[0].value="false" \
       --set toolkit.env[1].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS \
       --set-string toolkit.env[1].value="true" \
       --set mig.strategy=mixed \
       --wait
  3. ran a pod with:
     kubectl run -it --rm --image=nvidia/cuda:11.0-base --restart=Never --limits=nvidia.com/mig-1g.5gb=1 mig-none-example -- bash -c "nvidia-smi -L;sleep infinity"
     It seems that I can't use nvidia-smi in that pod (screenshot omitted).

ytaoeer avatar Jun 22 '23 10:06 ytaoeer

It seems to be even more confusing: when I use "kubectl run", I get the result from step 3, but when I use "kubectl apply -f test.yaml", everything is OK.
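
For reference, a minimal test.yaml for this kind of check could look like the following (adapted from the cuda-vectoradd example earlier in the thread; names are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: mig-test
spec:
  restartPolicy: Never
  containers:
  - name: mig-test
    image: nvidia/cuda:11.0-base
    command: ["bash", "-c", "nvidia-smi -L; sleep infinity"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1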

ytaoeer avatar Jul 01 '23 12:07 ytaoeer