
nvidia/gpu-operator exposes all GPUs in Kubeflow


The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • [ ] Are you running on an Ubuntu 18.04 node?
  • [x] Are you running Kubernetes v1.13+?
  • [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • [ ] Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • [ ] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?

1. Issue or feature description

Hi, I'm deploying Kubeflow Manifests with nvidia/gpu-operator and running experiments in Jupyter notebooks. However, even though I don't specify a Number of GPUs, which I expect to mean no GPUs are exposed, all GPUs are still visible in the notebooks. I don't know what causes the problem or whether there is a workaround. Do you have any idea?

2. Steps to reproduce the issue

helm install nvidia/gpu-operator \
  --version=v22.9.0 \
  --generate-name \
  --create-namespace \
  --namespace=gpu-operator-resources \
  --set driver.enabled=false

hoangtnm avatar Oct 16 '22 05:10 hoangtnm

@hoangtnm This will happen when NVIDIA_VISIBLE_DEVICES=all environment variable is set in the image you are using (which is true for most of the cuda images). Please refer to this document on ways to prevent using this env without privileged mode.
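
For example, you can confirm that the notebook image sets this variable by checking the container environment (pod and namespace names below are placeholders):

kubectl exec -n <notebook-namespace> <notebook-pod> -- env | grep NVIDIA_VISIBLE_DEVICES
# CUDA base images typically set NVIDIA_VISIBLE_DEVICES=all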

shivamerla avatar Oct 17 '22 05:10 shivamerla

@shivamerla Thank you for your prompt reply. As I understand the document, the following modifications should be made in /etc/nvidia-container-runtime/config.toml:

accept-nvidia-visible-devices-envvar-when-unprivileged = true
accept-nvidia-visible-devices-as-volume-mounts = false

However, when I read another document on page 3 by @klueska, the modifications should be:

accept-nvidia-visible-devices-envvar-when-unprivileged = false
accept-nvidia-visible-devices-as-volume-mounts = true

Which one is the right way to solve this problem?

hoangtnm avatar Oct 17 '22 10:10 hoangtnm

@hoangtnm the document that was linked by @shivamerla describes the options and their behaviour. If you want to prevent the use of the NVIDIA_VISIBLE_DEVICES environment variable in unprivileged containers, you will have to set:

accept-nvidia-visible-devices-envvar-when-unprivileged = false

If you then want the device plugin to use volume mounts (and not the envvar) to specify devices you additionally need to set:

accept-nvidia-visible-devices-as-volume-mounts = true

And ensure that the device plugin is started with the device-list-strategy=volume-mounts option set.
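
So for your use case the relevant section of /etc/nvidia-container-runtime/config.toml ends up with both options set:

accept-nvidia-visible-devices-envvar-when-unprivileged = false
accept-nvidia-visible-devices-as-volume-mounts = true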

elezar avatar Oct 17 '22 11:10 elezar

To make this clear, with the operator the following params need to be set during helm install.

devicePlugin:
  env:
    ...
    - name: DEVICE_LIST_STRATEGY
      value: volume-mounts
toolkit:
  env:
    ...
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"

shivamerla avatar Oct 17 '22 17:10 shivamerla

Hi @elezar @shivamerla, thank you for your prompt replies, they helped me a lot :D However, I just want to confirm the steps, along with how to apply those params during the gpu-operator helm install:

helm repo add nvidia https://nvidia.github.io/gpu-operator \
  && helm repo update \
  && helm install nvidia/gpu-operator \
  --version=v22.9.0 \
  --generate-name \
  --create-namespace \
  --namespace=gpu-operator-resources \
  --set driver.enabled=false \
  --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
  --set devicePlugin.env[0].value="volume-mounts" \
  --set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
  --set toolkit.env[0].value="false" \
  --wait

Is this correct, or am I missing something to make it work? Besides, as I understand it, with this helm install I no longer need to install nvidia-container-runtime and configure ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED on the host; the toolkit daemonset will handle ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED inside the container without any problem, right?

hoangtnm avatar Oct 18 '22 02:10 hoangtnm

I think your understanding is correct. Using the toolkit daemonset should configure the toolkit as required. Note that you would also need to set ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS="true" for the toolkit, to trigger it to consider volume mounts.
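
As a quick sanity check after the reinstall, a pod that does not request any nvidia.com resources should no longer see the GPUs, for example (the image here is just an example, and the exact failure mode depends on the image contents):

kubectl run no-gpu-check --rm -it --restart=Never --image=nvidia/cuda:11.0-base -- nvidia-smi
# expected: nvidia-smi is either not available or reports no devices, instead of listing all GPUs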

elezar avatar Oct 18 '22 08:10 elezar

@elezar For me, it solved the issue with the environment variable; however, when I do a small test with volume mounts enabled (configuration), it runs into errors:

cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "example.com/nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/mig-1g.6gb: 1
EOF
pod/cuda-vectoradd created

Pod logs:

Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]

Without volumes, it worked!

Alwinator avatar Oct 18 '22 08:10 Alwinator

@Alwinator could you please confirm your GPU operator version and values that you have applied to your helm deployment?

elezar avatar Oct 18 '22 08:10 elezar

OpenShift 4.10.35 with Kubernetes 1.23
OS on nodes: Red Hat Enterprise Linux CoreOS 410.84.202209231843-0
CRI-O: 1.23.3-17.rhaos4.10.git016b1ca.el8
GPU Operator version: 22.9.0
Helm version: v3.6.3

Alwinator avatar Oct 18 '22 09:10 Alwinator


Thanks. And the values used to install the operator?

elezar avatar Oct 18 '22 09:10 elezar

The NVIDIA GPU Operator 22.9.0 is installed in the nvidia-gpu-operator namespace, with automatic update approval and the following ClusterPolicy:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: nvidia-dcgm-exporter-custom-config
    enabled: true
    serviceMonitor:
      enabled: true
  driver:
    certConfig:
      name: ''
    enabled: true
    kernelModuleConfig:
      name: ''
    licensingConfig:
      configMapName: ''
      nlsEnabled: false
    repoConfig:
      configMapName: ''
    rollingUpdate:
      maxUnavailable: '1'
    virtualTopology:
      config: ''
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
    env:
      - name: DEVICE_LIST_STRATEGY
        value: volume-mounts
  mig:
    strategy: mixed
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'true'
  nodeStatusExporter:
    enabled: true
  daemonsets: {}
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    env:
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
        value: 'false'
    installDir: /usr/local/nvidia
status:
  namespace: nvidia-gpu-operator
  state: ready

Alwinator avatar Oct 18 '22 10:10 Alwinator

@Alwinator as mentioned in https://github.com/NVIDIA/gpu-operator/issues/421#issuecomment-1281983276 could you ALSO please set the ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS="true" in the toolkit.env and see if this addresses your issue?

What is happening now is that the device plugin is specifying devices as volume mounts, but the NVIDIA Container Runtime Hook is not configured to interpret these as device requests and most likely does not make any modifications to the container.
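
Concretely, the toolkit section of your ClusterPolicy would then look something like:

toolkit:
  enabled: true
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: 'false'
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: 'true'
  installDir: /usr/local/nvidia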

elezar avatar Oct 18 '22 10:10 elezar

@elezar Thank you for your support. I am sorry, I missed this one. With the new variable set, it works seamlessly. Also, thank you for the explanation that makes a lot of sense. :)

Alwinator avatar Oct 18 '22 11:10 Alwinator

Hi @elezar @shivamerla, I tried to install gpu-operator with the following params but it doesn't work. Do you have any idea?

➜  manifests git:(f038f81) helm repo add nvidia https://nvidia.github.io/gpu-operator \
  && helm repo update \
  && helm install nvidia/gpu-operator \
  --version=v22.9.0 \
  --generate-name \
  --create-namespace \
  --namespace=gpu-operator-resources \
  --set driver.enabled=false \
  --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
  --set devicePlugin.env[0].value="volume-mounts" \
  --set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
  --set toolkit.env[0].value="false" \
  --wait
"nvidia" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
zsh: no matches found: devicePlugin.env[0].name=DEVICE_LIST_STRATEGY

hoangtnm avatar Oct 19 '22 04:10 hoangtnm

@hoangtnm Can you try the command below?

helm install nvidia/gpu-operator \
  --version=v22.9.0 \
  --generate-name \
  --create-namespace \
  --namespace=gpu-operator-resources \
  --set driver.enabled=false \
  --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
  --set devicePlugin.env[0].value="volume-mounts" \
  --set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
  --set-string toolkit.env[0].value="false" \
  --set toolkit.env[1].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS \
  --set-string toolkit.env[1].value="true" \
  --wait

shivamerla avatar Oct 19 '22 04:10 shivamerla

Hi @shivamerla, I fixed the problem by using bash instead of zsh. It seems that zsh treats the square brackets in the --set arguments as glob patterns, so it refuses to pass them through as-is.
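
Alternatively, quoting the --set arguments keeps zsh from expanding the square brackets as glob patterns, e.g.:

  --set 'devicePlugin.env[0].name=DEVICE_LIST_STRATEGY' \
  --set 'devicePlugin.env[0].value=volume-mounts' \
  ...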

hoangtnm avatar Oct 19 '22 04:10 hoangtnm

@shivamerla @elezar I have a question about the params for k8s-device-plugin, which is bundled inside gpu-operator. As I read in the Preventing unprivileged access to GPUs in Kubernetes document on page 4:

helm install \
    --version=0.7.0-rc.7 \
    --generate-name \
    --set securityContext.privileged=true \
    --set deviceListStrategy=volume-mounts \
    nvdp/nvidia-device-plugin

It sets securityContext.privileged=true along with deviceListStrategy=volume-mounts for the k8s-device-plugin v0.7.0-rc.7. Therefore, I wonder whether the securityContext setting is still necessary with the latest gpu-operator, or whether it is configured as privileged by default?

hoangtnm avatar Oct 19 '22 15:10 hoangtnm

@hoangtnm yes, with gpu-operator, device-plugin is always deployed with privileged mode.
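
You can verify this on the cluster by checking the security context of the plugin daemonset (the daemonset name below is the default used by the operator; it may differ across versions):

kubectl get ds -n gpu-operator-resources nvidia-device-plugin-daemonset -o jsonpath='{.spec.template.spec.containers[0].securityContext}'
# expected to include "privileged":true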

shivamerla avatar Oct 19 '22 15:10 shivamerla

@shivamerla thank you for your prompt clarification :D

hoangtnm avatar Oct 19 '22 15:10 hoangtnm

Closing this as original issue has been resolved.

cdesiniotis avatar Nov 29 '22 01:11 cdesiniotis

@shivamerla hi, I meet exactly the same problem: a pod limited by "--limits=nvidia.com/mig-1g.5gb=1" can get all GPUs on that node. Then:

  1. I modified /etc/nvidia-container-runtime/config.toml (screenshot omitted)
  2. installed with helm using these params:
     helm install nvidia/gpu-operator \
       --generate-name \
       --create-namespace \
       --namespace=gpu-operator-resources \
       --set driver.enabled=false \
       --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
       --set devicePlugin.env[0].value="volume-mounts" \
       --set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
       --set-string toolkit.env[0].value="false" \
       --set toolkit.env[1].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS \
       --set-string toolkit.env[1].value="true" \
       --set mig.strategy=mixed \
       --wait
  3. ran a pod with:
     kubectl run -it --rm --image=nvidia/cuda:11.0-base --restart=Never --limits=nvidia.com/mig-1g.5gb=1 mig-none-example -- bash -c "nvidia-smi -L;sleep infinity"
     It seems that I can't use nvidia-smi in that pod (screenshot omitted).

ytaoeer avatar Jun 22 '23 10:06 ytaoeer

It seems to be even more confusing: when I use "kubectl run", I get the result from step 3, but when I use "kubectl apply -f test.yaml", everything is OK.
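
For reference, a minimal test.yaml for this kind of check could look like the following (adapted from the cuda-vectoradd example earlier in the thread; names are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: mig-test
spec:
  restartPolicy: Never
  containers:
  - name: mig-test
    image: nvidia/cuda:11.0-base
    command: ["bash", "-c", "nvidia-smi -L; sleep infinity"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1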

ytaoeer avatar Jul 01 '23 12:07 ytaoeer