gpu-operator
nvidia/gpu-operator exposes all GPUs in Kubeflow
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node?
- [x] Are you running Kubernetes v1.13+?
- [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [ ] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes?
- [ ] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
1. Issue or feature description
Hi, I'm deploying Kubeflow Manifests with nvidia/gpu-operator and running experiments in Jupyter notebooks. However, even though I don't specify the number of GPUs, which I expect to mean no GPUs are exposed, all GPUs are still exposed in the notebooks. I don't know what causes this or whether there is a workaround. Do you have any idea?
2. Steps to reproduce the issue
helm install nvidia/gpu-operator \
--version=v22.9.0 \
--generate-name \
--create-namespace \
--namespace=gpu-operator-resources \
--set driver.enabled=false
@hoangtnm This will happen when the NVIDIA_VISIBLE_DEVICES=all environment variable is set in the image you are using (which is true for most of the CUDA images). Please refer to this document on ways to prevent using this env without privileged mode.
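As an aside, a quick per-pod way to confirm (or work around) this behaviour is to override the variable explicitly in the container spec, since env entries in a pod spec take precedence over the image's ENV. A minimal sketch, with a placeholder pod name and the CUDA image used elsewhere in this thread:
apiVersion: v1
kind: Pod
metadata:
  name: no-gpu-test                 # placeholder name
spec:
  containers:
    - name: main
      image: nvidia/cuda:11.0-base  # image with NVIDIA_VISIBLE_DEVICES=all baked in
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "none"             # overrides the image's "all"; no GPU devices are injected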
@shivamerla Thank you for your prompt reply. As I understand your document, the following modifications should be made to /etc/nvidia-container-runtime/config.toml:
accept-nvidia-visible-devices-envvar-when-unprivileged = true
accept-nvidia-visible-devices-as-volume-mounts = false
However, when I read another document on page 3 by @klueska, the modifications should be:
accept-nvidia-visible-devices-envvar-when-unprivileged = false
accept-nvidia-visible-devices-as-volume-mounts = true
Therefore, which one is the right way for this problem?
@hoangtnm the document that was linked by @shivamerla describes the options and their behaviour. If you want to prevent the use of the NVIDIA_VISIBLE_DEVICES environment variable in unprivileged containers, you will have to set:
accept-nvidia-visible-devices-envvar-when-unprivileged = false
If you then want the device plugin to use volume mounts (and not the envvar) to specify devices you additionally need to set:
accept-nvidia-visible-devices-as-volume-mounts = true
And ensure that the device plugin is started with the device-list-strategy=volume-mounts option set.
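In other words, with both options applied, the relevant lines in /etc/nvidia-container-runtime/config.toml would look roughly like this (a sketch; all other keys in the file are left untouched):
# Ignore NVIDIA_VISIBLE_DEVICES coming from unprivileged containers...
accept-nvidia-visible-devices-envvar-when-unprivileged = false
# ...and take the device list from the volume mounts created by the device plugin instead
accept-nvidia-visible-devices-as-volume-mounts = true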
To make this clear, with the operator the following params need to be set during helm install:
devicePlugin:
  env:
    ...
    - name: DEVICE_LIST_STRATEGY
      value: volume-mounts
toolkit:
  env:
    ...
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
Hi @elezar @shivamerla, thank you for your prompt replies, they helped me a lot :D However, I just want to confirm the steps and how to apply those params during the gpu-operator helm install:
helm repo add nvidia https://nvidia.github.io/gpu-operator \
&& helm repo update \
&& helm install nvidia/gpu-operator \
--version=v22.9.0 \
--generate-name \
--create-namespace \
--namespace=gpu-operator-resources \
--set driver.enabled=false \
--set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
--set devicePlugin.env[0].value="volume-mounts" \
--set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
--set toolkit.env[0].value="false" \
--wait
Is this correct, or am I missing something to make it work? Besides, as I understand it, with this helm install I no longer need to install nvidia-container-runtime and configure ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED on the host, and the toolkit daemonset will handle ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED inside the container without any problem, right?
I think your understanding is correct. Using the toolkit daemonset should configure the toolkit as required. Note that you would also need to set ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS="true" for the toolkit so that it considers volume mounts.
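Put together, the toolkit section of the chart values would then carry both variables (a sketch based on the snippets above):
toolkit:
  env:
    # do not honour NVIDIA_VISIBLE_DEVICES from unprivileged containers
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    # take the device list from the volume mounts created by the device plugin
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"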
@elezar For me, it solved the issue with the environment variable, however, if I do a small test with volume mounts enabled (configuration), it runs into errors:
cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: "example.com/nvidia/samples:vectoradd-cuda11.2.1"
      resources:
        limits:
          nvidia.com/mig-1g.6gb: 1
EOF
pod/cuda-vectoradd created
Pod logs:
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]
Without the volume-mounts configuration, it worked!
@Alwinator could you please confirm your GPU operator version and values that you have applied to your helm deployment?
OpenShift 4.10.35 with Kubernetes 1.23 OS on nodes: Red Hat Enterprise Linux CoreOS 410.84.202209231843-0 CRI-O 1.23.3-17.rhaos4.10.git016b1ca.el8 GPU Operator version: 22.9.0 Helm version: v3.6.3
Thanks. And the values used to install the operator?
The NVIDIA GPU Operator 22.9.0 is installed in the nvidia-gpu-operator namespace, with automatic update approval and the following cluster policy:
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: nvidia-dcgm-exporter-custom-config
    enabled: true
    serviceMonitor:
      enabled: true
  driver:
    certConfig:
      name: ''
    enabled: true
    kernelModuleConfig:
      name: ''
    licensingConfig:
      configMapName: ''
      nlsEnabled: false
    repoConfig:
      configMapName: ''
    rollingUpdate:
      maxUnavailable: '1'
    virtualTopology:
      config: ''
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
    env:
      - name: DEVICE_LIST_STRATEGY
        value: volume-mounts
  mig:
    strategy: mixed
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'true'
  nodeStatusExporter:
    enabled: true
  daemonsets: {}
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    env:
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
        value: 'false'
    installDir: /usr/local/nvidia
status:
  namespace: nvidia-gpu-operator
  state: ready
@Alwinator as mentioned in https://github.com/NVIDIA/gpu-operator/issues/421#issuecomment-1281983276 could you ALSO please set the ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS="true" in the toolkit.env and see if this addresses your issue?
What is happening now is that the device plugin is specifying devices as volume mounts, but the NVIDIA Container Runtime Hook is not configured to interpret these as device requests and most likely does not make any modifications to the container.
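One way to see which side is misconfigured is to look inside the failing pod: when the volume-mounts strategy is active, the requested device IDs show up as entries under /var/run/nvidia-container-devices, the path conventionally used by this strategy. A sketch, assuming the cuda-vectoradd pod from above:
# list the devices the plugin passed to the container as volume mounts
oc exec cuda-vectoradd -- ls /var/run/nvidia-container-devices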
@elezar Thank you for your support. I am sorry, I missed this one. With the new variable set, it works seamlessly. Also, thank you for the explanation that makes a lot of sense. :)
Hi @elezar @shivamerla, I tried to install gpu-operator with the following params but it doesn't work. Do you have any idea?
➜ manifests git:(f038f81) helm repo add nvidia https://nvidia.github.io/gpu-operator \
&& helm repo update \
&& helm install nvidia/gpu-operator \
--version=v22.9.0 \
--generate-name \
--create-namespace \
--namespace=gpu-operator-resources \
--set driver.enabled=false \
--set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
--set devicePlugin.env[0].value="volume-mounts" \
--set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
--set toolkit.env[0].value="false" \
--wait
"nvidia" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
zsh: no matches found: devicePlugin.env[0].name=DEVICE_LIST_STRATEGY
@hoangtnm Can you try the command below?
helm install nvidia/gpu-operator \
--version=v22.9.0 \
--generate-name \
--create-namespace \
--namespace=gpu-operator-resources \
--set driver.enabled=false \
--set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
--set devicePlugin.env[0].value="volume-mounts" \
--set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
--set-string toolkit.env[0].value="false" \
--set toolkit.env[1].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS \
--set-string toolkit.env[1].value="true" \
--wait
Hi @shivamerla, I fixed the problem by using bash instead of zsh. It seems that zsh cannot parse such arguments properly.
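For completeness: zsh treats the [0] index as a glob pattern, so the same command also works in zsh if the indexed --set arguments are quoted, e.g.:
# quoting prevents zsh from glob-expanding the [0] index
--set 'devicePlugin.env[0].name=DEVICE_LIST_STRATEGY' \
--set 'devicePlugin.env[0].value=volume-mounts' \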
@shivamerla @elezar I have a question about the params for k8s-device-plugin, which is bundled inside gpu-operator. Reading page 4 of the Preventing unprivileged access to GPUs in Kubernetes document:
helm install \
--version=0.7.0-rc.7 \
--generate-name \
--set securityContext.privileged=true \
--set deviceListStrategy=volume-mounts \
nvdp/nvidia-device-plugin
It sets securityContext.privileged=true along with deviceListStrategy=volume-mounts for k8s-device-plugin v0.7.0-rc.7. Therefore, I wonder whether the securityContext is still necessary in the latest gpu-operator, or whether it is set to true by default?
@hoangtnm Yes, with the gpu-operator the device plugin is always deployed in privileged mode.
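If you want to double-check this on a running cluster, something like the following should print true (a sketch; the daemonset name and namespace depend on your install):
kubectl get ds -n gpu-operator-resources nvidia-device-plugin-daemonset \
  -o jsonpath='{.spec.template.spec.containers[0].securityContext.privileged}'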
@shivamerla thank you for your prompt clarification :D
Closing this as original issue has been resolved.
@shivamerla Hi, I'm hitting exactly the same problem: a pod limited by --limits=nvidia.com/mig-1g.5gb=1 can see all GPUs on that node. Then:
- I modified /etc/nvidia-container-runtime/config.toml
- I installed with helm using these params:
helm install nvidia/gpu-operator \
--generate-name --create-namespace --namespace=gpu-operator-resources --set driver.enabled=false \
--set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
--set devicePlugin.env[0].value="volume-mounts" \
--set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
--set-string toolkit.env[0].value="false" \
--set toolkit.env[1].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS \
--set-string toolkit.env[1].value="true" \
--set mig.strategy=mixed --wait
- I ran the pod:
kubectl run -it --rm --image=nvidia/cuda:11.0-base --restart=Never --limits=nvidia.com/mig-1g.5gb=1 mig-none-example -- bash -c "nvidia-smi -L;sleep infinity"
It seems that I can't use nvidia-smi in that pod.
It gets more confusing: when I use kubectl run, I get the result from step 3, but when I use kubectl apply -f test.yaml, everything works fine.
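(For comparison, a test.yaml equivalent to the kubectl run invocation above would presumably look something like the sketch below; the pod and container names are assumptions. It is also worth confirming with kubectl get pod mig-none-example -o yaml that the pod created by kubectl run actually carries the nvidia.com/mig-1g.5gb limit, since the --limits flag of kubectl run is deprecated in newer kubectl releases.)
apiVersion: v1
kind: Pod
metadata:
  name: mig-none-example          # assumed name, matching the kubectl run example
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.0-base
      command: ["bash", "-c", "nvidia-smi -L; sleep infinity"]
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1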