Permissions issues: `initialization error: nvml error: insufficient permissions`
Main issue: unable to use the GPU inside minikube due to permission issues.
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
>> uname -a
Linux xxx 6.5.0-25-generic #25~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Feb 20 16:09:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
- Kernel Version: 6.5.0-25-generic
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Docker
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): minikube v1.32.0
>> kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
    server: https://kubernetes.docker.internal:6443
  name: docker-desktop
- cluster:
    certificate-authority: /home/leo/.minikube/ca.crt
    extensions:
    - extension:
        last-update: Mon, 11 Mar 2024 13:40:51 CET
        provider: minikube.sigs.k8s.io
        version: v1.32.0
      name: cluster_info
    server: https://192.168.49.2:8443
  name: minikube
contexts:
- context:
    cluster: docker-desktop
    user: docker-desktop
  name: docker-desktop
- context:
    cluster: minikube
    extensions:
    - extension:
        last-update: Mon, 11 Mar 2024 13:40:51 CET
        provider: minikube.sigs.k8s.io
        version: v1.32.0
      name: context_info
    namespace: default
    user: minikube
  name: minikube
current-context: minikube
kind: Config
preferences: {}
users:
- name: docker-desktop
  user:
    client-certificate-data: DATA+OMITTED
    client-key-data: DATA+OMITTED
- name: minikube
  user:
    client-certificate: /home/leo/.minikube/profiles/minikube/client.crt
    client-key: /home/leo/.minikube/profiles/minikube/client.key
- GPU Operator Version: v23.9.2 (from helm chart gpu-operator-v23.9.2)
Host NVIDIA driver, from nvidia-smi on the host:
>> nvidia-smi
Mon Mar 11 13:55:34 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3070 ... Off | 00000000:01:00.0 Off | N/A |
| N/A 51C P8 13W / 80W | 1353MiB / 8192MiB | 13% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3888 G /usr/lib/xorg/Xorg 416MiB |
| 0 N/A N/A 4220 G /usr/bin/gnome-shell 113MiB |
| 0 N/A N/A 7233 G ...irefox/3941/usr/lib/firefox/firefox 476MiB |
| 0 N/A N/A 8787 G ...irefox/3941/usr/lib/firefox/firefox 151MiB |
| 0 N/A N/A 9794 G ...irefox/3941/usr/lib/firefox/firefox 41MiB |
| 0 N/A N/A 31467 G ...sion,SpareRendererForSitePerProcess 71MiB |
| 0 N/A N/A 116653 G ...,WinRetrieveSuggestionsOnlyOnDemand 31MiB |
| 0 N/A N/A 132611 C+G warp-terminal 20MiB |
+---------------------------------------------------------------------------------------+
2. Issue or feature description
Using minikube, Kubernetes, Helm, and the GPU Operator, the nvidia-operator-validator pod fails with:
Error: failed to start container "toolkit-validation": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: nvml error: insufficient permissions: unknown
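Since NVML needs read/write access to the /dev/nvidia* device nodes, a quick first check (a sketch, assuming the default docker driver for minikube; the exact list of nodes is an assumption for a typical single-GPU setup) is to compare what the host and the minikube node see:
# on the host
ls -l /dev/nvidiactl /dev/nvidia0 /dev/nvidia-uvm
# inside the minikube node (itself a Docker container on the host)
minikube ssh -- ls -l /dev/nvidiactl /dev/nvidia0 /dev/nvidia-uvm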
3. Steps to reproduce the issue
I think something is broken in my permission/user setup and I am running out of ideas on how to resolve it.
4. Information to attach (optional if deemed irrelevant)
- [x] kubernetes pods status:
kubectl get pods -n OPERATOR_NAMESPACE
kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-qmktd 0/1 Init:0/1 0 9m28s
gpu-operator-574c687b59-pcjwr 1/1 Running 0 10m
gpu-operator-node-feature-discovery-gc-7cc7ccfff8-9vvk8 1/1 Running 0 10m
gpu-operator-node-feature-discovery-master-d8597d549-qqkpv 1/1 Running 0 10m
gpu-operator-node-feature-discovery-worker-xcwnx 1/1 Running 0 10m
nvidia-container-toolkit-daemonset-r8ktc 1/1 Running 0 9m28s
nvidia-dcgm-exporter-mhxx4 0/1 Init:0/1 0 9m28s
nvidia-device-plugin-daemonset-v79cd 0/1 Init:0/1 0 9m28s
nvidia-operator-validator-ptj47 0/1 Init:CrashLoopBackOff 6 (3m28s ago) 9m28s
- [x] kubernetes daemonset status:
kubectl get ds -n OPERATOR_NAMESPACE
kubectl get ds -n gpu-operator
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 1 1 0 1 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 9m57s
gpu-operator-node-feature-discovery-worker 1 1 1 1 1 <none> 10m
nvidia-container-toolkit-daemonset 1 1 1 1 1 nvidia.com/gpu.deploy.container-toolkit=true 9m57s
nvidia-dcgm-exporter 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm-exporter=true 9m57s
nvidia-device-plugin-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.device-plugin=true 9m57s
nvidia-driver-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.driver=true 9m57s
nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 9m57s
nvidia-operator-validator 1 1 0 1 0 nvidia.com/gpu.deploy.operator-validator=true 9m57s
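For completeness, the nvidia.com/gpu.deploy.* node labels that drive these daemonset node selectors can be listed on the node; a quick way (sketch) is:
kubectl describe node minikube | grep nvidia.com/gpu.deploy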
- [x] If a pod/ds is in an error state or pending state
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl describe pod -n gpu-operator nvidia-operator-validator-ptj47
Name: nvidia-operator-validator-ptj47
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-operator-validator
Node: minikube/192.168.49.2
Start Time: Mon, 11 Mar 2024 13:46:54 +0100
Labels: app=nvidia-operator-validator
app.kubernetes.io/managed-by=gpu-operator
app.kubernetes.io/part-of=gpu-operator
controller-revision-hash=74c7484fb6
helm.sh/chart=gpu-operator-v23.9.2
pod-template-generation=1
Annotations: <none>
Status: Pending
IP: 10.244.0.16
IPs:
IP: 10.244.0.16
Controlled By: DaemonSet/nvidia-operator-validator
Init Containers:
driver-validation:
Container ID: docker://871f1cc1d632838d5e168db3bfe66f10cba3c84c070366cd36c654955e891f6f
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
Image ID: docker-pullable://nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:9aefef081c3ab1123556374d2b15d0429f3990af2fbaccc3c9827801e1042703
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 11 Mar 2024 13:47:03 +0100
Finished: Mon, 11 Mar 2024 13:47:03 +0100
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: driver
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-path (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
toolkit-validation:
Container ID: docker://71762f7b569cd2ceba213aa845fe6c2598cec3889dfdf0902f9ef68f273cf622
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
Image ID: docker-pullable://nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:9aefef081c3ab1123556374d2b15d0429f3990af2fbaccc3c9827801e1042703
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: ContainerCannotRun
Message: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
Exit Code: 128
Started: Mon, 11 Mar 2024 13:52:54 +0100
Finished: Mon, 11 Mar 2024 13:52:54 +0100
Ready: False
Restart Count: 6
Environment:
NVIDIA_VISIBLE_DEVICES: all
WITH_WAIT: false
COMPONENT: toolkit
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
cuda-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
WITH_WAIT: false
COMPONENT: cuda
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
plugin-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
COMPONENT: plugin
WITH_WAIT: false
WITH_WORKLOAD: false
MIG_STRATEGY: single
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
Containers:
nvidia-operator-validator:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
echo all validations are successful; sleep infinity
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-path:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType:
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
host-dev-char:
Type: HostPath (bare host directory volume)
Path: /dev/char
HostPathType:
kube-api-access-cjskm:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.operator-validator=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 10m default-scheduler Successfully assigned gpu-operator/nvidia-operator-validator-ptj47 to minikube
Normal Pulling 10m kubelet Pulling image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2"
Normal Pulled 10m kubelet Successfully pulled image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2" in 2.116s (7.99s including waiting)
Normal Created 10m kubelet Created container driver-validation
Normal Started 10m kubelet Started container driver-validation
Warning Failed 10m (x3 over 10m) kubelet Error: failed to start container "toolkit-validation": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: insufficient permissions: unknown
Normal Pulled 8m59s (x5 over 10m) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2" already present on machine
Normal Created 8m58s (x5 over 10m) kubelet Created container toolkit-validation
Warning Failed 8m58s (x2 over 9m49s) kubelet Error: failed to start container "toolkit-validation": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
Warning BackOff 32s (x46 over 10m) kubelet Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-ptj47_gpu-operator(7c2a5005-4339-4674-82c7-244051860212)
- [x] If a pod/ds is in an error state or pending state
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
kubectl logs -n gpu-operator nvidia-operator-validator-ptj47
Defaulted container "nvidia-operator-validator" out of: nvidia-operator-validator, driver-validation (init), toolkit-validation (init), cuda-validation (init), plugin-validation (init)
Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-ptj47" is waiting to start: PodInitializing
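Since the pod is stuck in init, the logs of the failing init container have to be requested explicitly (container name taken from the describe output above); note it may print nothing, because the container never actually starts:
kubectl logs -n gpu-operator nvidia-operator-validator-ptj47 -c toolkit-validation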
- [x] Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
Not able to: there is no driver container to exec into, since the driver is pre-installed on the host and the nvidia-driver-daemonset has 0 desired pods (see the daemonset status above).
- [x] containerd logs
journalctl -u containerd > containerd.log
The log is huge and does not seem to contain anything relevant; I can post it later if needed.
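Note the cluster is actually using the Docker runtime (the container IDs above are docker://...), so the relevant entries should be in the minikube node's dockerd log rather than in containerd's. A filtered view is probably enough instead of the full log; the unit name is an assumption based on the default kicbase image:
minikube ssh -- "sudo journalctl -u docker --no-pager | grep -i nvidia | tail -n 100"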
Extra:
On the host:
ls -l /dev/nvidia*
crw-rw---- 1 root vglusers 195, 0 Mar 11 11:24 /dev/nvidia0
crw-rw---- 1 root vglusers 195, 255 Mar 11 11:24 /dev/nvidiactl
crw-rw---- 1 root vglusers 195, 254 Mar 11 11:24 /dev/nvidia-modeset
crw-rw-rw- 1 root root 508, 0 Mar 11 11:24 /dev/nvidia-uvm
crw-rw-rw- 1 root root 508, 1 Mar 11 11:24 /dev/nvidia-uvm-tools
/dev/nvidia-caps:
total 0
cr-------- 1 root root 511, 1 Mar 11 11:30 nvidia-cap1
cr--r--r-- 1 root root 511, 2 Mar 11 11:30 nvidia-cap2
getent group vglusers
vglusers:x:1002:leo,root
minikube ssh
docker@minikube:~$ ls -l /dev/nvidia*
crw-rw---- 1 root 1002 195, 254 Mar 11 12:40 /dev/nvidia-modeset
crw-rw-rw- 1 root root 508, 0 Mar 11 10:24 /dev/nvidia-uvm
crw-rw-rw- 1 root root 508, 1 Mar 11 10:24 /dev/nvidia-uvm-tools
crw-rw---- 1 root 1002 195, 0 Mar 11 10:24 /dev/nvidia0
crw-rw---- 1 root 1002 195, 255 Mar 11 10:24 /dev/nvidiactl
/dev/nvidia-caps:
total 0
cr-------- 1 root root 511, 1 Mar 11 12:40 nvidia-cap1
cr--r--r-- 1 root root 511, 2 Mar 11 12:40 nvidia-cap2
docker@minikube:~$
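Both listings show /dev/nvidia0, /dev/nvidiactl and /dev/nvidia-modeset restricted to mode 0660 and group vglusers (gid 1002), which VirtualGL's vglserver_config typically sets up, whereas the NVIDIA default is 0666. A possible test, not a permanent fix (the modes can be reset by udev/VirtualGL on reboot), is to relax the device nodes and let the validator retry:
# on the host
sudo chmod 0666 /dev/nvidia0 /dev/nvidiactl /dev/nvidia-modeset
# inside the minikube node, whose /dev is populated separately
minikube ssh -- sudo chmod 0666 /dev/nvidia0 /dev/nvidiactl /dev/nvidia-modeset
# recreate the validator pod so it retries immediately
kubectl delete pod -n gpu-operator -l app=nvidia-operator-validator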
More information:
docker run -it --rm --privileged -e DISPLAY=$DISPLAY --runtime=nvidia --gpus all -v /tmp/.X11-unix:/tmp/.X11-unix nvidia/cuda:11.6.2-base-ubuntu20.04 bash
Unable to find image 'nvidia/cuda:11.6.2-base-ubuntu20.04' locally
11.6.2-base-ubuntu20.04: Pulling from nvidia/cuda
96d54c3075c9: Already exists
a3d20efe6db8: Pull complete
bfdf8ce43b67: Pull complete
ad14f66bfcf9: Pull complete
1056ff735c59: Pull complete
Digest: sha256:a0dd581afdbf82ea9887dd077aebf9723aba58b51ae89acb4c58b8705b74179b
Status: Downloaded newer image for nvidia/cuda:11.6.2-base-ubuntu20.04
root@3e5d5a4aa5f2:/# nvidia-smi
Fri Mar 15 08:40:49 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3070 ... Off | 00000000:01:00.0 Off | N/A |
| N/A 55C P0 33W / 80W | 2624MiB / 8192MiB | 55% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
root@3e5d5a4aa5f2:/#
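For contrast, the direct docker run above is executed as root with --privileged against the host's Docker, so it does not go through the same permission checks that fail for the unprivileged validator inside the minikube node. Since minikube v1.32 added GPU passthrough for the docker driver, it may also be worth double-checking how the node was created; a sketch of the documented invocation (flags per the minikube docs, not verified against this setup):
minikube start --driver=docker --container-runtime=docker --gpus=all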