Issue with the nvidia-device-plugin-daemonset: error mounting /run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1
1. Quick Debug Information
- OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): Ubuntu 22.04
- Kernel Version: 5.15.0-112-generic
- Container Runtime Type/Version (e.g. containerd, CRI-O, Docker): containerd
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s
- GPU Operator Version: v24.3.0
2. Issue or feature description
Both nvidia-device-plugin-daemonset pods are stuck in CrashLoopBackOff, and the nvidia-operator-validator pods fail with Init:CrashLoopBackOff. The nvidia-device-plugin container never starts: containerd fails to create the task because bind-mounting /run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1 into the container rootfs fails with "not a directory" (see the describe output and logs below). Expected behavior: the device plugin pods start and report Running like the other GPU Operator components.
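For context, the failing bind mount is one of the driver libraries injected into containers that use the nvidia runtime class. A minimal first check (a sketch, assuming shell access to an affected GPU node; the path is taken verbatim from the error message) is to look at what actually exists at the mount source under the driver-container root:

```sh
# Sketch: inspect the source of the failing bind mount on an affected GPU node.
# /run/nvidia/driver is the driver container's root exported by the driver daemonset.
ls -l /run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so*
stat /run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1
```

If that path is missing, or is a directory rather than a regular file, that would be consistent with the "not a directory" error below.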
3. Steps to reproduce the issue
Install GPU Operator v24.3.0 (Helm chart gpu-operator-v24.3.0) on the Ubuntu 22.04 / containerd cluster described above with MIG enabled; the nvidia-device-plugin-daemonset and nvidia-operator-validator pods go into CrashLoopBackOff shortly after rollout.
4. Information to attach (optional if deemed irrelevant)
- [x] kubernetes pods status: kubectl get pods -n operators
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-x7lhm 1/1 Running 0 14m
gpu-feature-discovery-zzc4q 1/1 Running 0 14m
gpu-operator-7bdd6886bf-gnxqv 1/1 Running 0 42h
gpu-operator-node-feature-discovery-gc-79d6d968bb-gjzj4 1/1 Running 0 42h
gpu-operator-node-feature-discovery-master-974477bb5-tg95d 1/1 Running 0 42h
gpu-operator-node-feature-discovery-worker-9qxqp 1/1 Running 0 42h
gpu-operator-node-feature-discovery-worker-chknw 1/1 Running 1 (41h ago) 42h
nvidia-container-toolkit-daemonset-b94jd 1/1 Running 0 41h
nvidia-container-toolkit-daemonset-sztgn 1/1 Running 0 41h
nvidia-cuda-validator-c2llz 0/1 Completed 0 41h
nvidia-cuda-validator-zflt9 0/1 Completed 0 42h
nvidia-dcgm-exporter-sbznq 1/1 Running 0 42h
nvidia-dcgm-exporter-v66wt 1/1 Running 0 41h
nvidia-device-plugin-daemonset-26gpb 0/1 CrashLoopBackOff 287 (3m32s ago) 24h
nvidia-device-plugin-daemonset-q7jwl 0/1 CrashLoopBackOff 288 (89s ago) 24h
nvidia-driver-daemonset-6v59r 1/1 Running 1 (41h ago) 42h
nvidia-driver-daemonset-8fmhn 1/1 Running 0 42h
nvidia-mig-manager-4jgzz 1/1 Running 0 41h
nvidia-mig-manager-8t8cp 1/1 Running 0 42h
nvidia-operator-validator-cbwz9 0/1 Init:CrashLoopBackOff 334 (49s ago) 42h
nvidia-operator-validator-cnqqz 0/1 Init:CrashLoopBackOff 333 (4m7s ago) 41h
- [x] kubernetes daemonset status: kubectl get ds -n operators
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 2 2 2 2 2 nvidia.com/gpu.deploy.gpu-feature-discovery=true 42h
gpu-operator-node-feature-discovery-worker 2 2 2 2 2 accelerator=nvidia-a100-pcie-80gb 42h
nvidia-container-toolkit-daemonset 2 2 2 2 2 nvidia.com/gpu.deploy.container-toolkit=true 42h
nvidia-dcgm-exporter 2 2 2 2 2 nvidia.com/gpu.deploy.dcgm-exporter=true 42h
nvidia-device-plugin-daemonset 2 2 0 2 0 nvidia.com/gpu.deploy.device-plugin=true 42h
nvidia-device-plugin-mps-control-daemon 0 0 0 0 0 nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true 42h
nvidia-driver-daemonset 2 2 2 2 2 nvidia.com/gpu.deploy.driver=true 42h
nvidia-mig-manager 2 2 2 2 2 nvidia.com/gpu.deploy.mig-manager=true 42h
nvidia-operator-validator 2 2 0 2 0 nvidia.com/gpu.deploy.operator-validator=true 42h
- [x] Describe output for the pod in error state: kubectl describe pod nvidia-device-plugin-daemonset-26gpb -n operators
Name: nvidia-device-plugin-daemonset-26gpb
Namespace: operators
Priority: 2000001000
Priority Class Name: system-node-critical
Runtime Class Name: nvidia
Service Account: nvidia-device-plugin
Node: 10.23.29.206/10.23.29.206
Start Time: Thu, 20 Jun 2024 13:00:51 +0300
Labels: app=nvidia-device-plugin-daemonset
app.kubernetes.io/managed-by=gpu-operator
controller-revision-hash=67b7bb9bb
helm.sh/chart=gpu-operator-v24.3.0
pod-template-generation=22
Annotations:
Status:           Running
IP:               172.31.4.172
IPs:
  IP:             172.31.4.172
Controlled By:    DaemonSet/nvidia-device-plugin-daemonset
Init Containers:
  toolkit-validation:
    Container ID:  containerd://e740b47deae826518e8a175ac1cd6da46e357266157087b45dfee218afcb3809
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.3.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:2edc1d4ed555830e70010c82558936198f5faa86fc29ecf5698219145102cfcc
    Port:
    Host Port:
    Command:
      sh
      -c
    Args:
      until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 20 Jun 2024 13:00:52 +0300
      Finished:     Thu, 20 Jun 2024 13:00:52 +0300
    Ready:          True
    Restart Count:  0
    Environment:
    Mounts:
      /run/nvidia from run-nvidia (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bn4jn (ro)
Containers:
  nvidia-device-plugin:
    Container ID:  containerd://62c3b528a6ab14f20d329fe9341efcc675178248b1ed4eadd2ca2038d1a77ad7
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.15.0-ubuntu22.04
    Image ID:      nvcr.io/nvidia/k8s-device-plugin@sha256:1aff0e9f0759758f87cb158d78241472af3a76cdc631f01ab395f997fa80f707
    Port:
    Host Port:
    Command:
      /bin/bash
      -c
    Args:
      /bin/entrypoint.sh
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: Others("other error: container_linux.go:340: starting container process caused "process_linux.go:380: container init caused \"rootfs_linux.go:61: mounting \\\"/run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1\\\" to rootfs \\\"/run/containerd/io.containerd.runtime.v2.task/k8s.io/62c3b528a6ab14f20d329fe9341efcc675178248b1ed4eadd2ca2038d1a77ad7/rootfs\\\" at \\\"/run/containerd/io.containerd.runtime.v2.task/k8s.io/62c3b528a6ab14f20d329fe9341efcc675178248b1ed4eadd2ca2038d1a77ad7/rootfs/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1\\\" caused \\\"not a directory\\\"\"""): unknown
      Exit Code:    128
      Started:      Thu, 01 Jan 1970 03:00:00 +0300
      Finished:     Fri, 21 Jun 2024 13:11:32 +0300
    Ready:          False
    Restart Count:  288
    Environment:
      PASS_DEVICE_SPECS:           true
      FAIL_ON_INIT_ERROR:          true
      DEVICE_LIST_STRATEGY:        envvar
      DEVICE_ID_STRATEGY:          uuid
      NVIDIA_VISIBLE_DEVICES:      all
      NVIDIA_DRIVER_CAPABILITIES:  all
      MPS_ROOT:                    /run/nvidia/mps
      MIG_STRATEGY:                mixed
      NVIDIA_MIG_MONITOR_DEVICES:  all
    Mounts:
      /bin/entrypoint.sh from nvidia-device-plugin-entrypoint (ro,path="entrypoint.sh")
      /dev/shm from mps-shm (rw)
      /host from host-root (ro)
      /mps from mps-root (rw)
      /run/nvidia from run-nvidia (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bn4jn (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  nvidia-device-plugin-entrypoint:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nvidia-device-plugin-entrypoint
    Optional:  false
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  Directory
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  mps-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps
    HostPathType:  DirectoryOrCreate
  mps-shm:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps/shm
    HostPathType:
  kube-api-access-bn4jn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:        BestEffort
Node-Selectors:   nvidia.com/gpu.deploy.device-plugin=true
Tolerations:      gpu=dedicated:NoSchedule
                  node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                  node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                  node.kubernetes.io/not-ready:NoExecute op=Exists
                  node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                  node.kubernetes.io/unreachable:NoExecute op=Exists
                  node.kubernetes.io/unschedulable:NoSchedule op=Exists
                  nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason        Age                   From     Message
  Normal   Pulled        21m (x285 over 24h)   kubelet  Container image "nvcr.io/nvidia/k8s-device-plugin:v0.15.0-ubuntu22.04" already present on machine
  Warning  BackOffStart  98s (x6670 over 24h)  kubelet  Back-off restarting failed container nvidia-device-plugin in pod nvidia-device-plugin-daemonset-26gpb_operators(13f0751a-487d-4d4a-bc69-fa89fef819cf)
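Since the pod runs with Runtime Class Name: nvidia and mounts /var/run/cdi, the mount list for this container is produced by the NVIDIA container toolkit installed by the toolkit daemonset. A sketch of node-side checks, assuming the default toolkit/CDI file locations (they may differ in this setup):

```sh
# Sketch: check the runtime configuration and any generated CDI specs on the node.
grep -A3 -i nvidia /etc/containerd/config.toml              # nvidia runtime wired in by the toolkit
ls -l /etc/cdi /var/run/cdi 2>/dev/null                      # standard CDI spec locations
grep -r libnvidia-egl-gbm /etc/cdi /var/run/cdi 2>/dev/null  # does a spec reference the failing library?
```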
- [x] Logs for the pod in error state: kubectl logs -n operators nvidia-device-plugin-daemonset-26gpb --all-containers
libcontainer: container init failed to exec
container_linux.go:340: starting container process caused "process_linux.go:380: container init caused "rootfs_linux.go:61: mounting \"/run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1\" to rootfs \"/run/containerd/io.containerd.runtime.v2.task/k8s.io/62c3b528a6ab14f20d329fe9341efcc675178248b1ed4eadd2ca2038d1a77ad7/rootfs\" at \"/run/containerd/io.containerd.runtime.v2.task/k8s.io/62c3b528a6ab14f20d329fe9341efcc675178248b1ed4eadd2ca2038d1a77ad7/rootfs/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1\" caused \"not a directory\"""
- [x] Output from running nvidia-smi from the driver container (kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi):
Fri Jun 21 10:18:27 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          On  |   00000000:00:0D.0 Off |                   On |
| N/A   36C    P0             70W /  300W |       1MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                             |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    0   0   0  |                1MiB /  81221MiB  | 98      0 |  7   0    5    1    1 |
|                  |                1MiB / 131072MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
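The GPU is MIG-enabled and the device plugin runs with MIG_STRATEGY=mixed, so it may also be worth confirming that the MIG devices enumerate cleanly from the driver container. A sketch (DRIVER_POD_NAME is the placeholder from the template above):

```sh
# Sketch: list GPUs/MIG devices and GPU instances from the driver container.
kubectl exec -n operators DRIVER_POD_NAME -c nvidia-driver-ctr -- nvidia-smi -L
kubectl exec -n operators DRIVER_POD_NAME -c nvidia-driver-ctr -- nvidia-smi mig -lgi
```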
- [ ] containerd logs
journalctl -u containerd > containerd.log
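The containerd log was not attached; if needed, here is a sketch of how the relevant lines could be narrowed down (the container ID prefix is taken from the describe output above):

```sh
# Sketch: filter the containerd journal for the failing mount / container.
journalctl -u containerd --since "1 hour ago" | grep -i libnvidia-egl-gbm
journalctl -u containerd --since "1 hour ago" | grep 62c3b528a6ab14f2
```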
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]
@adwiza can you provide details on how you are installing gpu-operator and the complete clusterpolicy configuration?
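For reference, a minimal sketch of how that information could be collected, assuming a Helm install with release name gpu-operator in the operators namespace (adjust names as needed):

```sh
# Sketch: capture the install values and the rendered ClusterPolicy.
helm get values gpu-operator -n operators
kubectl get clusterpolicies.nvidia.com -o yaml
```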
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.
This issue has been open for over 90 days without recent updates, and the context may now be outdated. More details were requested in https://github.com/NVIDIA/gpu-operator/issues/778#issuecomment-2224113067 but there has been no update since then. Hence, closing this issue.
If this issue is still relevant with the latest version of the NVIDIA GPU Operator, please feel free to reopen it or open a new one with updated details.