
Issue with the nvidia-device-plugin-daemonset error mounting /run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1

Open · adwiza opened this issue 1 year ago · 1 comment

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

  • OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): Ubuntu 22.04
  • Kernel Version: 5.15.0-112-generic
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s
  • GPU Operator Version: v24.3.0

2. Issue or feature description

Briefly explain the issue in terms of expected behavior and current behavior.

The nvidia-device-plugin-daemonset pods are stuck in CrashLoopBackOff. Expected behavior: the device plugin container starts and advertises GPUs to the kubelet. Current behavior: the container never starts because the runtime fails to mount /run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1 into the container rootfs with a "not a directory" error (see the describe/logs output below). The nvidia-operator-validator pods are also stuck in Init:CrashLoopBackOff.
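A quick way to narrow this down, as a hedged sketch on my side (it assumes shell access to the affected GPU node; the path is taken from the error message): a "not a directory" mount error from the runtime usually means the mount source under the driver container root and the destination inside the image disagree in type, for example the source is a directory or a dangling symlink where a regular file is expected.

# Run on the node with the crashing device-plugin pod.
ls -ld /run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1
stat /run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1
# List the sibling entries to spot duplicated or dangling libnvidia-egl-gbm symlinks.
ls -l /run/nvidia/driver/usr/lib/x86_64-linux-gnu/ | grep egl-gbm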

3. Steps to reproduce the issue

Detailed steps to reproduce the issue.

4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n operators

    NAME                                                          READY   STATUS                  RESTARTS          AGE
    gpu-feature-discovery-x7lhm                                   1/1     Running                 0                 14m
    gpu-feature-discovery-zzc4q                                   1/1     Running                 0                 14m
    gpu-operator-7bdd6886bf-gnxqv                                 1/1     Running                 0                 42h
    gpu-operator-node-feature-discovery-gc-79d6d968bb-gjzj4      1/1     Running                 0                 42h
    gpu-operator-node-feature-discovery-master-974477bb5-tg95d   1/1     Running                 0                 42h
    gpu-operator-node-feature-discovery-worker-9qxqp              1/1     Running                 0                 42h
    gpu-operator-node-feature-discovery-worker-chknw              1/1     Running                 1 (41h ago)       42h
    nvidia-container-toolkit-daemonset-b94jd                      1/1     Running                 0                 41h
    nvidia-container-toolkit-daemonset-sztgn                      1/1     Running                 0                 41h
    nvidia-cuda-validator-c2llz                                   0/1     Completed               0                 41h
    nvidia-cuda-validator-zflt9                                   0/1     Completed               0                 42h
    nvidia-dcgm-exporter-sbznq                                    1/1     Running                 0                 42h
    nvidia-dcgm-exporter-v66wt                                    1/1     Running                 0                 41h
    nvidia-device-plugin-daemonset-26gpb                          0/1     CrashLoopBackOff        287 (3m32s ago)   24h
    nvidia-device-plugin-daemonset-q7jwl                          0/1     CrashLoopBackOff        288 (89s ago)     24h
    nvidia-driver-daemonset-6v59r                                 1/1     Running                 1 (41h ago)       42h
    nvidia-driver-daemonset-8fmhn                                 1/1     Running                 0                 42h
    nvidia-mig-manager-4jgzz                                      1/1     Running                 0                 41h
    nvidia-mig-manager-8t8cp                                      1/1     Running                 0                 42h
    nvidia-operator-validator-cbwz9                               0/1     Init:CrashLoopBackOff   334 (49s ago)     42h
    nvidia-operator-validator-cnqqz                               0/1     Init:CrashLoopBackOff   333 (4m7s ago)    41h
  • kubernetes daemonset status: kubectl get ds -n operators

    NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                          AGE
    gpu-feature-discovery                        2         2         2       2            2           nvidia.com/gpu.deploy.gpu-feature-discovery=true                       42h
    gpu-operator-node-feature-discovery-worker   2         2         2       2            2           accelerator=nvidia-a100-pcie-80gb                                      42h
    nvidia-container-toolkit-daemonset           2         2         2       2            2           nvidia.com/gpu.deploy.container-toolkit=true                           42h
    nvidia-dcgm-exporter                         2         2         2       2            2           nvidia.com/gpu.deploy.dcgm-exporter=true                               42h
    nvidia-device-plugin-daemonset               2         2         0       2            0           nvidia.com/gpu.deploy.device-plugin=true                               42h
    nvidia-device-plugin-mps-control-daemon      0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true   42h
    nvidia-driver-daemonset                      2         2         2       2            2           nvidia.com/gpu.deploy.driver=true                                      42h
    nvidia-mig-manager                           2         2         2       2            2           nvidia.com/gpu.deploy.mig-manager=true                                 42h
    nvidia-operator-validator                    2         2         0       2            0           nvidia.com/gpu.deploy.operator-validator=true                          42h
  • If a pod/ds is in an error state or pending state: kubectl describe pod nvidia-device-plugin-daemonset-26gpb -n operators

    Name:                 nvidia-device-plugin-daemonset-26gpb
    Namespace:            operators
    Priority:             2000001000
    Priority Class Name:  system-node-critical
    Runtime Class Name:   nvidia
    Service Account:      nvidia-device-plugin
    Node:                 10.23.29.206/10.23.29.206
    Start Time:           Thu, 20 Jun 2024 13:00:51 +0300
    Labels:               app=nvidia-device-plugin-daemonset
                          app.kubernetes.io/managed-by=gpu-operator
                          controller-revision-hash=67b7bb9bb
                          helm.sh/chart=gpu-operator-v24.3.0
                          pod-template-generation=22
    Annotations:          <none>
    Status:               Running
    IP:                   172.31.4.172
    IPs:
      IP:  172.31.4.172
    Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
    Init Containers:
      toolkit-validation:
        Container ID:  containerd://e740b47deae826518e8a175ac1cd6da46e357266157087b45dfee218afcb3809
        Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.3.0
        Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:2edc1d4ed555830e70010c82558936198f5faa86fc29ecf5698219145102cfcc
        Port:          <none>
        Host Port:     <none>
        Command:
          sh
          -c
        Args:
          until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done
        State:          Terminated
          Reason:       Completed
          Exit Code:    0
          Started:      Thu, 20 Jun 2024 13:00:52 +0300
          Finished:     Thu, 20 Jun 2024 13:00:52 +0300
        Ready:          True
        Restart Count:  0
        Environment:    <none>
        Mounts:
          /run/nvidia from run-nvidia (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bn4jn (ro)
    Containers:
      nvidia-device-plugin:
        Container ID:  containerd://62c3b528a6ab14f20d329fe9341efcc675178248b1ed4eadd2ca2038d1a77ad7
        Image:         nvcr.io/nvidia/k8s-device-plugin:v0.15.0-ubuntu22.04
        Image ID:      nvcr.io/nvidia/k8s-device-plugin@sha256:1aff0e9f0759758f87cb158d78241472af3a76cdc631f01ab395f997fa80f707
        Port:          <none>
        Host Port:     <none>
        Command:
          /bin/bash
          -c
        Args:
          /bin/entrypoint.sh
        State:          Waiting
          Reason:       CrashLoopBackOff
        Last State:     Terminated
          Reason:       StartError
          Message:      failed to create containerd task: failed to create shim task: Others("other error: container_linux.go:340: starting container process caused "process_linux.go:380: container init caused \"rootfs_linux.go:61: mounting \\\"/run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1\\\" to rootfs \\\"/run/containerd/io.containerd.runtime.v2.task/k8s.io/62c3b528a6ab14f20d329fe9341efcc675178248b1ed4eadd2ca2038d1a77ad7/rootfs\\\" at \\\"/run/containerd/io.containerd.runtime.v2.task/k8s.io/62c3b528a6ab14f20d329fe9341efcc675178248b1ed4eadd2ca2038d1a77ad7/rootfs/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1\\\" caused \\\"not a directory\\\"\"""): unknown
          Exit Code:    128
          Started:      Thu, 01 Jan 1970 03:00:00 +0300
          Finished:     Fri, 21 Jun 2024 13:11:32 +0300
        Ready:          False
        Restart Count:  288
        Environment:
          PASS_DEVICE_SPECS:            true
          FAIL_ON_INIT_ERROR:           true
          DEVICE_LIST_STRATEGY:         envvar
          DEVICE_ID_STRATEGY:           uuid
          NVIDIA_VISIBLE_DEVICES:       all
          NVIDIA_DRIVER_CAPABILITIES:   all
          MPS_ROOT:                     /run/nvidia/mps
          MIG_STRATEGY:                 mixed
          NVIDIA_MIG_MONITOR_DEVICES:   all
        Mounts:
          /bin/entrypoint.sh from nvidia-device-plugin-entrypoint (ro,path="entrypoint.sh")
          /dev/shm from mps-shm (rw)
          /host from host-root (ro)
          /mps from mps-root (rw)
          /run/nvidia from run-nvidia (rw)
          /var/lib/kubelet/device-plugins from device-plugin (rw)
          /var/run/cdi from cdi-root (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bn4jn (ro)
    Conditions:
      Type              Status
      Initialized       True
      Ready             False
      ContainersReady   False
      PodScheduled      True
    Volumes:
      nvidia-device-plugin-entrypoint:
        Type:      ConfigMap (a volume populated by a ConfigMap)
        Name:      nvidia-device-plugin-entrypoint
        Optional:  false
      device-plugin:
        Type:          HostPath (bare host directory volume)
        Path:          /var/lib/kubelet/device-plugins
        HostPathType:
      run-nvidia:
        Type:          HostPath (bare host directory volume)
        Path:          /run/nvidia
        HostPathType:  Directory
      host-root:
        Type:          HostPath (bare host directory volume)
        Path:          /
        HostPathType:
      cdi-root:
        Type:          HostPath (bare host directory volume)
        Path:          /var/run/cdi
        HostPathType:  DirectoryOrCreate
      mps-root:
        Type:          HostPath (bare host directory volume)
        Path:          /run/nvidia/mps
        HostPathType:  DirectoryOrCreate
      mps-shm:
        Type:          HostPath (bare host directory volume)
        Path:          /run/nvidia/mps/shm
        HostPathType:
      kube-api-access-bn4jn:
        Type:                    Projected (a volume that contains injected data from multiple sources)
        TokenExpirationSeconds:  3607
        ConfigMapName:           kube-root-ca.crt
        ConfigMapOptional:       <nil>
        DownwardAPI:             true
    QoS Class:       BestEffort
    Node-Selectors:  nvidia.com/gpu.deploy.device-plugin=true
    Tolerations:     gpu=dedicated:NoSchedule
                     node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                     node.kubernetes.io/not-ready:NoExecute op=Exists
                     node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                     node.kubernetes.io/unreachable:NoExecute op=Exists
                     node.kubernetes.io/unschedulable:NoSchedule op=Exists
                     nvidia.com/gpu:NoSchedule op=Exists
    Events:
      Type     Reason        Age                    From     Message
      Normal   Pulled        21m (x285 over 24h)    kubelet  Container image "nvcr.io/nvidia/k8s-device-plugin:v0.15.0-ubuntu22.04" already present on machine
      Warning  BackOffStart  98s (x6670 over 24h)   kubelet  Back-off restarting failed container nvidia-device-plugin in pod nvidia-device-plugin-daemonset-26gpb_operators(13f0751a-487d-4d4a-bc69-fa89fef819cf)

  • If a pod/ds is in an error state or pending state: kubectl logs -n operators nvidia-device-plugin-daemonset-26gpb --all-containers (a follow-up check is sketched after this list)

    libcontainer: container init failed to exec
    container_linux.go:340: starting container process caused "process_linux.go:380: container init caused "rootfs_linux.go:61: mounting \"/run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1\" to rootfs \"/run/containerd/io.containerd.runtime.v2.task/k8s.io/62c3b528a6ab14f20d329fe9341efcc675178248b1ed4eadd2ca2038d1a77ad7/rootfs\" at \"/run/containerd/io.containerd.runtime.v2.task/k8s.io/62c3b528a6ab14f20d329fe9341efcc675178248b1ed4eadd2ca2038d1a77ad7/rootfs/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1\" caused \"not a directory\"""
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi

    Fri Jun 21 10:18:27 2024
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA A100 80GB PCIe          On  |   00000000:00:0D.0 Off |                   On |
    | N/A   36C    P0             70W /  300W |       1MiB /  81920MiB |      N/A     Default |
    |                                         |                        |              Enabled |
    +-----------------------------------------+------------------------+----------------------+

    +-----------------------------------------------------------------------------------------+
    | MIG devices:                                                                             |
    +------------------+----------------------------------+-----------+-----------------------+
    | GPU  GI  CI  MIG |                     Memory-Usage |        Vol|        Shared         |
    |      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
    |                  |                                  |        ECC|                       |
    |==================+==================================+===========+=======================|
    |  0    0   0   0  |              1MiB /  81221MiB    | 98      0 |  7   0    5    1    1 |
    |                  |              1MiB / 131072MiB    |           |                       |
    +------------------+----------------------------------+-----------+-----------------------+

    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                               |
    |  GPU   GI   CI        PID   Type   Process name                              GPU Memory  |
    |        ID   ID                                                               Usage       |
    |==========================================================================================|
    |  No running processes found                                                              |
    +-----------------------------------------------------------------------------------------+

  • [ ] containerd logs journalctl -u containerd > containerd.log
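As a possible follow-up to the describe/logs output above, hedged checks that compare what the driver container actually exports with what gets mounted. The driver pod name below is one of the pods listed above, so substitute the pod running on the affected node; the /var/run/cdi check is an assumption that only applies if CDI specs are being generated (the device-plugin pod does mount /var/run/cdi, per the description above).

# Inspect the library as seen inside the driver container (adjust the pod name per node).
kubectl exec -n operators nvidia-driver-daemonset-6v59r -c nvidia-driver-ctr -- \
  ls -l /usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1
# If CDI is enabled (assumption), check whether a generated spec references this library.
grep -R "libnvidia-egl-gbm" /var/run/cdi 2>/dev/null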

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

adwiza · Jun 21 '24 10:06

@adwiza can you provide details on how you are installing gpu-operator and the complete clusterpolicy configuration?

cdesiniotis · Jul 11 '24 23:07
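For anyone landing here with the same question, a sketch of commands that could collect what was requested above, assuming a Helm-based install; the release name is a placeholder to replace with your own.

# Identify the Helm release and the values it was installed with (release name is a placeholder).
helm list -A | grep -i gpu-operator
helm get values <gpu-operator-release> -n operators
# Dump the full ClusterPolicy the operator is reconciling.
kubectl get clusterpolicies.nvidia.com -o yaml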

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] · Nov 04 '25 22:11

This issue has been open for over 90 days without recent updates, and the context may now be outdated. More details were requested in https://github.com/NVIDIA/gpu-operator/issues/778#issuecomment-2224113067 but there has been no update since then. Hence, closing this issue.

If this issue is still relevant with the latest version of the NVIDIA GPU Operator, please feel free to reopen it or open a new one with updated details.

rahulait · Nov 14 '25 07:11