CDI spec creation fails with "libcuda.so.535.129.03 not found" in version v0.15.0-rc.2
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): COS
- Kernel Version: Linux 6.1
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): GKE
2. Issue or feature description
I am in the process of adding support for https://github.com/NVIDIA/gpu-operator/issues/659.
With the change below: https://github.com/NVIDIA/gpu-operator/compare/master...Dragoncell:gpu-operator:master-gke
the device plugin works well with version v0.14.5:
helm upgrade -i --create-namespace --namespace gpu-operator noperator deployments/gpu-operator --set driver.enabled=false --set cdi.enabled=true --set cdi.default=true --set operator.runtimeClass=nvidia-cdi --set hostRoot=/ --set driverRoot=/home/kubernetes/bin/nvidia --set devRoot=/ --set operator.repository=gcr.io/jiamingxu-gke-dev --set operator.version=v0418 --set toolkit.installDir=/home/kubernetes/bin/nvidia --set toolkit.repository=gcr.io/jiamingxu-gke-dev --set toolkit.version=v4 --set validator.repository=gcr.io/jiamingxu-gke-dev --set validator.version=v0412_3 --set devicePlugin.version=v0.14.5
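For readability, the same overrides can be sketched as a Helm values file (the values below are taken directly from the command above; the file itself is not part of this setup and its name is only an example):
# values-gke.yaml (sketch; equivalent to the --set flags above)
driver:
  enabled: false
cdi:
  enabled: true
  default: true
operator:
  runtimeClass: nvidia-cdi
  repository: gcr.io/jiamingxu-gke-dev
  version: v0418
hostRoot: /
driverRoot: /home/kubernetes/bin/nvidia
devRoot: /
toolkit:
  installDir: /home/kubernetes/bin/nvidia
  repository: gcr.io/jiamingxu-gke-dev
  version: v4
validator:
  repository: gcr.io/jiamingxu-gke-dev
  version: v0412_3
devicePlugin:
  version: v0.14.5
Such a file would be passed with -f values-gke.yaml instead of the long list of --set flags.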
a) Pods are running
$ kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-rr2x2 1/1 Running 0 4h16m
gpu-operator-66575c8958-sslch 1/1 Running 0 4h16m
noperator-node-feature-discovery-gc-6968c7c64-g7w7r 1/1 Running 0 4h16m
noperator-node-feature-discovery-master-749679f664-dvs48 1/1 Running 0 4h16m
noperator-node-feature-discovery-worker-glhxw 1/1 Running 0 4h16m
nvidia-container-toolkit-daemonset-wvpvx 1/1 Running 0 4h16m
nvidia-cuda-validator-z84ks 0/1 Completed 0 4h15m
nvidia-dcgm-exporter-9r87v 1/1 Running 0 4h16m
nvidia-device-plugin-daemonset-fp7hm 1/1 Running 0 4h16m
nvidia-operator-validator-hstkb 1/1 Running 0 4h16m
b) nvidia-smi workload works well
$ cat test-pod-smi.yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    command: ["bash", "-c"]
    args:
    - |-
      export PATH="$PATH:/home/kubernetes/bin/nvidia/bin";
      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/kubernetes/bin/nvidia/lib64;
      nvidia-smi;
    resources:
      limits:
        nvidia.com/gpu: "1"
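For completeness, the pod was created from this manifest in the usual way (sketch):
$ kubectl apply -f test-pod-smi.yaml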
$ kubectl logs my-gpu-pod
Thu Apr 4 23:16:57 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA L4 Off | 00000000:00:03.0 Off | 0 |
| N/A 36C P0 16W / 72W | 4MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
However, after switching to v0.15.0-rc.2, the device plugin hits the error below and goes into a crash loop:
E0404 18:39:43.479163 1 main.go:132] error starting plugins: error creating plugin manager: unable to create cdi spec file: failed to get CDI spec: failed to create discoverer for common entities: failed to create discoverer for driver files: failed to create discoverer for driver libraries: failed to get libraries for driver version: failed to locate libcuda.so.535.129.03: pattern libcuda.so.535.129.03 not found
The device plugin uses the same configuration as before:
NVIDIA_DRIVER_ROOT=/
CONTAINER_DRIVER_ROOT=/host
NVIDIA_CTK_PATH=/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/home/kubernetes/bin/nvidia/lib64
PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/kubernetes/bin/nvidia/bin
NVIDIA_DRIVER_ROOT=/ is used to discover the devices, and PATH/LD_LIBRARY_PATH are used to discover the libraries (libcuda.so.535.129.03 actually lives under /home/kubernetes/bin/nvidia/lib64).
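As a sanity check, the library location can be confirmed from inside the device plugin container (a sketch, assuming the pod name from the v0.14.5 listing above, that the image ships a shell, and that the host root is mounted at /host as CONTAINER_DRIVER_ROOT indicates); it shows libcuda sitting under the GKE install path rather than under the usual library directories of the driver root:
# /host = CONTAINER_DRIVER_ROOT (the host filesystem as seen by the plugin container)
$ kubectl exec -n gpu-operator nvidia-device-plugin-daemonset-fp7hm -- \
    sh -c 'ls /host/usr/lib/x86_64-linux-gnu/libcuda.so.* /host/usr/lib64/libcuda.so.* 2>/dev/null; \
           ls /host/home/kubernetes/bin/nvidia/lib64/libcuda.so.*'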
Did something change in the new version that causes this error? Thanks
/cc @cdesiniotis @elezar @bobbypage
This should be addressed by #666.