
Create CDI spec error "libcuda.so.535.129.03 not found" in version "v0.15.0-rc.2"

Dragoncell opened this issue 1 year ago · 2 comments

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): COS
  • Kernel Version: Linux 6.1
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): GKE

2. Issue or feature description

I am in the process of adding support for https://github.com/NVIDIA/gpu-operator/issues/659

With the change below: https://github.com/NVIDIA/gpu-operator/compare/master...Dragoncell:gpu-operator:master-gke

the device plugin works well with version v0.14.5:

helm upgrade -i --create-namespace --namespace gpu-operator noperator deployments/gpu-operator \
  --set driver.enabled=false \
  --set cdi.enabled=true \
  --set cdi.default=true \
  --set operator.runtimeClass=nvidia-cdi \
  --set hostRoot=/ \
  --set driverRoot=/home/kubernetes/bin/nvidia \
  --set devRoot=/ \
  --set operator.repository=gcr.io/jiamingxu-gke-dev \
  --set operator.version=v0418 \
  --set toolkit.installDir=/home/kubernetes/bin/nvidia \
  --set toolkit.repository=gcr.io/jiamingxu-gke-dev \
  --set toolkit.version=v4 \
  --set validator.repository=gcr.io/jiamingxu-gke-dev \
  --set validator.version=v0412_3 \
  --set devicePlugin.version=v0.14.5
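
As a sanity check (my own, not part of the original report), the rendered settings can be confirmed on the deployed device plugin daemonset; the daemonset name is inferred from the pod listing below, so adjust it if yours differs:

$ kubectl -n gpu-operator get ds nvidia-device-plugin-daemonset -o yaml \
    | grep -A1 -E 'NVIDIA_DRIVER_ROOT|NVIDIA_CTK_PATH|LD_LIBRARY_PATH'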

a) Pods are running

$ kubectl get pods -n gpu-operator
NAME                                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-rr2x2                                1/1     Running     0          4h16m
gpu-operator-66575c8958-sslch                              1/1     Running     0          4h16m
noperator-node-feature-discovery-gc-6968c7c64-g7w7r        1/1     Running     0          4h16m
noperator-node-feature-discovery-master-749679f664-dvs48   1/1     Running     0          4h16m
noperator-node-feature-discovery-worker-glhxw              1/1     Running     0          4h16m
nvidia-container-toolkit-daemonset-wvpvx                   1/1     Running     0          4h16m
nvidia-cuda-validator-z84ks                                0/1     Completed   0          4h15m
nvidia-dcgm-exporter-9r87v                                 1/1     Running     0          4h16m
nvidia-device-plugin-daemonset-fp7hm                       1/1     Running     0          4h16m
nvidia-operator-validator-hstkb                            1/1     Running     0          4h16m

b) nvidia-smi workload works well

$ cat test-pod-smi.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    command: ["bash", "-c"]
    args: 
    - |-
      export PATH="$PATH:/home/kubernetes/bin/nvidia/bin";
      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/kubernetes/bin/nvidia/lib64;
      nvidia-smi;
    resources:
      limits: 
        nvidia.com/gpu: "1"

$ kubectl logs  my-gpu-pod 
Thu Apr  4 23:16:57 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   36C    P0              16W /  72W |      4MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

However, after switching to v0.15.0-rc.2, I hit the error below in the device plugin log, and the pod is in a crash loop:

E0404 18:39:43.479163       1 main.go:132] error starting plugins: error creating plugin manager: unable to create cdi spec file: failed to get CDI spec: failed to create discoverer for common entities: failed to create discoverer for driver files: failed to create discoverer for driver libraries: failed to get libraries for driver version: failed to locate libcuda.so.535.129.03: pattern libcuda.so.535.129.03 not found
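
For context, a quick check of my own along these lines (paths are illustrative, not the plugin's exact search list) can show whether this is simply a search-path mismatch. It is run inside the device-plugin container, where the host filesystem is mounted at /host (see CONTAINER_DRIVER_ROOT below):

# The driver library is present under the GKE install dir ...
$ ls -l /host/home/kubernetes/bin/nvidia/lib64/libcuda.so.*
# ... but not under the standard locations relative to NVIDIA_DRIVER_ROOT=/:
$ ls /host/usr/lib64/libcuda.so.* /host/usr/lib/x86_64-linux-gnu/libcuda.so.* 2>/dev/null || echo "not found under /"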

The device plugin used the same configuration as below:

NVIDIA_DRIVER_ROOT=/
CONTAINER_DRIVER_ROOT=/host
NVIDIA_CTK_PATH=/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/home/kubernetes/bin/nvidia/lib64
PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/kubernetes/bin/nvidia/bin

NVIDIA_DRIVER_ROOT=/ is used to discover the devices, and PATH/LD_LIBRARY_PATH are used to discover the libraries (libcuda.so.535.129.03 is actually under /home/kubernetes/bin/nvidia/lib64).
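
For what it's worth, a manual CDI generation with the bundled nvidia-ctk can be used to check whether pointing the driver root at the actual install location resolves the lookup (flag names are assumed from the toolkit version I have on hand; verify with nvidia-ctk cdi generate --help):

$ /home/kubernetes/bin/nvidia/toolkit/nvidia-ctk cdi generate \
    --driver-root=/home/kubernetes/bin/nvidia \
    --output=/tmp/nvidia-cdi.yaml
$ grep libcuda /tmp/nvidia-cdi.yaml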

Did something change in the new version that causes this error? Thanks.

Dragoncell · Apr 04 '24 23:04

/cc @cdesiniotis @elezar @bobbypage

Dragoncell · Apr 04 '24 23:04

This should be addressed by #666.

elezar · Apr 23 '24 07:04