
nvidia-operator-validator toolkit-validation fails Init:CrashLoopBackOff

Open RangaSamudrala opened this issue 9 months ago • 3 comments

Hello, we are trying to configure GPU Operator v24.9.2 on an RKE2 v1.31.4+rke2r1 cluster. The validator pod fails while validating the toolkit, with the error below:

 Warning  Failed     3m29s (x4 over 4m51s)  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: load library failed: libnvidia-container-go.so.1: cannot open shared object file: no such file or directory: unknown
 Warning  BackOff    3m3s (x9 over 4m51s)   kubelet            Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-m8tcr_gpu-operator(f58b31b2-9d43-44db-a2be-b77a0663a84c)

Here is the location of the nvidia-container-cli.real binary, along with the contents of its directory:

[toolkit]# pwd
/usr/local/nvidia/toolkit
[toolkit]# ls -la
total 38260
drwxr-xr-x. 3 root root    4096 Mar 25 15:35 .
drwxr-xr-x. 3 root root      21 Mar 25 15:35 ..
drwxr-xr-x. 3 root root      38 Mar 25 15:35 .config
lrwxrwxrwx. 1 root root      32 Mar 25 15:35 libnvidia-container-go.so.1 -> libnvidia-container-go.so.1.17.5
-rwxr-xr-x. 1 root root 2959416 Mar 25 15:35 libnvidia-container-go.so.1.17.5
lrwxrwxrwx. 1 root root      29 Mar 25 15:35 libnvidia-container.so.1 -> libnvidia-container.so.1.17.5
-rwxr-xr-x. 1 root root  205272 Mar 25 15:35 libnvidia-container.so.1.17.5
-rwxr-xr-x. 1 root root     105 Mar 25 15:35 nvidia-cdi-hook
-rwxr-xr-x. 1 root root 5382536 Mar 25 15:35 nvidia-cdi-hook.real
-rwxr-xr-x. 1 root root     154 Mar 25 15:35 nvidia-container-cli
-rwxr-xr-x. 1 root root   55632 Mar 25 15:35 nvidia-container-cli.real
-rwxr-xr-x. 1 root root     342 Mar 25 15:35 nvidia-container-runtime
-rwxr-xr-x. 1 root root     346 Mar 25 15:35 nvidia-container-runtime.cdi
-rwxr-xr-x. 1 root root 5665032 Mar 25 15:35 nvidia-container-runtime.cdi.real
-rwxr-xr-x. 1 root root     203 Mar 25 15:35 nvidia-container-runtime-hook
-rwxr-xr-x. 1 root root 4135784 Mar 25 15:35 nvidia-container-runtime-hook.real
-rwxr-xr-x. 1 root root     349 Mar 25 15:35 nvidia-container-runtime.legacy
-rwxr-xr-x. 1 root root 5665064 Mar 25 15:35 nvidia-container-runtime.legacy.real
-rwxr-xr-x. 1 root root 5665032 Mar 25 15:35 nvidia-container-runtime.real
lrwxrwxrwx. 1 root root      29 Mar 25 15:35 nvidia-container-toolkit -> nvidia-container-runtime-hook
-rwxr-xr-x. 1 root root     100 Mar 25 15:35 nvidia-ctk
-rwxr-xr-x. 1 root root 9384600 Mar 25 15:35 nvidia-ctk.real
[ toolkit]#
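One check that may help narrow this down: a "no such file or directory" error when loading a shared library can come from a symlink whose target is missing. The following is a minimal sketch, assuming Python 3 is available on the node; the helper name `find_dangling_symlinks` is hypothetical and is not part of the NVIDIA toolkit.

```python
import os


def find_dangling_symlinks(directory):
    """Return (link_name, target) pairs for symlinks whose target does not exist."""
    dangling = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # os.path.exists() follows symlinks, so it is False for a broken link.
        if os.path.islink(path) and not os.path.exists(path):
            dangling.append((name, os.readlink(path)))
    return dangling


if __name__ == "__main__":
    # Path taken from the listing above; adjust if the toolkit lives elsewhere.
    toolkit_dir = "/usr/local/nvidia/toolkit"
    if os.path.isdir(toolkit_dir):
        for name, target in find_dangling_symlinks(toolkit_dir):
            print(f"dangling: {name} -> {target}")
```

In the listing above, both library symlinks point at files that are present in the same directory, so this check may well report nothing. That would suggest the library exists but is not being found from wherever the hook is looking, rather than being absent.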

Note: I had to upgrade the container toolkit from v1.17.4 to v1.17.5 because the RHEL version of v1.17.4 was missing the nvidia-container-runtime.cdi and nvidia-container-runtime.cdi.real files.

What does the error message below mean?

 cannot open shared object file: no such file or directory: unknown

RangaSamudrala, Mar 25 '25 20:03