gpu-operator
nvidia-operator-validator toolkit-validation fails Init:CrashLoopBackOff
Hello, we are trying to configure GPU Operator v24.9.2 on an RKE2 v1.31.4+rke2r1 cluster. The validator pod fails while validating the toolkit, with the error below:
Warning Failed 3m29s (x4 over 4m51s) kubelet Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: load library failed: libnvidia-container-go.so.1: cannot open shared object file: no such file or directory: unknown
Warning BackOff 3m3s (x9 over 4m51s) kubelet Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-m8tcr_gpu-operator(f58b31b2-9d43-44db-a2be-b77a0663a84c)
Here is the location of the nvidia-container-cli.real binary:
[toolkit]# pwd
/usr/local/nvidia/toolkit
[toolkit]# ls -la
total 38260
drwxr-xr-x. 3 root root 4096 Mar 25 15:35 .
drwxr-xr-x. 3 root root 21 Mar 25 15:35 ..
drwxr-xr-x. 3 root root 38 Mar 25 15:35 .config
lrwxrwxrwx. 1 root root 32 Mar 25 15:35 libnvidia-container-go.so.1 -> libnvidia-container-go.so.1.17.5
-rwxr-xr-x. 1 root root 2959416 Mar 25 15:35 libnvidia-container-go.so.1.17.5
lrwxrwxrwx. 1 root root 29 Mar 25 15:35 libnvidia-container.so.1 -> libnvidia-container.so.1.17.5
-rwxr-xr-x. 1 root root 205272 Mar 25 15:35 libnvidia-container.so.1.17.5
-rwxr-xr-x. 1 root root 105 Mar 25 15:35 nvidia-cdi-hook
-rwxr-xr-x. 1 root root 5382536 Mar 25 15:35 nvidia-cdi-hook.real
-rwxr-xr-x. 1 root root 154 Mar 25 15:35 nvidia-container-cli
-rwxr-xr-x. 1 root root 55632 Mar 25 15:35 nvidia-container-cli.real
-rwxr-xr-x. 1 root root 342 Mar 25 15:35 nvidia-container-runtime
-rwxr-xr-x. 1 root root 346 Mar 25 15:35 nvidia-container-runtime.cdi
-rwxr-xr-x. 1 root root 5665032 Mar 25 15:35 nvidia-container-runtime.cdi.real
-rwxr-xr-x. 1 root root 203 Mar 25 15:35 nvidia-container-runtime-hook
-rwxr-xr-x. 1 root root 4135784 Mar 25 15:35 nvidia-container-runtime-hook.real
-rwxr-xr-x. 1 root root 349 Mar 25 15:35 nvidia-container-runtime.legacy
-rwxr-xr-x. 1 root root 5665064 Mar 25 15:35 nvidia-container-runtime.legacy.real
-rwxr-xr-x. 1 root root 5665032 Mar 25 15:35 nvidia-container-runtime.real
lrwxrwxrwx. 1 root root 29 Mar 25 15:35 nvidia-container-toolkit -> nvidia-container-runtime-hook
-rwxr-xr-x. 1 root root 100 Mar 25 15:35 nvidia-ctk
-rwxr-xr-x. 1 root root 9384600 Mar 25 15:35 nvidia-ctk.real
[ toolkit]#
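As a quick sanity check on a listing like the one above, the following is a minimal sketch that reports any symlink in the toolkit directory whose target does not exist. The function name check_symlinks is hypothetical, and the default path is simply the one shown in the listing. In the listing the links appear to point at files that are present, so a clean result here would only rule out a dangling link on the host; it would not by itself explain the error.

```shell
#!/bin/sh
# Hypothetical helper: print one "BROKEN: <path>" line for every symlink
# in the given directory whose target cannot be resolved.
# Prints nothing when every link resolves (or the directory is empty/missing).
check_symlinks() {
    dir="$1"
    for entry in "$dir"/*; do
        # -L: entry is a symlink; ! -e: following the link finds no file
        if [ -L "$entry" ] && [ ! -e "$entry" ]; then
            echo "BROKEN: $entry"
        fi
    done
}

# Default path is taken from the listing above; pass another directory to override.
check_symlinks "${1:-/usr/local/nvidia/toolkit}"
```

Running it with no arguments checks /usr/local/nvidia/toolkit; no output means no dangling links were found there.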
Note: I had to upgrade the container-toolkit version from v1.17.4 to v1.17.5 because the RHEL version was missing the nvidia-container-runtime.cdi and nvidia-container-runtime.cdi.real files.
What does the following error message mean?
cannot open shared object file: no such file or directory: unknown