nvidia-container-runtime prevents pods from terminating
It seems that /usr/local/nvidia/toolkit/nvidia-container-runtime can fail if it is run from a working directory that no longer exists. I can see the following in kubelet.log:
E0201 15:05:42.625590 39292 kuberuntime_container.go:744] "Kill container failed" err="rpc error: code = Unknown desc = failed to kill container \"f68442bfd29c61ecb03d1016d1c5291ed92d527f01a8f7229a1f744bbba8d0d9\": unknown error after kill: /usr/local/nvidia/toolkit/nvidia-container-runtime did not terminate successfully: exit status 1: shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory\njob-working-directory: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory\njob-working-directory: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory\n2022/02/01 15:05:07 Error running [/usr/local/nvidia/toolkit/nvidia-container-runtime.real --root /run/containerd/runc/k8s.io --log /run/containerd/io.containerd.runtime.v2.task/k8s.io/f68442bfd29c61ecb03d1016d1c5291ed92d527f01a8f7229a1f744bbba8d0d9/log.json --log-format json kill f68442bfd29c61ecb03d1016d1c5291ed92d527f01a8f7229a1f744bbba8d0d9 9]: error creating runtime: error constructing OCI specification: error getting OCI specification file path: error getting working directory: getwd: no such file or directory\n: unknown" pod="boinc/k8s-cz-boinc-6cf8f5f494-8l778" podUID=9f46e764-35e6-43af-8690-9adfc5105248 containerName="boinc" containerID={Type:containerd ID:f68442bfd29c61ecb03d1016d1c5291ed92d527f01a8f7229a1f744bbba8d0d9}
The fix for this seems to be trivial:
--- /usr/local/nvidia/toolkit/nvidia-container-runtime.orig 2022-02-01 15:57:07.824271267 +0100
+++ /usr/local/nvidia/toolkit/nvidia-container-runtime 2022-02-01 15:56:25.044263146 +0100
@@ -1,5 +1,7 @@
#! /bin/sh
+cd /
+
cat /proc/modules | grep -e "^nvidia " >/dev/null 2>&1
if [ "${?}" != "0" ]; then
echo "nvidia driver modules are not yet loaded, invoking runc directly"
Could it be merged?
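For anyone who wants to see the effect outside of Kubernetes, here is a minimal sketch of the failure mode and of why `cd /` helps. It does not invoke nvidia-container-runtime at all; `run_from_deleted_dir` is a hypothetical helper written only for this illustration, and `pwd -P` stands in for any program that has to resolve its working directory.

```shell
#!/bin/sh
# Sketch only: run_from_deleted_dir is a hypothetical helper, not part of the toolkit.
# It runs the given command from a working directory that has already been removed.
run_from_deleted_dir() {
    (
        tmp=$(mktemp -d) || exit 1
        cd "$tmp" || exit 1
        rmdir "$tmp" || exit 1
        "$@"
    )
}

# Resolving the current directory fails once the directory is gone.
if run_from_deleted_dir env pwd -P >/dev/null 2>&1; then
    echo "pwd unexpectedly succeeded"
else
    echo "pwd failed in the deleted directory"
fi

# Changing to / first, as in the patch above, gives the process a valid working directory again.
run_from_deleted_dir sh -c 'cd / && env pwd -P'
```

Depending on the shell, the second command may still print a shell-init warning on stderr (as in the log above) before the `cd /` takes effect.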
@xhejtman this change would have to be made in the nvidia-container-runtime in the NVIDIA Container Toolkit repository.
Note that the call to os.Getwd there was removed in the following commit, so a newer version of the container-toolkit image should no longer show the problem that you are seeing. Which version are you using?
container-toolkit:1.7.1-ubuntu18.04
@xhejtman that is using nvidia-container-runtime v3.5.0 which still contains the call to os.Getwd.
We are prepping for release of 1.8.0 of the container toolkit, and if this is an environment where you are able to experiment (i.e. non-production), you could try the container-toolkit:1.8.0-rc.2-ubuntu18.04 image to see whether this addresses the behaviour that you are seeing.
@xhejtman Please verify this with v1.9.0 of container-toolkit.