
nvidia-container-runtime prevents pods from terminating

Open xhejtman opened this issue 3 years ago • 4 comments

It seems that /usr/local/nvidia/toolkit/nvidia-container-runtime can fail if it is run from a working directory that no longer exists. I can see the following in kubelet.log:

E0201 15:05:42.625590   39292 kuberuntime_container.go:744] "Kill container failed" err="rpc error: code = Unknown desc = failed to kill container \"f68442bfd29c61ecb03d1016d1c5291ed92d527f01a8f7229a1f744bbba8d0d9\": unknown error after kill: /usr/local/nvidia/toolkit/nvidia-container-runtime did not terminate successfully: exit status 1: shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory\njob-working-directory: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory\njob-working-directory: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory\n2022/02/01 15:05:07 Error running [/usr/local/nvidia/toolkit/nvidia-container-runtime.real --root /run/containerd/runc/k8s.io --log /run/containerd/io.containerd.runtime.v2.task/k8s.io/f68442bfd29c61ecb03d1016d1c5291ed92d527f01a8f7229a1f744bbba8d0d9/log.json --log-format json kill f68442bfd29c61ecb03d1016d1c5291ed92d527f01a8f7229a1f744bbba8d0d9 9]: error creating runtime: error constructing OCI specification: error getting OCI specification file path: error getting working directory: getwd: no such file or directory\n: unknown" pod="boinc/k8s-cz-boinc-6cf8f5f494-8l778" podUID=9f46e764-35e6-43af-8690-9adfc5105248 containerName="boinc" containerID={Type:containerd ID:f68442bfd29c61ecb03d1016d1c5291ed92d527f01a8f7229a1f744bbba8d0d9}

The fix for this seems to be trivial:

--- /usr/local/nvidia/toolkit/nvidia-container-runtime.orig	2022-02-01 15:57:07.824271267 +0100
+++ /usr/local/nvidia/toolkit/nvidia-container-runtime	2022-02-01 15:56:25.044263146 +0100
@@ -1,5 +1,7 @@
 #! /bin/sh
 
+cd /
+
 cat /proc/modules | grep -e "^nvidia " >/dev/null 2>&1
 if [ "${?}" != "0" ]; then
 	echo "nvidia driver modules are not yet loaded, invoking runc directly"

Could it be merged?
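
For anyone who wants to reproduce the underlying condition outside of Kubernetes, here is a minimal sketch. It assumes a Linux host and uses a scratch directory from mktemp as a stand-in for the removed container working directory; the variable names are made up for illustration, and it does not invoke nvidia-container-runtime itself. It only shows that resolving the current directory fails once the directory has been removed, and that a `cd /` beforehand (as in the patch above) avoids that.

```shell
#!/bin/sh
# Sketch: reproduce a "working directory no longer exists" failure and the
# effect of the proposed `cd /` workaround. Linux behaviour is assumed.

workdir=$(mktemp -d)      # hypothetical scratch directory standing in for the deleted cwd
cd "$workdir" || exit 1
rmdir "$workdir"          # the process's current directory is now gone

# Without the workaround: try to resolve the physical working directory.
# On Linux this is expected to fail because getcwd cannot resolve a removed directory.
before=$(/bin/pwd -P 2>/dev/null)
before_status=$?
echo "without cd /: status=$before_status"

# With the workaround from the patch: move to a directory that always exists.
cd /
after=$(/bin/pwd -P)
after_status=$?
echo "with cd /: status=$after_status cwd=$after"
```

The wrapper script is in the same position as the first `pwd` call here: it is started by containerd from a directory that has since been removed, so anything it runs that asks for the working directory inherits the problem unless the script changes directory first.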

xhejtman avatar Feb 01 '22 20:02 xhejtman

@xhejtman this change would have to be made in the nvidia-container-runtime in the NVIDIA Container Toolkit repository.

Note that the call to os.Getwd there was removed in the following commit, so using a newer version of the container-toolkit image should no longer show the problem that you are seeing. Which version are you using?

elezar avatar Feb 02 '22 11:02 elezar

container-toolkit:1.7.1-ubuntu18.04

xhejtman avatar Feb 02 '22 11:02 xhejtman

@xhejtman that is using nvidia-container-runtime v3.5.0 which still contains the call to os.Getwd.

We are prepping for release of 1.8.0 of the container toolkit, and if this is an environment where you are able to experiment (i.e. non-production), you could try the container-toolkit:1.8.0-rc.2-ubuntu18.04 image to see whether this addresses the behaviour that you are seeing.

elezar avatar Feb 02 '22 11:02 elezar

@xhejtman Please verify this with v1.9.0 of container-toolkit.

shivamerla avatar Mar 23 '22 17:03 shivamerla