Failed to get sandbox runtime: no runtime for nvidia is configured

Open · Bec-k opened this issue 3 years ago · 39 comments

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • [ ] Are you running on an Ubuntu 18.04 node?
  • [x] Are you running Kubernetes v1.13+?
  • [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • [ ] Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • [x] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)

1. Issue or feature description

nov 02 18:00:58 beck containerd[10237]: time="2022-11-02T18:00:58.738797825+02:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:gpu-feature-discovery-qfjgk,Uid:02c7d4ad-db02-4145-846b-616a94416008,Namespace:gpu-operator,Attempt:2,} failed, error" error="failed to get sandbox runtime: no runtime for \"nvidia\" is configured"

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

  • [ ] kubernetes pods status: kubectl get pods --all-namespaces

  • [ ] kubernetes daemonset status: kubectl get ds --all-namespaces

  • [ ] If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME

  • [ ] If a pod/ds is in an error state or pending state kubectl logs -n NAMESPACE POD_NAME

  • [ ] Output of running a container on the GPU machine: docker run -it alpine echo foo

  • [ ] Docker configuration file: cat /etc/docker/daemon.json

  • [ ] Docker runtime configuration: docker info | grep runtime

  • [ ] NVIDIA shared directory: ls -la /run/nvidia

  • [ ] NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit

  • [ ] NVIDIA driver directory: ls -la /run/nvidia/driver

  • [ ] kubelet logs journalctl -u kubelet > kubelet.logs

(base) beck@beck:/$ ls -la /run/nvidia/
total 4
drwxr-xr-x  4 root root  100 nov  2 18:48 .
drwxr-xr-x 39 root root 1140 nov  2 18:47 ..
drwxr-xr-x  2 root root   40 nov  2 17:59 driver
-rw-r--r--  1 root root    7 nov  2 18:48 toolkit.pid
drwxr-xr-x  2 root root   80 nov  2 18:48 validations

Driver folder is empty:

(base) beck@beck:/$ ls -la /run/nvidia/driver/
total 0
drwxr-xr-x 2 root root 40 nov  2 17:59 .
drwxr-xr-x 4 root root 80 nov  2 18:48 ..

Bec-k avatar Nov 02 '22 16:11 Bec-k

(base) beck@beck:/$ sudo ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 \
    cuda-11.0.3-base-ubuntu20.04 nvidia-smi
Wed Nov  2 16:50:04 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   53C    P0    46W /  N/A |    601MiB /  8192MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Bec-k avatar Nov 02 '22 16:11 Bec-k

When I launch the nvidia/cuda image via the containerd CLI, it correctly detects and shows my NVIDIA GeForce video card, but for some reason the GPU isn't visible inside pods when deployed via Helm.

Bec-k avatar Nov 02 '22 16:11 Bec-k

Can you run kubectl get pods -n gpu-operator to show which pods are running? If you deployed with the driver enabled, it takes 3-5 minutes for the drivers to be installed, followed by the nvidia runtime setup. If you have already installed them on the host, please specify --set driver.enabled=false --set toolkit.enabled=false.

shivamerla avatar Nov 02 '22 16:11 shivamerla
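
(For reference, a minimal Helm install with both of those components disabled might look like the sketch below; the repo alias and release name are placeholders, and the chart is assumed to come from the standard NVIDIA Helm repository:)

# add the NVIDIA chart repo and install with the host driver/toolkit reused
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace \
    --set driver.enabled=false \
    --set toolkit.enabled=false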

I was checking /etc/containerd/config.toml; it is constantly being changed back and forth. containerd keeps getting restarted by itself because it fails to clean up sandboxes and dead shims.

Bec-k avatar Nov 02 '22 16:11 Bec-k

(base) beck@beck:/$ kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS            RESTARTS      AGE
gpu-feature-discovery-w7vk6                                       1/1     Running           0             6m23s
gpu-operator-59b9d49c6f-7282l                                     1/1     Running           0             6m41s
nvidia-container-toolkit-daemonset-9rvz8                          1/1     Running           7 (67s ago)   5m56s
nvidia-cuda-validator-7mp9j                                       0/1     Init:0/1          0             4m4s
nvidia-dcgm-exporter-2ktzc                                        0/1     PodInitializing   0             6m24s
nvidia-device-plugin-daemonset-wvvh4                              0/1     PodInitializing   0             5m57s
nvidia-gpu-operator-node-feature-discovery-master-68495df8t9vd7   1/1     Running           0             6m41s
nvidia-gpu-operator-node-feature-discovery-worker-8gc88           1/1     Running           0             6m40s
nvidia-gpu-operator-node-feature-discovery-worker-stwpp           1/1     Running           9 (26s ago)   5m58s
nvidia-operator-validator-ptdgd                                   0/1     Init:Error        0             5m55s

Bec-k avatar Nov 02 '22 16:11 Bec-k

You can disable the toolkit as well by running kubectl edit clusterpolicy and setting toolkit.enabled=false. It looks like you have nvidia-container-runtime already configured on the host and the containerd config updated manually?

shivamerla avatar Nov 02 '22 16:11 shivamerla
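
(A non-interactive way to make the same change, assuming the default ClusterPolicy instance name cluster-policy created by the chart:)

kubectl patch clusterpolicy/cluster-policy --type merge \
    -p '{"spec": {"toolkit": {"enabled": false}}}'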

Can you also paste the logs of the nvidia-container-toolkit-daemonset-9rvz8 pod? I'm curious as to why it is restarting. Which containerd and OS versions are these?

shivamerla avatar Nov 02 '22 16:11 shivamerla

Nope, that didn't help. I updated it, the pod was removed, and it is still complaining about:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

Bec-k avatar Nov 02 '22 17:11 Bec-k
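
(For context, this error means the containerd CRI plugin has no "nvidia" runtime entry in /etc/containerd/config.toml at the moment the sandbox is created. Once the toolkit has finished configuring containerd, a block roughly like the one below should be present; the BinaryName path is the toolkit's usual install location under /usr/local/nvidia/toolkit, not taken from this particular node:)

# expected nvidia runtime registration in /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"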

I have removed all pods to trigger everything from scratch.

Bec-k avatar Nov 02 '22 17:11 Bec-k

(base) beck@beck:/$ kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-8b8ls                                       0/1     Init:0/1   0          115s
gpu-operator-59b9d49c6f-gkk4j                                     1/1     Running    0          2m20s
nvidia-dcgm-exporter-6bmlt                                        0/1     Init:0/1   0          115s
nvidia-device-plugin-daemonset-f7xgb                              0/1     Init:0/1   0          117s
nvidia-gpu-operator-node-feature-discovery-master-68495df8kscw7   1/1     Running    0          2m20s
nvidia-gpu-operator-node-feature-discovery-worker-pcxwq           1/1     Running    0          2m20s
nvidia-gpu-operator-node-feature-discovery-worker-s2jjn           1/1     Running    0          2m20s
nvidia-operator-validator-rwt6z                                   0/1     Init:0/4   0          117s

Bec-k avatar Nov 02 '22 17:11 Bec-k

Here are the errors from the systemd containerd logs: https://gist.github.com/denissabramovs/a77e97972b5aa01c86955d812d3e8188

Bec-k avatar Nov 02 '22 17:11 Bec-k

Here is the updated, latest one: https://gist.github.com/denissabramovs/2272051bb2f684f623cd15273ea6dd25

Bec-k avatar Nov 02 '22 17:11 Bec-k

At least containerd is no longer constantly restarting; it has already been up for 9 minutes:

● containerd.service - containerd container runtime
     Loaded: loaded (/lib/systemd/system/containerd.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2022-11-02 18:57:44 EET; 9min ago

Bec-k avatar Nov 02 '22 17:11 Bec-k

All 3 systemd services are up and running on the GPU node:

(base) beck@beck:/$ sudo systemctl status --no-pager kubelet containerd docker | grep active
     Active: active (running) since Wed 2022-11-02 18:57:49 EET; 14min ago
     Active: active (running) since Wed 2022-11-02 18:57:44 EET; 14min ago
     Active: active (running) since Wed 2022-11-02 19:09:12 EET; 3min 11s ago

Bec-k avatar Nov 02 '22 17:11 Bec-k

Sorry, missed your message. Here it is:

(base) beck@beck:/$ cat /etc/os-release 
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Bec-k avatar Nov 02 '22 17:11 Bec-k

(base) beck@beck:/$ containerd --version
containerd containerd.io 1.6.9 1c90a442489720eec95342e1789ee8a5e1b9536f

Bec-k avatar Nov 02 '22 17:11 Bec-k

@denissabramovs this is a wild guess: are you using containerd 1.6.9? I believe we had problems with this version and the operator. We downgraded to containerd 1.6.8 and things started working again.

wjentner avatar Nov 02 '22 17:11 wjentner
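
(For anyone wanting to try the same downgrade on an Ubuntu node where containerd.io comes from the Docker apt repository, something along these lines should work; take the exact 1.6.8 version string from the madison output rather than the illustrative one used here:)

apt-cache madison containerd.io                  # list the available package versions
sudo apt-get install -y --allow-downgrades containerd.io=1.6.8-1
sudo apt-mark hold containerd.io                 # keep it from being upgraded back to 1.6.9
sudo systemctl restart containerd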

After the downgrade, containerd reports revision=9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6 version=1.6.8:

nov 02 19:20:49 beck containerd[202761]: time="2022-11-02T19:20:49.723337417+02:00" level=info msg="starting containerd" revision=9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6 version=1.6.8
...
...
...
nov 02 19:22:34 beck containerd[202761]: time="2022-11-02T19:22:34.246953180+02:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:nvidia-device-plugin-daemonset-lbnzw,Uid:fd4f1d3f-29d2-4d11-a724-96f4ed107cd5,Namespace:gpu-operator,Attempt:0,} failed, error" error="failed to get sandbox runtime: no runtime for \"nvidia\" is configured"

Killed/re-scheduled all pods in gpu-operator namespace after downgrading containerd.

Bec-k avatar Nov 02 '22 17:11 Bec-k

Oh wow! @wjentner, you were actually right. I re-enabled the above-mentioned toolkit, and after the downgrade it finished without problems and all pods are up and running now!

Bec-k avatar Nov 02 '22 17:11 Bec-k

Good that I captured both logs, @shivamerla; adding those below.

These logs are from failing toolkit: https://gist.github.com/denissabramovs/0c3ad150ea2b9450a91b430a91704d08

These from successful toolkit: https://gist.github.com/denissabramovs/343c8fb0169866133fa1cc35b9d5365c

Hope this helps to find and resolve the problem. It seems the two logs are different after all.

Bec-k avatar Nov 02 '22 17:11 Bec-k

Thanks @denissabramovs, will check these out and try to repro with containerd version 1.6.9.

shivamerla avatar Nov 02 '22 17:11 shivamerla

If you aren't able to reproduce it, please ping me and I'll try to reproduce it locally again. Then we could catch the issue and possibly put a patch together. In any case, thank you, guys.

Bec-k avatar Nov 02 '22 17:11 Bec-k

Issue diagnosed and workaround MR can be found here: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/568

klueska avatar Nov 08 '22 14:11 klueska

@klueska thanks! When will this be released? I assume it has also been tested with containerd 1.6.10, which was released recently?

wjentner avatar Dec 03 '22 00:12 wjentner

Hi @denissabramovs @wjentner. We just released v22.9.1. This includes the workaround mentioned above for resolving the containerd issues. Please give it a try and let us know if there are any issues.

cdesiniotis avatar Dec 14 '22 02:12 cdesiniotis
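
(Upgrading an existing Helm release to the new version would look roughly like this; the release name matches the earlier install sketch, and the exact chart version string can be confirmed with helm search repo nvidia/gpu-operator:)

helm repo update
helm upgrade gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --version v22.9.1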

Thanks @cdesiniotis, I can confirm that it works with containerd 1.6.12 as well. Edit: 1.6.14 is also working.

wjentner avatar Dec 16 '22 17:12 wjentner

Hi @cdesiniotis @klueska

It seems I have exactly the same issue with:

OS: CentOS 7.9.2009
Kernel: 3.10.0-1160.76.1.el7.x86_64
containerd: 1.6.9 & 1.6.14 (tested both)
GPU Operator: v22.9.1

My nvidia-driver-daemonset is looping. The module build seems OK and I see the modules appear in lsmod, but after a few seconds they disappear and everything restarts.

It fails after:

nvidia-driver-ctr Post-install sanity check passed.
nvidia-driver-ctr
nvidia-driver-ctr Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 525.60.13) is now complete.
nvidia-driver-ctr
nvidia-driver-ctr Parsing kernel module parameters...
nvidia-driver-ctr Loading ipmi and i2c_core kernel modules...
nvidia-driver-ctr Loading NVIDIA driver kernel modules...
nvidia-driver-ctr + modprobe nvidia
nvidia-driver-ctr + modprobe nvidia-uvm
nvidia-driver-ctr + modprobe nvidia-modeset
nvidia-driver-ctr + set +o xtrace -o nounset
nvidia-driver-ctr Starting NVIDIA persistence daemon...
nvidia-driver-ctr ls: cannot access /proc/driver/nvidia-nvswitch/devices/*: No such file or directory
nvidia-driver-ctr Mounting NVIDIA driver rootfs...
nvidia-driver-ctr Done, now waiting for signal
nvidia-driver-ctr Caught signal
nvidia-driver-ctr Stopping NVIDIA persistence daemon...
nvidia-driver-ctr Unloading NVIDIA driver kernel modules...
nvidia-driver-ctr Unmounting NVIDIA driver rootfs...

If I downgrade containerd to 1.6.8, everything is fixed.

tuxtof avatar Jan 09 '23 14:01 tuxtof
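
(On CentOS 7, assuming containerd.io was installed from the Docker yum repository, the equivalent downgrade would be roughly:)

yum --showduplicates list containerd.io    # confirm a 1.6.8 build is available
sudo yum downgrade -y containerd.io-1.6.8*
sudo systemctl restart containerd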

There is another issue with containerd: https://github.com/containerd/containerd/issues/7843

If containerd is restarted (version 1.6.9 and above), most pods are restarted too, so together with the nvidia-container-toolkit pod they end up in an endless restart loop: the toolkit tries to restart containerd, which restarts the toolkit and driver, and everything loops again. There is a fix for containerd, but it may not have landed everywhere yet.

@tuxtof, I think you are hitting exactly this issue.

xhejtman avatar Jan 10 '23 01:01 xhejtman
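
(One quick way to see whether containerd itself is caught in that loop on a node is to count its startup banner lines, the same "starting containerd" messages shown in the logs above:)

journalctl -u containerd --since "1 hour ago" | grep -c "starting containerd"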

thanks @xhejtman for linking the relevant issue.

shivamerla avatar Jan 10 '23 05:01 shivamerla

Thanks @xhejtman.

So what is the situation? Is the GPU operator no longer working with containerd versions 1.6.9 and above?

tuxtof avatar Jan 10 '23 06:01 tuxtof