Failed to get sandbox runtime: no runtime for nvidia is configured
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node?
- [x] Are you running Kubernetes v1.13+?
- [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [ ] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes?
- [x] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
1. Issue or feature description
nov 02 18:00:58 beck containerd[10237]: time="2022-11-02T18:00:58.738797825+02:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:gpu-feature-discovery-qfjgk,Uid:02c7d4ad-db02-4145-846b-616a94416008,Namespace:gpu-operator,Attempt:2,} failed, error" error="failed to get sandbox runtime: no runtime for \"nvidia\" is configured"
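This error from containerd means that the CRI plugin has no runtime named "nvidia" registered in its configuration. A quick way to verify this, assuming the default config path and a typical gpu-operator toolkit install (a sketch, not output from this system):

```sh
# Check whether a "nvidia" runtime is registered with containerd's CRI plugin.
grep -A4 'containerd.runtimes.nvidia' /etc/containerd/config.toml
# A working config usually contains a section roughly like:
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#     runtime_type = "io.containerd.runc.v2"
#     [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#       BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
```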
2. Steps to reproduce the issue
3. Information to attach (optional if deemed irrelevant)
- [ ] kubernetes pods status: `kubectl get pods --all-namespaces`
- [ ] kubernetes daemonset status: `kubectl get ds --all-namespaces`
- [ ] If a pod/ds is in an error state or pending state: `kubectl describe pod -n NAMESPACE POD_NAME`
- [ ] If a pod/ds is in an error state or pending state: `kubectl logs -n NAMESPACE POD_NAME`
- [ ] Output of running a container on the GPU machine: `docker run -it alpine echo foo`
- [ ] Docker configuration file: `cat /etc/docker/daemon.json`
- [ ] Docker runtime configuration: `docker info | grep runtime`
- [ ] NVIDIA shared directory: `ls -la /run/nvidia`
- [ ] NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`
- [ ] NVIDIA driver directory: `ls -la /run/nvidia/driver`
- [ ] kubelet logs: `journalctl -u kubelet > kubelet.logs`
(base) beck@beck:/$ ls -la /run/nvidia/
total 4
drwxr-xr-x 4 root root 100 nov 2 18:48 .
drwxr-xr-x 39 root root 1140 nov 2 18:47 ..
drwxr-xr-x 2 root root 40 nov 2 17:59 driver
-rw-r--r-- 1 root root 7 nov 2 18:48 toolkit.pid
drwxr-xr-x 2 root root 80 nov 2 18:48 validations
Driver folder is empty:
(base) beck@beck:/$ ls -la /run/nvidia/driver/
total 0
drwxr-xr-x 2 root root 40 nov 2 17:59 .
drwxr-xr-x 4 root root 80 nov 2 18:48 ..
(base) beck@beck:/$ sudo ctr run --rm -t \
--runc-binary=/usr/bin/nvidia-container-runtime \
--env NVIDIA_VISIBLE_DEVICES=all \
docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 \
cuda-11.0.3-base-ubuntu20.04 nvidia-smi
Wed Nov 2 16:50:04 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 53C P0 46W / N/A | 601MiB / 8192MiB | 9% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
When I launch the nvidia/cuda image via the containerd CLI, it correctly detects and reports my NVIDIA GeForce video card, but for some reason the GPU isn't visible inside pods deployed via Helm.
Can you run `kubectl get pods -n gpu-operator` to show which pods are running? If you deployed with the driver enabled, it takes 3-5 minutes for the drivers to be installed, followed by the nvidia runtime setup. If you have already installed them on the host, please specify `--set driver.enabled=false --set toolkit.enabled=false`.
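For reference, a minimal sketch of such an install with pre-installed host drivers and toolkit (the release name, repo alias, and namespace here are assumptions; adjust to your environment):

```sh
# Install the GPU Operator without deploying the driver or container toolkit,
# relying on the versions already present on the host.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=false
```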
I was checking /etc/containerd/config.toml; it is constantly being changed back and forth.
containerd also keeps getting restarted by itself, because it fails to clean up sandboxes and dead shims.
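One way to confirm that the file really is being rewritten in a loop (a sketch, assuming the default config path):

```sh
# Print a checksum of containerd's config every 5 seconds; alternating values
# confirm that something keeps rewriting the file back and forth.
while true; do md5sum /etc/containerd/config.toml; sleep 5; done
```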
(base) beck@beck:/$ kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-w7vk6 1/1 Running 0 6m23s
gpu-operator-59b9d49c6f-7282l 1/1 Running 0 6m41s
nvidia-container-toolkit-daemonset-9rvz8 1/1 Running 7 (67s ago) 5m56s
nvidia-cuda-validator-7mp9j 0/1 Init:0/1 0 4m4s
nvidia-dcgm-exporter-2ktzc 0/1 PodInitializing 0 6m24s
nvidia-device-plugin-daemonset-wvvh4 0/1 PodInitializing 0 5m57s
nvidia-gpu-operator-node-feature-discovery-master-68495df8t9vd7 1/1 Running 0 6m41s
nvidia-gpu-operator-node-feature-discovery-worker-8gc88 1/1 Running 0 6m40s
nvidia-gpu-operator-node-feature-discovery-worker-stwpp 1/1 Running 9 (26s ago) 5m58s
nvidia-operator-validator-ptdgd 0/1 Init:Error 0 5m55s
You can disable the toolkit as well by running `kubectl edit clusterpolicy` and setting `toolkit.enabled=false`. It looks like you already have nvidia-container-runtime configured on the host and have updated the containerd config manually?
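A non-interactive equivalent would be something like the following (a sketch; the ClusterPolicy is usually named cluster-policy, but verify with `kubectl get clusterpolicies` first):

```sh
# Disable the toolkit in the ClusterPolicy without opening an editor.
kubectl patch clusterpolicy cluster-policy --type merge \
  -p '{"spec":{"toolkit":{"enabled":false}}}'
```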
Can you also paste the logs of the nvidia-container-toolkit-daemonset-9rvz8 pod? I'm curious why it is restarting. Which containerd and OS versions are these?
Nope, that didn't help. I updated it and the pod was removed, but it is still complaining about:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
I have removed all pods to trigger everything from scratch.
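For anyone following along, one way to do that is to delete everything in the namespace and let the DaemonSets and Deployments re-create the pods (a sketch, assuming the gpu-operator namespace used here):

```sh
# Delete all pods in the operator namespace; the controllers re-create them.
kubectl delete pods --all -n gpu-operator
```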
(base) beck@beck:/$ kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-8b8ls 0/1 Init:0/1 0 115s
gpu-operator-59b9d49c6f-gkk4j 1/1 Running 0 2m20s
nvidia-dcgm-exporter-6bmlt 0/1 Init:0/1 0 115s
nvidia-device-plugin-daemonset-f7xgb 0/1 Init:0/1 0 117s
nvidia-gpu-operator-node-feature-discovery-master-68495df8kscw7 1/1 Running 0 2m20s
nvidia-gpu-operator-node-feature-discovery-worker-pcxwq 1/1 Running 0 2m20s
nvidia-gpu-operator-node-feature-discovery-worker-s2jjn 1/1 Running 0 2m20s
nvidia-operator-validator-rwt6z 0/1 Init:0/4 0 117s
Here are the errors from the systemd containerd logs: https://gist.github.com/denissabramovs/a77e97972b5aa01c86955d812d3e8188
Here is the updated, latest one: https://gist.github.com/denissabramovs/2272051bb2f684f623cd15273ea6dd25
At least containerd is not constantly restarting now; it has already been up for 9 minutes:
● containerd.service - containerd container runtime
Loaded: loaded (/lib/systemd/system/containerd.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2022-11-02 18:57:44 EET; 9min ago
All 3 systemd services are up and running on the GPU node:
(base) beck@beck:/$ sudo systemctl status --no-pager kubelet containerd docker | grep active
Active: active (running) since Wed 2022-11-02 18:57:49 EET; 14min ago
Active: active (running) since Wed 2022-11-02 18:57:44 EET; 14min ago
Active: active (running) since Wed 2022-11-02 19:09:12 EET; 3min 11s ago
Sorry, I missed your message. Here it is:
(base) beck@beck:/$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
(base) beck@beck:/$ containerd --version
containerd containerd.io 1.6.9 1c90a442489720eec95342e1789ee8a5e1b9536f
@denissabramovs this is a wild guess: are you using containerd 1.6.9? I believe we had problems with this version and the operator. We downgraded to containerd 1.6.8 and things started working again.
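For anyone else hitting this, such a downgrade on Ubuntu would look roughly like the following (a sketch; the exact package version string depends on your repository, so list the available versions first):

```sh
# List the containerd.io versions available from the configured repositories.
apt-cache madison containerd.io
# Downgrade to a 1.6.8 build (the version string below is an example) and pin it
# so it is not upgraded again automatically.
sudo apt-get install --allow-downgrades containerd.io=1.6.8-1
sudo apt-mark hold containerd.io
sudo systemctl restart containerd
```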
revision=9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6 version=1.6.8:
nov 02 19:20:49 beck containerd[202761]: time="2022-11-02T19:20:49.723337417+02:00" level=info msg="starting containerd" revision=9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6 version=1.6.8
...
...
...
nov 02 19:22:34 beck containerd[202761]: time="2022-11-02T19:22:34.246953180+02:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:nvidia-device-plugin-daemonset-lbnzw,Uid:fd4f1d3f-29d2-4d11-a724-96f4ed107cd5,Namespace:gpu-operator,Attempt:0,} failed, error" error="failed to get sandbox runtime: no runtime for \"nvidia\" is configured"
Killed/re-scheduled all pods in the gpu-operator namespace after downgrading containerd.
Oh wow! @wjentner, you were actually right. I re-enabled the above-mentioned toolkit and, after the downgrade, it finished without problems; all pods are up and running now!
Good thing I captured both logs, @shivamerla; adding them below.
These logs are from failing toolkit: https://gist.github.com/denissabramovs/0c3ad150ea2b9450a91b430a91704d08
These from successful toolkit: https://gist.github.com/denissabramovs/343c8fb0169866133fa1cc35b9d5365c
Hope this helps find and resolve the problem. It seems the two logs are different after all.
Thanks @denissabramovs, will check these out and try to repro with containerd 1.6.9.
If you aren't able to reproduce it, please ping me and I'll try to reproduce it locally again. Then we could catch the issue and possibly put a patch together. In any case, thank you, guys.
Issue diagnosed and workaround MR can be found here: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/568
@klueska thanks! When will this be released? I assume it has also been tested with containerd 1.6.10, which was released recently?
Hi @denissabramovs @wjentner. We just released v22.9.1. This includes the workaround mentioned above for resolving the containerd issues. Please give it a try and let us know if there are any issues.
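For reference, upgrading an existing release to the new version would look roughly like this (the release name, repo alias, and namespace are assumptions):

```sh
# Upgrade an existing GPU Operator release to v22.9.1.
helm repo update
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --version v22.9.1
```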
Thanks @cdesiniotis, I can confirm that it works with containerd 1.6.12 as well. Edit: 1.6.14 is also working.
Hi @cdesiniotis @klueska
it seems I have exactly the same issue with:
OS: CentOS 7.9.2009
Kernel: 3.10.0-1160.76.1.el7.x86_64
Containerd: 1.6.9 & 1.6.14 (tested both)
GPU Operator: v22.9.1
My nvidia-driver-daemonset is looping. The module build seems OK and I see the modules appear in lsmod, but after a few seconds they disappear and everything restarts.
It fails after:
nvidia-driver-ctr Post-install sanity check passed.
nvidia-driver-ctr
nvidia-driver-ctr Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 525.60.13) is now complete.
nvidia-driver-ctr
nvidia-driver-ctr Parsing kernel module parameters...
nvidia-driver-ctr Loading ipmi and i2c_core kernel modules...
nvidia-driver-ctr Loading NVIDIA driver kernel modules...
nvidia-driver-ctr + modprobe nvidia
nvidia-driver-ctr + modprobe nvidia-uvm
nvidia-driver-ctr + modprobe nvidia-modeset
nvidia-driver-ctr + set +o xtrace -o nounset
nvidia-driver-ctr Starting NVIDIA persistence daemon...
nvidia-driver-ctr ls: cannot access /proc/driver/nvidia-nvswitch/devices/*: No such file or directory
nvidia-driver-ctr Mounting NVIDIA driver rootfs...
nvidia-driver-ctr Done, now waiting for signal
nvidia-driver-ctr Caught signal
nvidia-driver-ctr Stopping NVIDIA persistence daemon...
nvidia-driver-ctr Unloading NVIDIA driver kernel modules...
nvidia-driver-ctr Unmounting NVIDIA driver rootfs...
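One way to catch the moment the container receives that signal is to follow the driver container logs while it loops (a sketch; the label selector and container name are assumptions based on a default gpu-operator install):

```sh
# Stream the driver container logs from the looping daemonset pod.
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset \
  -c nvidia-driver-ctr -f --tail=100
```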
If I downgrade containerd to 1.6.8, everything is fixed.
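On CentOS 7 the downgrade would look roughly like this (a sketch; the exact package release string is an assumption, so list the available builds first):

```sh
# List the containerd.io builds available from the configured yum repositories.
yum --showduplicates list containerd.io
# Downgrade to a 1.6.8 build (the version string below is an example) and restart.
sudo yum downgrade -y containerd.io-1.6.8-3.1.el7
sudo systemctl restart containerd
```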
There is another issue with containerd: https://github.com/containerd/containerd/issues/7843
If containerd is restarted (version 1.6.9 and above), most pods are restarted, so together with the nvidia container toolkit pod they end up in an endless restart loop: the toolkit tries to restart containerd, which restarts the toolkit and driver, and everything loops again. There is a fix for containerd, but it may not have landed everywhere yet.
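A quick way to see whether a node is stuck in that loop (a sketch) is to count recent containerd starts and check systemd's restart counter:

```sh
# Count how often containerd has started in the last 30 minutes and how many
# times systemd has restarted the unit.
journalctl -u containerd --since "30 min ago" | grep -c "starting containerd"
systemctl show containerd -p NRestarts
```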
@tuxtof, I think you are hitting exactly this issue.
thanks @xhejtman for linking the relevant issue.
thanks @xhejtman
So what is the situation? Is the GPU Operator no longer working with containerd version 1.6.9 and above?