Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
1. Issue or feature description
I have created a multi-node k0s Kubernetes cluster by following this blog: https://www.padok.fr/en/blog/k0s-kubernetes-gpu
I am getting the same error as in the title: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured.
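One way to confirm which runtimes k0s's containerd actually exposes (this assumes crictl is installed on the worker; /run/k0s/containerd.sock is the same socket passed to the GPU operator below):
# Dump the CRI runtime config and look for the configured runtime handlers
sudo crictl --runtime-endpoint unix:///run/k0s/containerd.sock info | grep -A 10 '"runtimes"'
If "nvidia" does not show up there, containerd never loaded the custom config, and any pod that requests the nvidia runtime class fails with the sandbox error above.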
2. Steps to reproduce the issue
I have followed this blog https://www.padok.fr/en/blog/k0s-kubernetes-gpu
Download k0s binary
curl -L "https://github.com/k0sproject/k0s/releases/download/v1.24.4%2Bk0s.0/k0s-v1.24.4+k0s.0-amd64" -o /tmp/k0s
chmod +x /tmp/k0s
Download k0sctl binary
curl -L "https://github.com/k0sproject/k0sctl/releases/download/v0.13.2/k0sctl-linux-x64" -o /usr/local/bin/k0sctl
chmod +x /usr/local/bin/k0sctl
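A quick sanity check that both binaries run (both support a version subcommand):
/tmp/k0s version
k0sctl version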
Then create a k0sctl.yaml config file for a multi-node Kubernetes cluster:
k0sctl.yaml file
apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster
spec:
  hosts:
  - role: controller
    localhost:
      enabled: true
    files:
    - name: containerd-config
      src: /tmp/containerd.toml
      dstDir: /etc/k0s/
      perm: "0755"
      dirPerm: null
  - role: worker
    ssh:
      address: 43.88.62.134
      user: user
      keyPath: .ssh/id_rsa
    files:
    - name: containerd-config
      src: /tmp/containerd.toml
      dstDir: /etc/k0s/
      perm: "0755"
      dirPerm: null
  - role: worker
    ssh:
      address: 43.88.62.133
      user: user
      keyPath: .ssh/id_rsa
    files:
    - name: containerd-config
      src: /tmp/containerd.toml
      dstDir: /etc/k0s/
      perm: "0755"
      dirPerm: null
  k0s:
    version: 1.24.4+k0s.0
    config:
      spec:
        network:
          provider: calico
/tmp/containerd.toml file (pushed to /etc/k0s/ on each host by the files entries above)
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
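This file only points containerd at the NVIDIA runtime binary, so it is worth confirming that the binary referenced by BinaryName actually exists on each GPU worker:
# On each GPU worker node
command -v nvidia-container-runtime
ls -l /usr/bin/nvidia-container-runtime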
Then run the command: k0sctl apply --config /path/to/k0sctl.yaml
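After the apply finishes, the kubeconfig can be fetched with k0sctl and the uploaded file checked on a worker (paths taken from the files entries above):
k0sctl kubeconfig --config /path/to/k0sctl.yaml > kubeconfig
export KUBECONFIG=$PWD/kubeconfig
kubectl get nodes -o wide
# On each worker, confirm the config landed where containerd expects it
cat /etc/k0s/containerd.toml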
Deploy NVIDIA GPU Operator
values.yaml file
operator:
  defaultRuntime: containerd
toolkit:
  version: v1.10.0-ubuntu20.04
  env:
  - name: CONTAINERD_CONFIG
    value: /etc/k0s/containerd.toml
  - name: CONTAINERD_SOCKET
    value: /run/k0s/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"
driver:
  manager:
    image: k8s-driver-manager
    repository: nvcr.io/nvidia/cloud-native
    version: v0.4.0
    imagePullPolicy: IfNotPresent
    env:
    - name: ENABLE_AUTO_DRAIN
      value: "true"
    - name: DRAIN_USE_FORCE
      value: "true"
    - name: DRAIN_POD_SELECTOR_LABEL
      value: ""
    - name: DRAIN_TIMEOUT_SECONDS
      value: "0s"
    - name: DRAIN_DELETE_EMPTYDIR_DATA
      value: "true"
  repoConfig:
    configMapName: repo-config
  version: "495.29.05"
validator:
  version: "v1.11.0"
Install Helm
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
&& chmod 700 get_helm.sh \
&& ./get_helm.sh
Now, add the NVIDIA Helm repository:
helm repo add nvidia https://nvidia.github.io/gpu-operator \
&& helm repo update
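With the repo added, the keys used in values.yaml above can be cross-checked against the chart's defaults:
helm show values nvidia/gpu-operator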
helm install --wait --generate-name \
  nvidia/gpu-operator
helm upgrade --install --namespace=gpu-operator --create-namespace --wait --values=values.yaml gpu-operator nvidia/gpu-operator
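For reference, a pod that requests the nvidia runtime class is the simplest way to trigger the exact error from the title when containerd is missing the runtime (the sample image tag here is only an example; any CUDA image works):
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# The sandbox error, if any, shows up in the pod's events
kubectl get events --field-selector involvedObject.name=cuda-vectoradd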
1. Are drivers/container-toolkit pre-installed on the host or installed by the GPU operator?
- On both worker nodes the drivers and the container toolkit are pre-installed.
- On the controller node they are not installed, because it is a non-GPU machine.
2. OS version: Ubuntu 20.04.5 LTS
3. Status of all pods under gpu-operator namespace
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-jc4wt 0/1 Init:0/1 0 18h
gpu-feature-discovery-r27zv 0/1 Init:0/1 0 18h
gpu-operator-1673351272-node-feature-discovery-master-65d8hl88v 1/1 Running 0 18h
gpu-operator-1673351272-node-feature-discovery-worker-8j72k 1/1 Running 0 18h
gpu-operator-1673351272-node-feature-discovery-worker-wj5gd 1/1 Running 0 18h
gpu-operator-95b545d6f-r2cnp 1/1 Running 0 18h
nvidia-container-toolkit-daemonset-lg79g 1/1 Running 0 18h
nvidia-container-toolkit-daemonset-q26kq 1/1 Running 0 18h
nvidia-dcgm-exporter-2vpwj 0/1 Init:0/1 0 18h
nvidia-dcgm-exporter-gx6dv 0/1 Init:0/1 0 18h
nvidia-device-plugin-daemonset-tbbgb 0/1 Init:0/1 0 18h
nvidia-device-plugin-daemonset-z29kx 0/1 Init:0/1 0 18h
nvidia-operator-validator-79s4j 0/1 Init:0/4 0 18h
nvidia-operator-validator-thbq2 0/1 Init:0/4 0 18h
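The sandbox error from the title shows up in the events of the stuck pods, e.g. (pod name taken from the listing above):
kubectl -n gpu-operator describe pod nvidia-operator-validator-79s4j
kubectl -n gpu-operator get events --sort-by=.lastTimestamp | tail -n 20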
4. Logs from init-containers
From the device-plugin daemonset:
Error from server (BadRequest): container "toolkit-validation" in pod "nvidia-device-plugin-daemonset-tbbgb" is waiting to start: PodInitializing
From the container-toolkit daemonset:
time="2023-01-10T11:57:43Z" level=info msg="Successfully loaded config"
time="2023-01-10T11:57:43Z" level=info msg="Config version: 2"
time="2023-01-10T11:57:43Z" level=info msg="Updating config"
time="2023-01-10T11:57:43Z" level=info msg="Successfully updated config"
time="2023-01-10T11:57:43Z" level=info msg="Flushing config"
time="2023-01-10T11:57:43Z" level=info msg="Successfully flushed config"
time="2023-01-10T11:57:43Z" level=info msg="Sending SIGHUP signal to containerd"
time="2023-01-10T11:57:43Z" level=info msg="Successfully signaled containerd"
time="2023-01-10T11:57:43Z" level=info msg="Completed 'setup' for containerd"
time="2023-01-10T11:57:43Z" level=info msg="Waiting for signal"
(truncated nvidia-smi output; only the Processes section remains)
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
time="2023-01-10T11:51:53Z" level=info msg="Successfully loaded config"
time="2023-01-10T11:51:53Z" level=info msg="Config version: 2"
time="2023-01-10T11:51:53Z" level=info msg="Updating config"
time="2023-01-10T11:51:53Z" level=info msg="Successfully updated config"
time="2023-01-10T11:51:53Z" level=info msg="Flushing config"
time="2023-01-10T11:51:53Z" level=info msg="Successfully flushed config"
time="2023-01-10T11:51:53Z" level=info msg="Sending SIGHUP signal to containerd"
time="2023-01-10T11:51:53Z" level=info msg="Successfully signaled containerd"
time="2023-01-10T11:51:53Z" level=info msg="Completed 'setup' for containerd"
time="2023-01-10T11:51:53Z" level=info msg="Waiting for signal"
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1601 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 1601 G /usr/lib/xorg/Xorg 9MiB |
| 1 N/A N/A 1736 G /usr/bin/gnome-shell 8MiB |
+-----------------------------------------------------------------------------+
Hi @captainsk7! Have you managed to solve this issue?
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.
The k0s documentation has been updated to show how to correctly configure nvidia-container-runtime. See: https://docs.k0sproject.io/stable/runtime/?h=gpu+operator#using-nvidia-container-runtime
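For reference, a minimal sketch of the drop-in approach described in those docs (the file path and exact contents are assumptions here and should be verified against the linked page; this assumes the NVIDIA container toolkit is already installed on the worker):
# Create a containerd drop-in config on the GPU worker
sudo mkdir -p /etc/k0s/containerd.d
sudo tee /etc/k0s/containerd.d/nvidia.toml <<'EOF'
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
EOF
# Restart the k0s worker service (name may differ by install method) so containerd reloads its config
sudo systemctl restart k0sworker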
Given that a lot has changed over time, I would encourage you to try the latest version and see if that helps.
If you still have issues with the latest version of the NVIDIA GPU Operator, please feel free to open a new issue with updated details.