no runtime for "nvidia" is configured
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04
- Kernel Version: Kubernetes 1.24.14
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): Kops 1.24.1
- GPU Operator Version: 23.9.2
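For anyone trying to reproduce the environment, the details above can be collected with standard commands along these lines (nothing here is specific to this cluster):
kubectl version --short
kubectl get nodes -o wide        # shows OS image, kernel version and container runtime per node
helm list -n gpu-operator        # shows the installed gpu-operator chart version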
2. Issue or feature description
kubectl describe pod nvidia-device-plugin-daemonset-w72xb -n gpu-operator
....
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m11s default-scheduler Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-w72xb to i-071a4e5a302e4025b
Warning FailedCreatePodSandBox 12s (x10 over 2m11s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
The error says the "nvidia" runtime is not found, yet the RuntimeClass does exist:
kubectl get runtimeclasses.node.k8s.io
NAME HANDLER AGE
nvidia nvidia 7d1h
kubectl describe runtimeclasses.node.k8s.io nvidia
Name: nvidia
Namespace:
Labels: app.kubernetes.io/component=gpu-operator
Annotations: <none>
API Version: node.k8s.io/v1
Handler: nvidia
Kind: RuntimeClass
Metadata:
Creation Timestamp: 2024-05-27T08:53:18Z
Owner References:
API Version: nvidia.com/v1
Block Owner Deletion: true
Controller: true
Kind: ClusterPolicy
Name: cluster-policy
UID: 2c237c3d-07eb-4856-8316-046489793e3d
Resource Version: 265073642
UID: 26fd5054-7344-4e6d-9029-a610ae0df560
Events: <none>
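The RuntimeClass is only the Kubernetes-side object; its handler still has to be registered with containerd's CRI plugin on the node the pod is scheduled to. A quick way to check that on the node, assuming containerd uses its default config path and socket (which may not hold for every installer):
sudo crictl info | grep -B2 -A6 nvidia                      # the CRI plugin's view of configured runtime handlers
sudo grep -A6 'runtimes.nvidia' /etc/containerd/config.toml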
3. Steps to reproduce the issue
I installed the chart with helmfile
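For reference, a minimal helmfile sketch for this kind of install could look like the following; the repository URL is NVIDIA's public Helm repo, while the version pin and everything else are assumptions rather than the reporter's actual file:
repositories:
  - name: nvidia
    url: https://helm.ngc.nvidia.com/nvidia
releases:
  - name: gpu-operator
    namespace: gpu-operator
    chart: nvidia/gpu-operator
    version: v23.9.2   # matches the GPU Operator version reported above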
4. Information to attach (optional if deemed irrelevant)
kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-spbbk 0/1 Init:0/1 0 41s
gpu-operator-d97f85598-j7qt4 1/1 Running 0 7d1h
gpu-operator-node-feature-discovery-gc-84c477b7-67tk8 1/1 Running 0 6d20h
gpu-operator-node-feature-discovery-master-cb8bb7d48-x4hqj 1/1 Running 0 6d20h
gpu-operator-node-feature-discovery-worker-jfdsh 1/1 Running 0 85s
nvidia-container-toolkit-daemonset-vb6qn 0/1 Init:0/1 0 41s
nvidia-dcgm-exporter-9xmbm 0/1 Init:0/1 0 41s
nvidia-device-plugin-daemonset-w72xb 0/1 Init:0/1 0 41s
nvidia-driver-daemonset-v4n96 0/1 Running 0 73s
nvidia-operator-validator-vbq6v 0/1 Init:0/4 0 41s
kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
kubectl get ds -n gpu-operator
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 1 1 0 1 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 7d
gpu-operator-node-feature-discovery-worker 1 1 1 1 1 instance-type=gpu 6d20h
nvidia-container-toolkit-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.container-toolkit=true 7d
nvidia-dcgm-exporter 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm-exporter=true 7d
nvidia-device-plugin-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.device-plugin=true 7d
nvidia-device-plugin-mps-control-daemon 0 0 0 0 0 nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true 7d
nvidia-driver-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.driver=true 7d
nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 7d
nvidia-operator-validator 1 1 0 1 0 nvidia.com/gpu.deploy.operator-validator=true 7d
If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl describe pod nvidia-device-plugin-daemonset-w72xb -n gpu-operator
....
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m11s default-scheduler Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-w72xb to i-071a4e5a302e4025b
Warning FailedCreatePodSandBox 12s (x10 over 2m11s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
kubectl exec nvidia-driver-daemonset-v4n96 -n gpu-operator -c nvidia-driver-ctr -- nvidia-smi
Mon Jun 3 10:01:38 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 30C P8 10W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
@yanis-incepto the nvidia-container-toolkit-daemonset-vb6qn is stuck in the init state and has not yet configured the nvidia runtime in containerd. Could you provide the logs for the containers in this daemonset?
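Something along these lines should capture both containers of that pod (the container names are visible in the defaulting message in the logs below):
kubectl logs -n gpu-operator nvidia-container-toolkit-daemonset-vb6qn -c driver-validation            # init container
kubectl logs -n gpu-operator nvidia-container-toolkit-daemonset-vb6qn -c nvidia-container-toolkit-ctr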
nvidia-container-toolkit is finally running after some time, but the other pods still show the same error (and it never goes away; I tried leaving everything running for a few hours):
kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-t4bv8 0/1 Init:0/1 0 10m
gpu-operator-d97f85598-j7qt4 1/1 Running 0 7d1h
gpu-operator-node-feature-discovery-gc-84c477b7-67tk8 1/1 Running 0 6d21h
gpu-operator-node-feature-discovery-master-cb8bb7d48-x4hqj 1/1 Running 0 6d21h
gpu-operator-node-feature-discovery-worker-fcwh7 1/1 Running 0 10m
nvidia-container-toolkit-daemonset-gn495 1/1 Running 0 10m
nvidia-dcgm-exporter-wnhss 0/1 Init:0/1 0 10m
nvidia-device-plugin-daemonset-dwwqr 0/1 Init:0/1 0 10m
nvidia-driver-daemonset-p47wp 1/1 Running 0 10m
nvidia-operator-validator-zk4mv 0/1 Init:0/4 0 10m
As for its logs: it looks like it is waiting for a signal:
kubectl logs -n gpu-operator nvidia-container-toolkit-daemonset-gn495
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
time="2024-06-03T10:46:29Z" level=info msg="Parsing arguments"
time="2024-06-03T10:46:29Z" level=info msg="Starting nvidia-toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Verifying Flags"
time="2024-06-03T10:46:29Z" level=info msg=Initializing
time="2024-06-03T10:46:29Z" level=info msg="Installing toolkit"
time="2024-06-03T10:46:29Z" level=info msg="disabling device node creation since --cdi-enabled=false"
time="2024-06-03T10:46:29Z" level=info msg="Installing NVIDIA container toolkit to '/usr/local/nvidia/toolkit'"
time="2024-06-03T10:46:29Z" level=info msg="Removing existing NVIDIA container toolkit installation"
time="2024-06-03T10:46:29Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit'"
time="2024-06-03T10:46:29Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime'"
time="2024-06-03T10:46:29Z" level=info msg="Installing NVIDIA container library to '/usr/local/nvidia/toolkit'"
time="2024-06-03T10:46:29Z" level=info msg="Finding library libnvidia-container.so.1 (root=)"
time="2024-06-03T10:46:29Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container.so.1'"
time="2024-06-03T10:46:29Z" level=info msg="Skipping library candidate '/usr/lib64/libnvidia-container.so.1': error resolving link '/usr/lib64/libnvidia-container.so.1': lstat /usr/lib64/libnvidia-container.so.1: no such file or directory"
time="2024-06-03T10:46:29Z" level=info msg="Checking library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1'"
time="2024-06-03T10:46:29Z" level=info msg="Resolved link: '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1' => '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.15.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.15.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container.so.1' -> 'libnvidia-container.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Finding library libnvidia-container-go.so.1 (root=)"
time="2024-06-03T10:46:29Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container-go.so.1'"
time="2024-06-03T10:46:29Z" level=info msg="Skipping library candidate '/usr/lib64/libnvidia-container-go.so.1': error resolving link '/usr/lib64/libnvidia-container-go.so.1': lstat /usr/lib64/libnvidia-container-go.so.1: no such file or directory"
time="2024-06-03T10:46:29Z" level=info msg="Checking library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1'"
time="2024-06-03T10:46:29Z" level=info msg="Resolved link: '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1' => '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.15.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.15.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1' -> 'libnvidia-container-go.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime' to /usr/local/nvidia/toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime'"
time="2024-06-03T10:46:29Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime.cdi' to /usr/local/nvidia/toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime.cdi' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi'"
time="2024-06-03T10:46:29Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime.legacy' to /usr/local/nvidia/toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime.legacy' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy'"
time="2024-06-03T10:46:29Z" level=info msg="Installing NVIDIA container CLI from '/usr/bin/nvidia-container-cli'"
time="2024-06-03T10:46:29Z" level=info msg="Installing executable '/usr/bin/nvidia-container-cli' to /usr/local/nvidia/toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/bin/nvidia-container-cli' to '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-cli'"
time="2024-06-03T10:46:29Z" level=info msg="Installing NVIDIA container runtime hook from '/usr/bin/nvidia-container-runtime-hook'"
time="2024-06-03T10:46:29Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime-hook' to /usr/local/nvidia/toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime-hook' to '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook'"
time="2024-06-03T10:46:29Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/nvidia-container-toolkit' -> 'nvidia-container-runtime-hook'"
time="2024-06-03T10:46:29Z" level=info msg="Installing executable '/usr/bin/nvidia-ctk' to /usr/local/nvidia/toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/bin/nvidia-ctk' to '/usr/local/nvidia/toolkit/nvidia-ctk.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-ctk.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-ctk'"
time="2024-06-03T10:46:29Z" level=info msg="Installing NVIDIA container toolkit config '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml'"
time="2024-06-03T10:46:29Z" level=info msg="Skipping unset option: nvidia-container-runtime.modes.cdi.annotation-prefixes"
time="2024-06-03T10:46:29Z" level=info msg="Skipping unset option: nvidia-container-runtime.runtimes"
time="2024-06-03T10:46:29Z" level=info msg="Skipping unset option: nvidia-container-cli.debug"
time="2024-06-03T10:46:29Z" level=info msg="Skipping unset option: nvidia-container-runtime.debug"
time="2024-06-03T10:46:29Z" level=info msg="Skipping unset option: nvidia-container-runtime.log-level"
time="2024-06-03T10:46:29Z" level=info msg="Skipping unset option: nvidia-container-runtime.mode"
Using config:
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
[nvidia-container-cli]
environment = []
ldconfig = "@/run/nvidia/driver/sbin/ldconfig.real"
load-kmods = true
path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
root = "/run/nvidia/driver"
[nvidia-container-runtime]
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc", "crun"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "management.nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
[nvidia-container-runtime-hook]
path = "/usr/local/nvidia/toolkit/nvidia-container-runtime-hook"
skip-mode-detection = true
[nvidia-ctk]
path = "/usr/local/nvidia/toolkit/nvidia-ctk"
time="2024-06-03T10:46:29Z" level=info msg="Setting up runtime"
time="2024-06-03T10:46:29Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2024-06-03T10:46:29Z" level=info msg="Successfully parsed arguments"
time="2024-06-03T10:46:29Z" level=info msg="Starting 'setup' for containerd"
time="2024-06-03T10:46:29Z" level=info msg="Config file does not exist; using empty config"
time="2024-06-03T10:46:29Z" level=info msg="Flushing config to /runtime/config-dir/config.toml"
time="2024-06-03T10:46:29Z" level=info msg="Sending SIGHUP signal to containerd"
time="2024-06-03T10:46:29Z" level=info msg="Successfully signaled containerd"
time="2024-06-03T10:46:29Z" level=info msg="Completed 'setup' for containerd"
time="2024-06-03T10:46:29Z" level=info msg="Waiting for signal"
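One thing the log above hints at: the "Config file does not exist; using empty config" line suggests the host path mounted at /runtime/config-dir held no existing containerd config, which can happen when the toolkit is pointed at a different file than the one containerd actually loads. The GPU Operator exposes these paths through the toolkit container's environment; a hedged sketch of the Helm values involved (the values shown are containerd's defaults, not necessarily correct for a kops-managed node):
toolkit:
  env:
    - name: CONTAINERD_CONFIG
      value: /etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/containerd/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia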
Please restart the nvidia-operator-validator-zk4mv pod first. If it proceeds past init, then restart the other pods too.
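For example (the DaemonSet recreates the pod automatically):
kubectl delete pod nvidia-operator-validator-zk4mv -n gpu-operator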
I just recreated the pod; still the same issue.
Is there a compatibility table for gpu-operator? Maybe the latest version is not compatible with Kubernetes 1.24.14?
I had this issue when my /etc/containerd/config.toml was incorrect (it was missing the default runc runtime). This is what it looks like now on each node:
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            BinaryName = "/usr/bin/runc"
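After editing the file, containerd has to pick the change up; assuming a systemd-managed containerd, something like:
sudo systemctl restart containerd
sudo crictl info | grep -A6 nvidia   # confirm the handler is now visible to the CRI plugin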
Hello, thanks for your help, but unfortunately I just tried this and it didn't work.
@yanis-incepto can you share the contents of your /etc/containerd/config.toml file?
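While at it, it is also worth confirming which config file the node's containerd is actually started with, since some installers (kops among them) manage their own containerd configuration and may pass a non-default path via the --config flag; for example:
systemctl cat containerd          # the unit file shows the ExecStart line and any --config flag
ps aux | grep [c]ontainerd        # the running command line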
Hello
I have the exact same issue. My cluster is based on k0s, the Kubernetes version is 1.32.4, and the nvidia-smi output is:
kubectl exec -it nvidia-driver-daemonset-68thz -- nvidia-smi
Thu Jul 10 07:54:35 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000001:00:00.0 Off | 0 |
| N/A 31C P0 42W / 300W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.
This issue has been open for a long time without recent updates, and the context may now be outdated. More details were requested in https://github.com/NVIDIA/gpu-operator/issues/730#issuecomment-2224096611 but there has been no update since then. Hence, closing this issue.
@behnm I would suggest following the latest procedure for installing GPU Operator 25.10.0 with k0s: https://catalog.k0rdent.io/latest/apps/nvidia/#install. If you are still experiencing problems, please file a new issue.
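For k0s specifically, containerd does not use the default socket or config path, so the toolkit typically has to be pointed at k0s's locations; a hedged values sketch (paths taken from k0s's documented containerd layout, please verify them against the procedure linked above):
toolkit:
  env:
    - name: CONTAINERD_CONFIG
      value: /etc/k0s/containerd.d/nvidia.toml   # k0s imports drop-in configs from this directory
    - name: CONTAINERD_SOCKET
      value: /run/k0s/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia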