nvidia-container-toolkit-daemonset failing on GKE
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
Describe the bug
- nvidia-container-toolkit-daemonset continuously prints "failed to validate the driver, retrying after 5 seconds" and never becomes ready (see the command sketch below for how this shows up).
- nvidia-dcgm-exporter, nvidia-device-plugin-daemonset, nvidia-operator-validator, and gpu-feature-discovery all fail to start their sandboxes with: no runtime for "nvidia" is configured
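A minimal sketch of commands that surface these symptoms (label, namespace, and init-container names are taken from the describe output further below):
# follow the repeating validation message from the toolkit daemonset's init container
kubectl logs -n gpu-operator -l app=nvidia-container-toolkit-daemonset -c driver-validation -f --tail=20
# list the sandbox-creation failures hitting the other operand pods
kubectl get events -n gpu-operator --field-selector reason=FailedCreatePodSandBox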
To Reproduce
Install gpu-operator on GKE following the official guide, using the COS_CONTAINERD image option. I ran the equivalent of the following steps:
helm install --wait --generate-name \
-n gpu-operator \
nvidia/gpu-operator \
--set hostPaths.driverInstallDir=/home/kubernetes/bin/nvidia \
--set toolkit.installDir=/home/kubernetes/bin/nvidia \
--set cdi.enabled=true \
--set cdi.default=true \
--set driver.enabled=false
There is one nvidia-l4 node (g2 machine type) in the cluster.
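For completeness, this assumes the chart comes from the standard NVIDIA Helm repository, added beforehand with the usual commands (adjust if a mirror is used):
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update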
Expected behavior
The gpu-operator should install successfully and all operand pods should become Ready.
Environment (please provide the following information):
- GPU Operator Version: v25.3.1
- OS: COS_CONTAINERD (latest on GKE)
- Kernel Version: (not captured; see the command below)
- Container Runtime Version: (not captured; see the command below)
- Kubernetes Distro and Version: GKE
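The missing kernel and container runtime versions can be read straight from the node object if needed (node name taken from the describe output below):
kubectl get node gke-k8s-dev-np-devgpu-559f4a6b-njxh -o wide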
Information to attach (optional if deemed irrelevant)
- [ ] kubernetes pods status:
kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-wqwlz 0/1 Init:0/1 0 9m48s
gpu-operator-857fc9cf65-s4ssn 1/1 Running 0 12m
gpu-operator-node-feature-discovery-gc-86f6495b55-vx7jk 1/1 Running 0 12m
gpu-operator-node-feature-discovery-master-694467d5db-24xsg 1/1 Running 0 58m
gpu-operator-node-feature-discovery-worker-9s9hf 1/1 Running 0 58m
gpu-operator-node-feature-discovery-worker-dsqf4 1/1 Running 0 58m
gpu-operator-node-feature-discovery-worker-g7khm 1/1 Running 0 9m59s
gpu-operator-node-feature-discovery-worker-hs2k2 1/1 Running 0 58m
nvidia-container-toolkit-daemonset-gzgns 0/1 Init:0/1 0 9m48s
nvidia-dcgm-exporter-9mvgc 0/1 Init:0/1 0 9m48s
nvidia-device-plugin-daemonset-rqfjt 0/1 Init:0/1 0 9m48s
nvidia-operator-validator-7rtxz 0/1 Init:0/4 0 9m48s
- [ ] kubernetes daemonset status:
kubectl get ds -n gpu-operator
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 1 1 0 1 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 58m
gpu-operator-node-feature-discovery-worker 4 4 4 4 4 <none> 58m
nvidia-container-toolkit-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.container-toolkit=true 58m
nvidia-dcgm-exporter 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm-exporter=true 58m
nvidia-device-plugin-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.device-plugin=true 58m
nvidia-device-plugin-mps-control-daemon 0 0 0 0 0 nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true 58m
nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 58m
nvidia-operator-validator 1 1 0 1 0 nvidia.com/gpu.deploy.operator-validator=true 58m
- [ ] If a pod/ds is in an error state or pending state
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl describe pod -n gpu-operator nvidia-container-toolkit-daemonset-gzgns
Name: nvidia-container-toolkit-daemonset-gzgns
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-container-toolkit
Node: gke-k8s-dev-np-devgpu-559f4a6b-njxh/10.0.10.8
Start Time: Tue, 09 Sep 2025 11:46:10 +0900
Labels: app=nvidia-container-toolkit-daemonset
app.kubernetes.io/managed-by=gpu-operator
controller-revision-hash=799d8dbb6f
helm.sh/chart=gpu-operator-v25.3.1
pod-template-generation=1
Annotations: <none>
Status: Pending
IP: 10.100.3.5
IPs:
IP: 10.100.3.5
Controlled By: DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
driver-validation:
Container ID: containerd://a176cc18944b8f6d627e4c1ef368998718d3a70d5f8299d36caa573f90c8d040
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.1
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:0b6f1944b05254ce50a08d44ca0d23a40f254fb448255a9234c43dec44e6929c
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Running
Started: Tue, 09 Sep 2025 11:46:41 +0900
Ready: False
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: driver
OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-dir (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4bnz9 (ro)
Containers:
nvidia-container-toolkit-ctr:
Container ID:
Image: nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04
Image ID:
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
Args:
/bin/entrypoint.sh
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
ROOT: /home/kubernetes/bin/nvidia
NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND: management.nvidia.com/gpu
NVIDIA_VISIBLE_DEVICES: void
TOOLKIT_PID_FILE: /run/nvidia/toolkit/toolkit.pid
CDI_ENABLED: true
NVIDIA_CONTAINER_RUNTIME_MODES_CDI_ANNOTATION_PREFIXES: nvidia.cdi.k8s.io/
CRIO_CONFIG_MODE: config
NVIDIA_CONTAINER_RUNTIME_MODE: cdi
RUNTIME: containerd
CONTAINERD_RUNTIME_CLASS: nvidia
RUNTIME_CONFIG: /runtime/config-dir/config.toml
CONTAINERD_CONFIG: /runtime/config-dir/config.toml
RUNTIME_SOCKET: /runtime/sock-dir/containerd.sock
CONTAINERD_SOCKET: /runtime/sock-dir/containerd.sock
Mounts:
/bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
/driver-root from driver-install-dir (rw)
/home/kubernetes/bin/nvidia from toolkit-install-dir (rw)
/host from host-root (ro)
/run/nvidia/toolkit from toolkit-root (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/runtime/config-dir/ from containerd-config (rw)
/runtime/sock-dir/ from containerd-socket (rw)
/usr/share/containers/oci/hooks.d from crio-hooks (rw)
/var/run/cdi from cdi-root (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4bnz9 (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
nvidia-container-toolkit-entrypoint:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: nvidia-container-toolkit-entrypoint
Optional: false
toolkit-root:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/toolkit
HostPathType: DirectoryOrCreate
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-dir:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType: DirectoryOrCreate
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
toolkit-install-dir:
Type: HostPath (bare host directory volume)
Path: /home/kubernetes/bin/nvidia
HostPathType:
crio-hooks:
Type: HostPath (bare host directory volume)
Path: /run/containers/oci/hooks.d
HostPathType:
host-dev-char:
Type: HostPath (bare host directory volume)
Path: /dev/char
HostPathType:
cdi-root:
Type: HostPath (bare host directory volume)
Path: /var/run/cdi
HostPathType: DirectoryOrCreate
containerd-config:
Type: HostPath (bare host directory volume)
Path: /etc/containerd
HostPathType: DirectoryOrCreate
containerd-socket:
Type: HostPath (bare host directory volume)
Path: /run/containerd
HostPathType:
kube-api-access-4bnz9:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
Optional: false
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.container-toolkit=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 12m default-scheduler Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-gzgns to gke-k8s-dev-np-devgpu-559f4a6b-njxh
Normal Pulling 12m kubelet Pulling image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.1"
Normal Pulled 11m kubelet Successfully pulled image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.1" in 18.211s (30.231s including waiting). Image size: 188844071 bytes.
Normal Created 11m kubelet Created container: driver-validation
Normal Started 11m kubelet Started container driver-validation
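The pod never leaves its driver-validation init container, so the main toolkit container that would configure containerd never starts. If it helps, some follow-up checks can be run (assuming kubectl debug node is available; the node filesystem is mounted at /host inside the debug pod):
# exact validation failure from the init container
kubectl logs -n gpu-operator nvidia-container-toolkit-daemonset-gzgns -c driver-validation --tail=20
# is the driver actually present under the configured install dir and under the default /run/nvidia/driver path?
kubectl debug node/gke-k8s-dev-np-devgpu-559f4a6b-njxh -it --image=ubuntu -- ls /host/home/kubernetes/bin/nvidia /host/run/nvidia/driver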
kubectl describe pod -n gpu-operator gpu-feature-discovery-wqwlz
Name: gpu-feature-discovery-wqwlz
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Runtime Class Name: nvidia
Service Account: nvidia-gpu-feature-discovery
Node: gke-k8s-dev-np-devgpu-559f4a6b-njxh/10.0.10.8
Start Time: Tue, 09 Sep 2025 11:46:09 +0900
Labels: app=gpu-feature-discovery
app.kubernetes.io/managed-by=gpu-operator
app.kubernetes.io/part-of=nvidia-gpu
controller-revision-hash=65b59b8987
helm.sh/chart=gpu-operator-v25.3.1
pod-template-generation=1
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: DaemonSet/gpu-feature-discovery
Init Containers:
toolkit-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.1
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/run/nvidia from run-nvidia (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qflk8 (ro)
Containers:
gpu-feature-discovery:
Container ID:
Image: nvcr.io/nvidia/k8s-device-plugin:v0.17.2
Image ID:
Port: <none>
Host Port: <none>
Command:
gpu-feature-discovery
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
GFD_SLEEP_INTERVAL: 60s
GFD_FAIL_ON_INIT_ERROR: true
NAMESPACE: gpu-operator (v1:metadata.namespace)
NODE_NAME: (v1:spec.nodeName)
MIG_STRATEGY: single
NVIDIA_MIG_MONITOR_DEVICES: all
Mounts:
/etc/kubernetes/node-feature-discovery/features.d from output-dir (rw)
/sys from host-sys (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qflk8 (ro)
Conditions:
Type Status
PodReadyToStartContainers False
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
output-dir:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/node-feature-discovery/features.d
HostPathType:
host-sys:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType:
run-nvidia:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: Directory
kube-api-access-qflk8:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
Optional: false
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.gpu-feature-discovery=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 12m default-scheduler Successfully assigned gpu-operator/gpu-feature-discovery-wqwlz to gke-k8s-dev-np-devgpu-559f4a6b-njxh
Warning FailedCreatePodSandBox 12m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "b240c80a09c66485639216f16747add301ee0bdc86586360713c113d5fe1266f": no runtime for "nvidia" is configured
Warning FailedCreatePodSandBox 11m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "f4e8123c725f79bc1ae88cec0be73549b2ab00ea5019f9cbfa240de990e127f4": no runtime for "nvidia" is configured
Warning FailedCreatePodSandBox 11m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "5d2fc101c82bf418fed27c18ff3eed4da44bc6a0bb7a1b66e9fb2ddbac5d27e8": no runtime for "nvidia" is configured
Warning FailedCreatePodSandBox 11m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "4276d02583b28af409923a1acd6f41f521c4c0730ca3f7cf14521fc658a9c7cd": no runtime for "nvidia" is configured
Warning FailedCreatePodSandBox 11m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "96811a86c0ecce722d3f1af9ac34dba5b0463f63897ee9e224ae87c2e69bb41e": no runtime for "nvidia" is configured
Warning FailedCreatePodSandBox 10m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "9ace2eb30abdd3a5c9495a07d2243dfd4eead358cdaf01c3523e960d24aea973": no runtime for "nvidia" is configured
Warning FailedCreatePodSandBox 10m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "7caa1b36eabe4689057b0e93e7f9b2c356ab69035334c921e7b76da6b8bf86af": no runtime for "nvidia" is configured
Warning FailedCreatePodSandBox 10m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "75c01bee6582dc20b3d4e2b98be70201f0e99eb46318c848eab65d3581133d6e": no runtime for "nvidia" is configured
Warning FailedCreatePodSandBox 10m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "d8c7ef3c5aa0ecd9b8244aadd5e612e1fb384484ea338eb238c4b92dbc7f9f53": no runtime for "nvidia" is configured
Warning FailedCreatePodSandBox 2m2s (x37 over 10m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "fa1e84d3cda485a1ec229f668fb0fbfaa7d61bb8f4ba619ee0002443de5ba11e": no runtime for "nvidia" is configured
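These sandbox failures match the "no runtime for nvidia is configured" symptom above, presumably because the toolkit daemonset that would register the runtime is still stuck in init. To check the node's containerd config directly (assuming it lives at /etc/containerd/config.toml, as the containerd-config hostPath above suggests):
kubectl debug node/gke-k8s-dev-np-devgpu-559f4a6b-njxh -it --image=ubuntu -- grep -A4 nvidia /host/etc/containerd/config.toml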
- [ ] If a pod/ds is in an error state or pending state
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- [ ] Output from running nvidia-smi from the driver container:
kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- [ ] containerd logs
journalctl -u containerd > containerd.log
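One way to collect the containerd logs from the COS node (ZONE is a placeholder; the GKE node name matches the GCE instance name):
gcloud compute ssh gke-k8s-dev-np-devgpu-559f4a6b-njxh --zone=ZONE --command='sudo journalctl -u containerd' > containerd.log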
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]