
nvidia-container-toolkit-daemonset failing on GKE

Open jammy-d opened this issue 3 months ago • 1 comment

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

Describe the bug

  • nvidia-container-toolkit-daemonset continuously prints "failed to validate the driver, retrying after 5 seconds" and never becomes ready.
  • nvidia-dcgm-exporter, nvidia-device-plugin-daemonset, nvidia-operator-validator, and gpu-feature-discovery all fail with no runtime for "nvidia" is configured (see the containerd check sketched below).
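
For anyone debugging the same symptom, a minimal sketch of how to confirm whether containerd on the GPU node actually has an nvidia runtime handler registered (this assumes kubectl debug node access works in your cluster; the node name is taken from the pod output further down):

# Open a debug shell on the node; the host filesystem is mounted at /host
kubectl debug node/gke-k8s-dev-np-devgpu-559f4a6b-njxh -it --image=busybox
# Inside the debug pod, look for the runtime registration the error says is missing;
# a healthy toolkit install writes a section like:
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
grep -A3 'runtimes.nvidia' /host/etc/containerd/config.toml

If the grep returns nothing, the toolkit container never reached the step that patches config.toml, which is consistent with the stuck driver-validation init container shown below.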

To Reproduce: Install gpu-operator on GKE following the guide, using the COS_CONTAINERD image option.

I'm running the equivalent of these steps:

helm install --wait --generate-name \
  -n gpu-operator \
  nvidia/gpu-operator \
  --set hostPaths.driverInstallDir=/home/kubernetes/bin/nvidia \
  --set toolkit.installDir=/home/kubernetes/bin/nvidia \
  --set cdi.enabled=true \
  --set cdi.default=true \
  --set driver.enabled=false

There is one nvidia-l4 (g2 machine type) Node in the cluster
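
To rule out a labeling problem, the node's selector labels can be checked with a sketch like the following (cloud.google.com/gke-accelerator is the standard GKE accelerator label; the nvidia.com/gpu.deploy.* keys are applied by the operator):

kubectl get nodes -L cloud.google.com/gke-accelerator,nvidia.com/gpu.deploy.container-toolkit,nvidia.com/gpu.deploy.device-plugin

The daemonset status below shows DESIRED=1 for the toolkit and device plugin, so the labels are present; the pods schedule but never finish initializing.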

Expected behavior: The gpu-operator should install successfully.

Environment (please provide the following information):

  • GPU Operator Version: v25.3.1
  • OS: COS_CONTAINERD (latest on GKE)
  • Kernel Version:
  • Container Runtime Version:
  • Kubernetes Distro and Version: GKE

Information to attach (optional if deemed irrelevant)

  • [ ] kubernetes pods status: kubectl get pods -n gpu-operator
NAME                                                          READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-wqwlz                                   0/1     Init:0/1   0          9m48s
gpu-operator-857fc9cf65-s4ssn                                 1/1     Running    0          12m
gpu-operator-node-feature-discovery-gc-86f6495b55-vx7jk       1/1     Running    0          12m
gpu-operator-node-feature-discovery-master-694467d5db-24xsg   1/1     Running    0          58m
gpu-operator-node-feature-discovery-worker-9s9hf              1/1     Running    0          58m
gpu-operator-node-feature-discovery-worker-dsqf4              1/1     Running    0          58m
gpu-operator-node-feature-discovery-worker-g7khm              1/1     Running    0          9m59s
gpu-operator-node-feature-discovery-worker-hs2k2              1/1     Running    0          58m
nvidia-container-toolkit-daemonset-gzgns                      0/1     Init:0/1   0          9m48s
nvidia-dcgm-exporter-9mvgc                                    0/1     Init:0/1   0          9m48s
nvidia-device-plugin-daemonset-rqfjt                          0/1     Init:0/1   0          9m48s
nvidia-operator-validator-7rtxz                               0/1     Init:0/4   0          9m48s
  • [ ] kubernetes daemonset status: kubectl get ds -n gpu-operator
NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                          AGE
gpu-feature-discovery                        1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true                       58m
gpu-operator-node-feature-discovery-worker   4         4         4       4            4           <none>                                                                 58m
nvidia-container-toolkit-daemonset           1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true                           58m
nvidia-dcgm-exporter                         1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true                               58m
nvidia-device-plugin-daemonset               1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true                               58m
nvidia-device-plugin-mps-control-daemon      0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true   58m
nvidia-mig-manager                           0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                 58m
nvidia-operator-validator                    1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true                          58m
  • [ ] If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl describe pod -n gpu-operator nvidia-container-toolkit-daemonset-gzgns
Name:                 nvidia-container-toolkit-daemonset-gzgns
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-container-toolkit
Node:                 gke-k8s-dev-np-devgpu-559f4a6b-njxh/10.0.10.8
Start Time:           Tue, 09 Sep 2025 11:46:10 +0900
Labels:               app=nvidia-container-toolkit-daemonset
                      app.kubernetes.io/managed-by=gpu-operator
                      controller-revision-hash=799d8dbb6f
                      helm.sh/chart=gpu-operator-v25.3.1
                      pod-template-generation=1
Annotations:          <none>
Status:               Pending
IP:                   10.100.3.5
IPs:
  IP:           10.100.3.5
Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:  containerd://a176cc18944b8f6d627e4c1ef368998718d3a70d5f8299d36caa573f90c8d040
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.1
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:0b6f1944b05254ce50a08d44ca0d23a40f254fb448255a9234c43dec44e6929c
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Running
      Started:      Tue, 09 Sep 2025 11:46:41 +0900
    Ready:          False
    Restart Count:  0
    Environment:
      WITH_WAIT:           true
      COMPONENT:           driver
      OPERATOR_NAMESPACE:  gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-dir (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4bnz9 (ro)
Containers:
  nvidia-container-toolkit-ctr:
    Container ID:  
    Image:         nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
    Args:
      /bin/entrypoint.sh
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      ROOT:                                                    /home/kubernetes/bin/nvidia
      NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND:         management.nvidia.com/gpu
      NVIDIA_VISIBLE_DEVICES:                                  void
      TOOLKIT_PID_FILE:                                        /run/nvidia/toolkit/toolkit.pid
      CDI_ENABLED:                                             true
      NVIDIA_CONTAINER_RUNTIME_MODES_CDI_ANNOTATION_PREFIXES:  nvidia.cdi.k8s.io/
      CRIO_CONFIG_MODE:                                        config
      NVIDIA_CONTAINER_RUNTIME_MODE:                           cdi
      RUNTIME:                                                 containerd
      CONTAINERD_RUNTIME_CLASS:                                nvidia
      RUNTIME_CONFIG:                                          /runtime/config-dir/config.toml
      CONTAINERD_CONFIG:                                       /runtime/config-dir/config.toml
      RUNTIME_SOCKET:                                          /runtime/sock-dir/containerd.sock
      CONTAINERD_SOCKET:                                       /runtime/sock-dir/containerd.sock
    Mounts:
      /bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
      /driver-root from driver-install-dir (rw)
      /home/kubernetes/bin/nvidia from toolkit-install-dir (rw)
      /host from host-root (ro)
      /run/nvidia/toolkit from toolkit-root (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /runtime/config-dir/ from containerd-config (rw)
      /runtime/sock-dir/ from containerd-socket (rw)
      /usr/share/containers/oci/hooks.d from crio-hooks (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4bnz9 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  nvidia-container-toolkit-entrypoint:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nvidia-container-toolkit-entrypoint
    Optional:  false
  toolkit-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/toolkit
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  toolkit-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /home/kubernetes/bin/nvidia
    HostPathType:  
  crio-hooks:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containers/oci/hooks.d
    HostPathType:  
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:  
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  containerd-config:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/containerd
    HostPathType:  DirectoryOrCreate
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containerd
    HostPathType:  
  kube-api-access-4bnz9:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.container-toolkit=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  12m   default-scheduler  Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-gzgns to gke-k8s-dev-np-devgpu-559f4a6b-njxh
  Normal  Pulling    12m   kubelet            Pulling image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.1"
  Normal  Pulled     11m   kubelet            Successfully pulled image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.1" in 18.211s (30.231s including waiting). Image size: 188844071 bytes.
  Normal  Created    11m   kubelet            Created container: driver-validation
  Normal  Started    11m   kubelet            Started container driver-validation
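
The "failed to validate the driver, retrying after 5 seconds" loop quoted at the top comes from this driver-validation init container; its output can be pulled directly (a sketch, using the pod name above):

kubectl logs -n gpu-operator nvidia-container-toolkit-daemonset-gzgns -c driver-validation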
kubectl describe pod -n gpu-operator gpu-feature-discovery-wqwlz
Name:                 gpu-feature-discovery-wqwlz
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Runtime Class Name:   nvidia
Service Account:      nvidia-gpu-feature-discovery
Node:                 gke-k8s-dev-np-devgpu-559f4a6b-njxh/10.0.10.8
Start Time:           Tue, 09 Sep 2025 11:46:09 +0900
Labels:               app=gpu-feature-discovery
                      app.kubernetes.io/managed-by=gpu-operator
                      app.kubernetes.io/part-of=nvidia-gpu
                      controller-revision-hash=65b59b8987
                      helm.sh/chart=gpu-operator-v25.3.1
                      pod-template-generation=1
Annotations:          <none>
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        DaemonSet/gpu-feature-discovery
Init Containers:
  toolkit-validation:
    Container ID:  
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.1
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /run/nvidia from run-nvidia (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qflk8 (ro)
Containers:
  gpu-feature-discovery:
    Container ID:  
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.17.2
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      gpu-feature-discovery
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      GFD_SLEEP_INTERVAL:          60s
      GFD_FAIL_ON_INIT_ERROR:      true
      NAMESPACE:                   gpu-operator (v1:metadata.namespace)
      NODE_NAME:                    (v1:spec.nodeName)
      MIG_STRATEGY:                single
      NVIDIA_MIG_MONITOR_DEVICES:  all
    Mounts:
      /etc/kubernetes/node-feature-discovery/features.d from output-dir (rw)
      /sys from host-sys (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qflk8 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  output-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/node-feature-discovery/features.d
    HostPathType:  
  host-sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  Directory
  kube-api-access-qflk8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.gpu-feature-discovery=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason                  Age                  From               Message
  ----     ------                  ----                 ----               -------
  Normal   Scheduled               12m                  default-scheduler  Successfully assigned gpu-operator/gpu-feature-discovery-wqwlz to gke-k8s-dev-np-devgpu-559f4a6b-njxh
  Warning  FailedCreatePodSandBox  12m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "b240c80a09c66485639216f16747add301ee0bdc86586360713c113d5fe1266f": no runtime for "nvidia" is configured
  Warning  FailedCreatePodSandBox  11m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "f4e8123c725f79bc1ae88cec0be73549b2ab00ea5019f9cbfa240de990e127f4": no runtime for "nvidia" is configured
  Warning  FailedCreatePodSandBox  11m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "5d2fc101c82bf418fed27c18ff3eed4da44bc6a0bb7a1b66e9fb2ddbac5d27e8": no runtime for "nvidia" is configured
  Warning  FailedCreatePodSandBox  11m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "4276d02583b28af409923a1acd6f41f521c4c0730ca3f7cf14521fc658a9c7cd": no runtime for "nvidia" is configured
  Warning  FailedCreatePodSandBox  11m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "96811a86c0ecce722d3f1af9ac34dba5b0463f63897ee9e224ae87c2e69bb41e": no runtime for "nvidia" is configured
  Warning  FailedCreatePodSandBox  10m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "9ace2eb30abdd3a5c9495a07d2243dfd4eead358cdaf01c3523e960d24aea973": no runtime for "nvidia" is configured
  Warning  FailedCreatePodSandBox  10m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "7caa1b36eabe4689057b0e93e7f9b2c356ab69035334c921e7b76da6b8bf86af": no runtime for "nvidia" is configured
  Warning  FailedCreatePodSandBox  10m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "75c01bee6582dc20b3d4e2b98be70201f0e99eb46318c848eab65d3581133d6e": no runtime for "nvidia" is configured
  Warning  FailedCreatePodSandBox  10m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "d8c7ef3c5aa0ecd9b8244aadd5e612e1fb384484ea338eb238c4b92dbc7f9f53": no runtime for "nvidia" is configured
  Warning  FailedCreatePodSandBox  2m2s (x37 over 10m)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "fa1e84d3cda485a1ec229f668fb0fbfaa7d61bb8f4ba619ee0002443de5ba11e": no runtime for "nvidia" is configured
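
Note the chain here: the toolkit-validation init container above waits in an until [ -f /run/nvidia/validations/toolkit-ready ] loop, and that file is only written once the toolkit daemonset finishes, which in turn is blocked on driver validation. A sketch to inspect the gate files on the node (again assuming kubectl debug node access, with the host mounted at /host):

kubectl debug node/gke-k8s-dev-np-devgpu-559f4a6b-njxh -it --image=busybox -- ls -l /host/run/nvidia/validations

An empty directory here would confirm that no component has passed validation yet.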
  • [ ] If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • [ ] Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi (not directly applicable here since driver.enabled=false; see the note after this list)
  • [ ] containerd logs journalctl -u containerd > containerd.log
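
Regarding the nvidia-smi item above: with driver.enabled=false there is no nvidia-driver-ctr pod to exec into. The closest equivalent (an assumption based on where the GKE driver installer places its binaries, the same path passed to hostPaths.driverInstallDir) would be running it via the host:

kubectl debug node/gke-k8s-dev-np-devgpu-559f4a6b-njxh -it --image=busybox -- chroot /host /home/kubernetes/bin/nvidia/bin/nvidia-smi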

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]
