
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

Open · captainsk7 opened this issue 2 years ago

1. Issue or feature description

I have created a multi-node k0s Kubernetes cluster by following this blog: https://www.padok.fr/en/blog/k0s-kubernetes-gpu. I am now getting the error from the title: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured.

2. Steps to reproduce the issue

I have followed this blog https://www.padok.fr/en/blog/k0s-kubernetes-gpu

Download k0s binary

curl -L "https://github.com/k0sproject/k0s/releases/download/v1.24.4%2Bk0s.0/k0s-v1.24.4+k0s.0-amd64" -o /tmp/k0s
chmod +x /tmp/k0s

Download k0sctl binary

curl -L "https://github.com/k0sproject/k0sctl/releases/download/v0.13.2/k0sctl-linux-x64" -o /usr/local/bin/k0sctl
chmod +x /usr/local/bin/k0sctl
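
A quick sanity check of both downloads (paths as used above) is to print their versions:

/tmp/k0s version
k0sctl version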

Then create a k0sctl.yaml config file for the multi-node Kubernetes cluster.

k0sctl.yaml file

apiVersion: k0sctl.k0sproject.io/v1beta1
kind:  Cluster
metadata:
  name: my-cluster
spec:
  hosts:
    - role: controller
      localhost:
        enabled: true
      files:
      - name: containerd-config
        src: /tmp/containerd.toml
        dstDir: /etc/k0s/
        perm: "0755"
        dirPerm: null
    - role: worker
      ssh:
        address: 43.88.62.134
        user: user
        keyPath: .ssh/id_rsa
      files:
      - name: containerd-config
        src: /tmp/containerd.toml
        dstDir: /etc/k0s/
        perm: "0755"
        dirPerm: null
    - role: worker
      ssh:
        address: 43.88.62.133
        user: user
        keyPath: .ssh/id_rsa
      files:
      - name: containerd-config
        src: /tmp/containerd.toml
        dstDir: /etc/k0s/
        perm: "0755"
        dirPerm: null
  k0s:
    version: 1.24.4+k0s.0
    config:
      spec:
        network:
          provider: calico

/tmp/containerd.toml file (uploaded to /etc/k0s/ on each host via the files entries above)

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
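
One way to confirm that containerd on a worker actually picked up this nvidia runtime entry is to query the CRI endpoint (assuming crictl is installed on the node; the socket path is the k0s one also used in values.yaml below):

sudo crictl --runtime-endpoint unix:///run/k0s/containerd.sock info | grep -A 5 '"nvidia"'

If nothing is printed, containerd is not using this config, which would be consistent with the "no runtime for nvidia is configured" error.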

Then run the command: k0sctl apply --config /path/to/k0sctl.yaml
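
Once the apply completes, k0sctl can emit a kubeconfig for the new cluster, which can be used to check that all nodes joined:

k0sctl kubeconfig --config /path/to/k0sctl.yaml > kubeconfig
export KUBECONFIG=$PWD/kubeconfig
kubectl get nodes -o wide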

Deploy NVIDIA GPU Operator

values.yaml file

operator:
  defaultRuntime: containerd

toolkit:
  version: v1.10.0-ubuntu20.04
  env:
    - name: CONTAINERD_CONFIG
      value: /etc/k0s/containerd.toml
    - name: CONTAINERD_SOCKET
      value: /run/k0s/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"

driver:
  manager:
    image: k8s-driver-manager
    repository: nvcr.io/nvidia/cloud-native
    version: v0.4.0
    imagePullPolicy: IfNotPresent
    env:
      - name: ENABLE_AUTO_DRAIN
        value: "true"
      - name: DRAIN_USE_FORCE
        value: "true"
      - name: DRAIN_POD_SELECTOR_LABEL
        value: ""
      - name: DRAIN_TIMEOUT_SECONDS
        value: "0s"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "true"
  repoConfig:
    configMapName: repo-config
  version: "495.29.05"


validator:
  version: "v1.11.0"
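
The two CONTAINERD_* paths in the toolkit section are k0s-specific, so it is worth double-checking that they exist on each worker before installing the operator, for example:

ls -l /etc/k0s/containerd.toml /run/k0s/containerd.sock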

Install Helm

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
   && chmod 700 get_helm.sh \
   && ./get_helm.sh

Now, add the NVIDIA Helm repository:

helm repo add nvidia https://nvidia.github.io/gpu-operator \
   && helm repo update
helm install --wait --generate-name \
     nvidia/gpu-operator
helm upgrade --install --namespace=gpu-operator --create-namespace --wait \
   --values=values.yaml gpu-operator nvidia/gpu-operator
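
The rollout can then be followed with standard Helm and kubectl commands, e.g.:

helm list --namespace gpu-operator
kubectl get pods --namespace gpu-operator --watch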

1. Are drivers/container-toolkit pre-installed on the host or installed by the GPU operator?

  • On both worker nodes the driver and container toolkit are pre-installed.
  • On the controller node they are not installed, because it is a non-GPU machine.

2. OS version: Ubuntu 20.04.5 LTS

3. Status of all pods under gpu-operator namespace

NAME                                                              READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-jc4wt                                       0/1     Init:0/1   0          18h
gpu-feature-discovery-r27zv                                       0/1     Init:0/1   0          18h
gpu-operator-1673351272-node-feature-discovery-master-65d8hl88v   1/1     Running    0          18h
gpu-operator-1673351272-node-feature-discovery-worker-8j72k       1/1     Running    0          18h
gpu-operator-1673351272-node-feature-discovery-worker-wj5gd       1/1     Running    0          18h
gpu-operator-95b545d6f-r2cnp                                      1/1     Running    0          18h
nvidia-container-toolkit-daemonset-lg79g                          1/1     Running    0          18h
nvidia-container-toolkit-daemonset-q26kq                          1/1     Running    0          18h
nvidia-dcgm-exporter-2vpwj                                        0/1     Init:0/1   0          18h
nvidia-dcgm-exporter-gx6dv                                        0/1     Init:0/1   0          18h
nvidia-device-plugin-daemonset-tbbgb                              0/1     Init:0/1   0          18h
nvidia-device-plugin-daemonset-z29kx                              0/1     Init:0/1   0          18h
nvidia-operator-validator-79s4j                                   0/1     Init:0/4   0          18h
nvidia-operator-validator-thbq2                                   0/1     Init:0/4   0          18h
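
The sandbox error from the title typically shows up in the events of the pods stuck in Init, so describing one of them (pod name taken from the listing above) and checking that the nvidia RuntimeClass exists can narrow things down:

kubectl -n gpu-operator describe pod nvidia-operator-validator-79s4j | tail -n 20
kubectl get runtimeclass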

4. Logs from init-containers

from device-plugin

Error from server (BadRequest): container "toolkit-validation" in pod "nvidia-device-plugin-daemonset-tbbgb" is waiting to start: PodInitializing
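
(This is what kubectl returns while the init container has not started yet; the log was collected with something like kubectl -n gpu-operator logs nvidia-device-plugin-daemonset-tbbgb -c toolkit-validation.)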

from container-toolkit
time="2023-01-10T11:57:43Z" level=info msg="Successfully loaded config"
time="2023-01-10T11:57:43Z" level=info msg="Config version: 2"
time="2023-01-10T11:57:43Z" level=info msg="Updating config"
time="2023-01-10T11:57:43Z" level=info msg="Successfully updated config"
time="2023-01-10T11:57:43Z" level=info msg="Flushing config"
time="2023-01-10T11:57:43Z" level=info msg="Successfully flushed config"
time="2023-01-10T11:57:43Z" level=info msg="Sending SIGHUP signal to containerd"
time="2023-01-10T11:57:43Z" level=info msg="Successfully signaled containerd"
time="2023-01-10T11:57:43Z" level=info msg="Completed 'setup' for containerd"
time="2023-01-10T11:57:43Z" level=info msg="Waiting for signal"
[nvidia-smi output, header truncated]

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
time="2023-01-10T11:51:53Z" level=info msg="Successfully loaded config"
time="2023-01-10T11:51:53Z" level=info msg="Config version: 2"
time="2023-01-10T11:51:53Z" level=info msg="Updating config"
time="2023-01-10T11:51:53Z" level=info msg="Successfully updated config"
time="2023-01-10T11:51:53Z" level=info msg="Flushing config"
time="2023-01-10T11:51:53Z" level=info msg="Successfully flushed config"
time="2023-01-10T11:51:53Z" level=info msg="Sending SIGHUP signal to containerd"
time="2023-01-10T11:51:53Z" level=info msg="Successfully signaled containerd"
time="2023-01-10T11:51:53Z" level=info msg="Completed 'setup' for containerd"
time="2023-01-10T11:51:53Z" level=info msg="Waiting for signal"

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1601      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1601      G   /usr/lib/xorg/Xorg                  9MiB |
|    1   N/A  N/A      1736      G   /usr/bin/gnome-shell                8MiB |
+-----------------------------------------------------------------------------+

captainsk7 · Jan 03 '23

Hi @captainsk7! Have you managed to solve this issue?

nekwar · Jun 05 '24

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] · Nov 05 '25

The k0s documentation has been updated to show how to correctly configure nvidia-container-runtime. See: https://docs.k0sproject.io/stable/runtime/?h=gpu+operator#using-nvidia-container-runtime
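
For reference, the approach described there configures the nvidia runtime through a containerd drop-in file instead of replacing the whole containerd config. A rough sketch, where the file path and exact contents are assumptions to be verified against the linked page:

# assumed drop-in path per the k0s docs, e.g. /etc/k0s/containerd.d/nvidia.toml
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  privileged_without_host_devices = false
  runtime_engine = ""
  runtime_root = ""
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

The runtime block itself matches the one in the /tmp/containerd.toml shown earlier in this issue.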

Given that a lot has changed over time, I would encourage you to try the latest version and see if that helps.

If you still have issues with the latest version of the NVIDIA GPU Operator, please feel free to open a new issue with updated details.

rahulait · Nov 13 '25