Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node?
- [x] Are you running Kubernetes v1.13+?
- [ ] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [ ] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes?
- [x] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
2. Issue or feature description
I'm deploying gpu-operator from the Helm chart using ArgoCD in my Kubernetes cluster (1.23.17), which is built with kops on AWS infrastructure (not EKS).
I've been struggling with this for a while now; I've used both Docker and containerd as the container runtime in this cluster, and I'm currently running containerd v1.6.21.
After deploying the gpu-operator, this is what is happening in the gpu-operator namespace:
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-jtgll 0/1 Init:0/1 0 11m
gpu-feature-discovery-m82hx 0/1 Init:0/1 0 11m
gpu-feature-discovery-rzkzj 0/1 Init:0/1 0 11m
gpu-operator-6489b6d9-d5smv 1/1 Running 0 11m
gpu-operator-node-feature-discovery-master-86dd7c646-6jvns 1/1 Running 0 11m
gpu-operator-node-feature-discovery-worker-5r7g6 1/1 Running 0 11m
gpu-operator-node-feature-discovery-worker-5v7bn 1/1 Running 0 11m
gpu-operator-node-feature-discovery-worker-6lzkk 1/1 Running 0 11m
gpu-operator-node-feature-discovery-worker-7z6zw 1/1 Running 0 11m
gpu-operator-node-feature-discovery-worker-8t9hk 1/1 Running 0 11m
gpu-operator-node-feature-discovery-worker-b7k2t 1/1 Running 0 11m
gpu-operator-node-feature-discovery-worker-fz7f2 1/1 Running 0 11m
gpu-operator-node-feature-discovery-worker-hdp28 1/1 Running 0 11m
gpu-operator-node-feature-discovery-worker-j9f45 1/1 Running 0 11m
gpu-operator-node-feature-discovery-worker-rqx4l 1/1 Running 0 11m
gpu-operator-node-feature-discovery-worker-svk5h 1/1 Running 0 11m
gpu-operator-node-feature-discovery-worker-v6rx9 1/1 Running 0 11m
gpu-operator-node-feature-discovery-worker-wd7h7 1/1 Running 0 11m
gpu-operator-node-feature-discovery-worker-wqsp5 1/1 Running 0 11m
gpu-operator-node-feature-discovery-worker-xf7m6 1/1 Running 0 11m
nvidia-container-toolkit-daemonset-26djz 1/1 Running 0 11m
nvidia-container-toolkit-daemonset-72mvg 1/1 Running 0 11m
nvidia-container-toolkit-daemonset-trk6f 1/1 Running 0 11m
nvidia-dcgm-exporter-bpvks 0/1 Init:0/1 0 11m
nvidia-dcgm-exporter-cchvm 0/1 Init:0/1 0 11m
nvidia-dcgm-exporter-fd98x 0/1 Init:0/1 0 11m
nvidia-device-plugin-daemonset-fwwgr 0/1 Init:0/1 0 11m
nvidia-device-plugin-daemonset-kblb6 0/1 Init:0/1 0 11m
nvidia-device-plugin-daemonset-zlgdm 0/1 Init:0/1 0 11m
nvidia-driver-daemonset-mg5g8 1/1 Running 0 11m
nvidia-driver-daemonset-tschz 1/1 Running 0 11m
nvidia-driver-daemonset-x285r 1/1 Running 0 11m
nvidia-operator-validator-qjgsb 0/1 Init:0/4 0 11m
nvidia-operator-validator-trlfn 0/1 Init:0/4 0 11m
nvidia-operator-validator-vtkdz 0/1 Init:0/4 0 11m
Getting into more detail on the pods that are stuck in the Init state:
kubectl -n gpu-operator describe po gpu-feature-discovery-jtgll
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 23m default-scheduler Successfully assigned gpu-operator/gpu-feature-discovery-jtgll to ip-172-20-99-192.eu-west-1.compute.internal
Warning FailedCreatePodSandBox 3m46s (x93 over 23m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
kubectl -n gpu-operator describe po nvidia-dcgm-exporter-bpvks
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 24m default-scheduler Successfully assigned gpu-operator/nvidia-dcgm-exporter-bpvks to ip-172-20-45-35.eu-west-1.compute.internal
Warning FailedCreatePodSandBox 22m kubelet Failed to create pod sandbox: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
Warning FailedCreatePodSandBox 4m43s (x93 over 24m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
kubectl -n gpu-operator describe po nvidia-device-plugin-daemonset-fwwgr
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 25m default-scheduler Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-fwwgr to ip-172-20-99-192.eu-west-1.compute.internal
Warning FailedCreatePodSandBox 23m kubelet Failed to create pod sandbox: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
Warning FailedCreatePodSandBox 31s (x117 over 25m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
kubectl -n gpu-operator describe po nvidia-operator-validator-qjgsb
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 26m default-scheduler Successfully assigned gpu-operator/nvidia-operator-validator-qjgsb to ip-172-20-99-192.eu-west-1.compute.internal
Warning FailedCreatePodSandBox 80s (x117 over 26m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
And finally my ClusterPolicy:
```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  annotations:
    helm.sh/resource-policy: keep
  creationTimestamp: "2023-07-12T14:42:17Z"
  generation: 1
  labels:
    app.kubernetes.io/component: gpu-operator
    app.kubernetes.io/instance: gpu-operator
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: gpu-operator
    app.kubernetes.io/version: v23.3.2
    argocd.argoproj.io/instance: gpu-operator
    helm.sh/chart: gpu-operator-v23.3.2
  name: cluster-policy
  resourceVersion: "223035606"
  uid: 961e3b87-a5ff-47d9-944d-f9cca9e72fa9
spec:
  cdi:
    default: false
    enabled: false
  daemonsets:
    labels:
      app.kubernetes.io/managed-by: gpu-operator
      helm.sh/chart: gpu-operator-v23.3.2
    priorityClassName: system-node-critical
    rollingUpdate:
      maxUnavailable: "1"
    tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists
    updateStrategy: RollingUpdate
  dcgm:
    enabled: false
    hostPort: 5555
    image: dcgm
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: 3.1.7-1-ubuntu20.04
  dcgmExporter:
    enabled: true
    env:
    - name: DCGM_EXPORTER_LISTEN
      value: :9400
    - name: DCGM_EXPORTER_KUBERNETES
      value: "true"
    - name: DCGM_EXPORTER_COLLECTORS
      value: /etc/dcgm-exporter/dcp-metrics-included.csv
    image: dcgm-exporter
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/k8s
    serviceMonitor:
      additionalLabels: {}
      enabled: false
      honorLabels: false
      interval: 15s
    version: 3.1.7-3.1.4-ubuntu20.04
  devicePlugin:
    enabled: true
    env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY
      value: envvar
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
    image: k8s-device-plugin
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    version: v0.14.0-ubi8
  driver:
    certConfig:
      name: ""
    enabled: true
    image: driver
    imagePullPolicy: IfNotPresent
    kernelModuleConfig:
      name: ""
    licensingConfig:
      configMapName: ""
      nlsEnabled: false
    manager:
      env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
      - name: ENABLE_AUTO_DRAIN
        value: "false"
      - name: DRAIN_USE_FORCE
        value: "false"
      - name: DRAIN_POD_SELECTOR_LABEL
        value: ""
      - name: DRAIN_TIMEOUT_SECONDS
        value: 0s
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "false"
      image: k8s-driver-manager
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia/cloud-native
      version: v0.6.1
    rdma:
      enabled: false
      useHostMofed: false
    repoConfig:
      configMapName: ""
    repository: nvcr.io/nvidia
    startupProbe:
      failureThreshold: 120
      initialDelaySeconds: 60
      periodSeconds: 10
      timeoutSeconds: 60
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    usePrecompiled: false
    version: 525.105.17
    virtualTopology:
      config: ""
  gfd:
    enabled: true
    env:
    - name: GFD_SLEEP_INTERVAL
      value: 60s
    - name: GFD_FAIL_ON_INIT_ERROR
      value: "true"
    image: gpu-feature-discovery
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    version: v0.8.0-ubi8
  mig:
    strategy: single
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
    env:
    - name: WITH_REBOOT
      value: "false"
    gpuClientsConfig:
      name: ""
    image: k8s-mig-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v0.5.2-ubuntu20.04
  nodeStatusExporter:
    enabled: false
    image: gpu-operator-validator
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v23.3.2
  operator:
    defaultRuntime: containerd
    initContainer:
      image: cuda
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia
      version: 12.1.1-base-ubi8
    runtimeClass: nvidia
  psp:
    enabled: false
  sandboxDevicePlugin:
    enabled: true
    image: kubevirt-gpu-device-plugin
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    version: v1.2.1
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  toolkit:
    enabled: true
    image: container-toolkit
    imagePullPolicy: IfNotPresent
    installDir: /usr/local/nvidia
    repository: nvcr.io/nvidia/k8s
    version: v1.13.0-ubuntu20.04
  validator:
    image: gpu-operator-validator
    imagePullPolicy: IfNotPresent
    plugin:
      env:
      - name: WITH_WORKLOAD
        value: "true"
    repository: nvcr.io/nvidia/cloud-native
    version: v23.3.2
  vfioManager:
    driverManager:
      env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "false"
      - name: ENABLE_AUTO_DRAIN
        value: "false"
      image: k8s-driver-manager
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia/cloud-native
      version: v0.6.1
    enabled: true
    image: cuda
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    version: 12.1.1-base-ubi8
  vgpuDeviceManager:
    config:
      default: default
      name: ""
    enabled: true
    image: vgpu-device-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v0.2.1
  vgpuManager:
    driverManager:
      env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "false"
      - name: ENABLE_AUTO_DRAIN
        value: "false"
      image: k8s-driver-manager
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia/cloud-native
      version: v0.6.1
    enabled: false
    image: vgpu-manager
    imagePullPolicy: IfNotPresent
status:
  namespace: gpu-operator
  state: notReady
```
3. Steps to reproduce the issue
Deploy gpu-operator using Helm chart (23.3.2)
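For reference, the equivalent plain Helm install is roughly the following (I actually deploy the chart through ArgoCD, so the exact flags and values differ):

```shell
# Add the NVIDIA Helm repository and install the chart version used above.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v23.3.2
```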
4. Information to attach (optional if deemed irrelevant)
- [ ] kubernetes pods status: `kubectl get pods --all-namespaces`
- [ ] kubernetes daemonset status: `kubectl get ds --all-namespaces`
- [ ] If a pod/ds is in an error state or pending state: `kubectl describe pod -n NAMESPACE POD_NAME`
- [ ] If a pod/ds is in an error state or pending state: `kubectl logs -n NAMESPACE POD_NAME`
- [ ] Output of running a container on the GPU machine: `docker run -it alpine echo foo`
- [ ] Docker configuration file: `cat /etc/docker/daemon.json`
- [ ] Docker runtime configuration: `docker info | grep runtime`
- [ ] NVIDIA shared directory: `ls -la /run/nvidia`
- [ ] NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`
- [ ] NVIDIA driver directory: `ls -la /run/nvidia/driver`
- [ ] kubelet logs: `journalctl -u kubelet > kubelet.logs`
Additional info: apart from switching the container runtime from Docker to containerd, I have also tried different gpu-operator settings (values), including CDI enabled/disabled and RDMA enabled/disabled, to no avail.
Did you ever figure this out, @BartoszZawadzki? I'm dealing with the same issue on EKS and Ubuntu.
No, but since I'm using kops I tried this instead: https://kops.sigs.k8s.io/gpu/ and it worked out of the box.
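For anyone else on kops, a sketch of the relevant cluster-spec change from that page (field names per the linked docs; double-check against your kops version):

```shell
# Enable kops' built-in NVIDIA support in the cluster spec (kops edit cluster):
#
#   spec:
#     containerd:
#       nvidiaGPU:
#         enabled: true
#
# Then roll the change out to the nodes:
kops update cluster --yes
kops rolling-update cluster --yes
```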
I'm also running into this problem. How can it be solved?
`failed to get sandbox runtime: no runtime for "nvidia"` is a very generic error that happens when the container-toolkit is not able to apply the runtime config successfully or the driver install is not working. Please look at the status/logs of the nvidia-driver-daemonset and nvidia-container-toolkit pods to figure out the actual error.
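For example, something along these lines (daemonset names as they appear in the pod listing above):

```shell
# Check the driver and toolkit daemonsets and their logs on the GPU nodes.
kubectl -n gpu-operator get ds nvidia-driver-daemonset nvidia-container-toolkit-daemonset
kubectl -n gpu-operator logs ds/nvidia-driver-daemonset --all-containers --tail=100
kubectl -n gpu-operator logs ds/nvidia-container-toolkit-daemonset --all-containers --tail=100
# Overall operator state as reported in the ClusterPolicy status.
kubectl get clusterpolicies.nvidia.com -o jsonpath='{.items[0].status.state}'
```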
@shivamerla, does gpu-operator support Rocky Linux 9.1 (Blue Onyx)?
No, Rocky Linux is not supported currently.
> `failed to get sandbox runtime: no runtime for "nvidia"` is a very generic error that happens when the container-toolkit is not able to apply the runtime config successfully or the driver install is not working. Please look at the status/logs of the nvidia-driver-daemonset and nvidia-container-toolkit pods to figure out the actual error.
I have attached the logs from all containers deployed via the gpu-operator Helm chart in the initial issue.
We're running into the same problem: the gpu-feature-discovery, nvidia-operator-validator, nvidia-dcgm-exporter and nvidia-device-plugin-daemonset pods are all failing to start with Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
nvidia-gpu-operator-node-feature-discovery-worker log
nvidia-container-toolkit-daemonset log
EDIT: Our problem is this issue in containerd, which makes it impossible to use imports additively to configure containerd plugins. In our case we're configuring registry mirrors, which in turn completely overrides NVIDIA's runtime configuration. We'll probably have to go the same route as NVIDIA, meaning we'd have to parse config.toml, add our config, and write it back.
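For anyone hitting the same thing, a sketch of how to confirm whether the nvidia runtime survives your own containerd config changes (this assumes the operator's default installDir of /usr/local/nvidia, as in the ClusterPolicy above):

```shell
# Dump the configuration containerd actually resolves after all imports and
# overrides, and look for the runtime the toolkit is supposed to register.
sudo containerd config dump | grep -A4 'runtimes.nvidia'
# Expected to contain something like:
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#     runtime_type = "io.containerd.runc.v2"
#     [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#       BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
# If that block is missing, kubelet keeps reporting:
#   no runtime for "nvidia" is configured
```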
Hi, I once encountered the same error; here is my case for reference. A week ago I installed the NVIDIA driver, toolkit and device-plugin manually to test GPU workloads. I run containerd as the runtime for kubelet on Ubuntu 22.04, and the CUDA tests worked. A few days ago I tried installing gpu-operator; beforehand I uninstalled the driver, toolkit and device-plugin and reverted the /etc/containerd/config.toml changes, and then I got the same error as you. After reading many old issues about this error, I found a gpu-operator committer recommending the lsmod | grep nvidia command. It showed some NVIDIA driver modules still loaded by the Ubuntu kernel, meaning the uninstall was incomplete, so I rebooted the host, after which lsmod | grep nvidia returned nothing. After that everything was fine and all the NVIDIA pods became Running. Hope this is useful to you!
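In shell form, the check and cleanup described above are roughly (module names can vary by driver version):

```shell
# Look for NVIDIA kernel modules left over from a previous manual install.
lsmod | grep nvidia
# If anything is listed, the old driver was not fully removed. Unload the
# modules (this order respects their dependencies) or simply reboot the node:
sudo rmmod nvidia_drm nvidia_modeset nvidia_uvm nvidia
# After that, `lsmod | grep nvidia` should return nothing and the operator's
# driver daemonset can load its own modules.
```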