nvidia container toolkit stays in "waiting to start"
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node?
- [x] Are you running Kubernetes v1.13+?
- [ ] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [ ] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes? (see the snippet below)
- [x] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
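If it helps, a quick way to confirm those two kernel modules on a GPU node is plain `lsmod`; nothing operator-specific is assumed here:

```sh
# run on the GPU node; both modules should appear if they are loaded
lsmod | grep -E 'i2c_core|ipmi_msghandler'
```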
1. Issue or feature description
On a Rancher 1.24 cluster, I deployed gpu-operator v23.3.2 with Helm 3, and the NVIDIA container toolkit stays in a waiting-to-start state.
2. Steps to reproduce the issue
chart_version=v23.3.2;
name_space="gpu-operator";
release_name="gpu-operator";
confpath="/values.yaml";
helm install -f ${confpath} --wait ${release_name} -n ${name_space} nvidia/gpu-operator --version ${chart_version}
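To see the symptom from the title right after the install, the toolkit pod in the release namespace can be inspected; the daemonset label below is an assumption based on a default gpu-operator install:

```sh
# list all operator pods; the toolkit pod is the one stuck waiting
kubectl get pods -n gpu-operator

# inspect the events and state of the toolkit daemonset pod (label assumed)
kubectl describe pod -n gpu-operator -l app=nvidia-container-toolkit-daemonset
```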
My values.yaml:
mig:
  strategy: single
driver:
  enabled: true
  version: "535.54.03"
toolkit:
  enabled: true
  env:
  - name: CONTAINERD_CONFIG
    value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    # "true" uses this runtime for all pods; otherwise only pods with "runtimeClassName: nvidia" use it
    value: "false"
  - name: CONTAINERD_RESTART_MODE
    value: systemd
I haven't installed the NVIDIA container toolkit manually; I thought the gpu-operator should be able to install both the driver and the toolkit itself.
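In case it helps debugging, here is a rough way to check on the RKE2 node whether the toolkit ever got far enough to patch the containerd config template referenced above. The paths are taken from the values; the daemonset label is an assumption based on a default install:

```sh
# on the GPU node: the toolkit should add an "nvidia" runtime entry to the template it was pointed at
grep -A3 'nvidia' /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl

# the toolkit container logs often show why it is stuck (label assumed)
kubectl -n gpu-operator logs -l app=nvidia-container-toolkit-daemonset --all-containers --tail=50
```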
Seeing the same issue here.
- OS: Ubuntu 22.04
- GPU Operator: v24.6.1
- Platform: arm64
- K8s: K3s
No extra values config, just installing with:
helm install --wait gpu-operator \
-n gpu-operator --create-namespace \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia \
--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
--set-string toolkit.env[3].value=true \
nvidia/gpu-operator
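A sketch of the same checks for this K3s setup, using the config path and runtime class name from the command above (treat these as illustrative, not a definitive diagnosis):

```sh
# on the K3s node: confirm the nvidia runtime made it into the containerd config the toolkit was pointed at
grep -A3 'nvidia' /var/lib/rancher/k3s/agent/etc/containerd/config.toml

# and confirm the runtime class exists in the cluster
kubectl get runtimeclass nvidia
```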
This issue has been open for over 90 days without recent updates, and the context may now be outdated.
@yingding I don't see the nvidia-driver-daemonset pods running or scheduled, and they are required before any of the other pods can come up. The Node Feature Discovery (NFD) pods detect nodes with GPUs and apply specific labels to them. It's possible this cluster has no GPU nodes, so none of the nodes got a GPU-specific label applied, and therefore no nvidia-driver daemonset pod is scheduled in the cluster because there is no node for it to run on.
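For anyone landing here, a quick way to check whether NFD labeled any node as having an NVIDIA GPU; the PCI vendor label `10de` is what NFD normally applies for NVIDIA devices, but treat the exact label key as an assumption for your NFD version:

```sh
# nodes carrying the NFD label for NVIDIA PCI devices (vendor ID 10de)
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true

# or dump all nvidia/NFD-related labels on a specific node
kubectl get node <node-name> -o jsonpath='{.metadata.labels}' | tr ',' '\n' | grep -i -e nvidia -e 10de
```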
@stmcginnis gpu-operator has evolved a lot since then, and if you are still having issues with the latest gpu-operator and K3s, please open a separate issue so that someone from the team can look into it.
Since this issue is very old, we are going to close it. If you are hitting the same issue with the latest version of gpu-operator, even after verifying that the nodes have GPUs on them, please feel free to re-open this issue or create a new one.