
nvidia container toolkit stay in waiting to start

Open · yingding opened this issue 2 years ago · 1 comment

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • [ ] Are you running on an Ubuntu 18.04 node?
  • [x] Are you running Kubernetes v1.13+?
  • [ ] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • [ ] Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • [x] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)? See the check below.
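
For that last item, a quick way to confirm the ClusterPolicy CRD exists and to inspect its state (a sketch; the exact status fields vary by operator version):

# confirm the CRD and the ClusterPolicy resource are present
kubectl get crd clusterpolicies.nvidia.com
kubectl get clusterpolicies.nvidia.com
kubectl describe clusterpolicies --all-namespaces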

2. Issue or feature description

On a Rancher (RKE2) cluster running Kubernetes 1.24, I deployed gpu-operator v23.3.2 with Helm 3, and the nvidia-container-toolkit pod stays waiting to start (screenshot: Screenshot 2023-07-17 at 13 34 55).

3. Steps to reproduce the issue

chart_version=v23.3.2;
name_space="gpu-operator";
release_name="gpu-operator";
confpath="/values.yaml";
helm install -f ${confpath} --wait ${release_name} -n ${name_space} nvidia/gpu-operator --version ${chart_version}
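
Not part of the original steps, but after the install, helm itself can confirm the release landed and which values were applied, even if --wait times out while operands are still coming up (a sketch using the variables defined above):

# inspect the release status and the values that were actually applied
helm status ${release_name} -n ${name_space}
helm get values ${release_name} -n ${name_space}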

My values.yaml:

mig:
  strategy: single

driver:
  enabled: true
  version: "535.54.03"

toolkit:
  enabled: true
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      # "true" makes nvidia the default runtime for all pods; with "false",
      # only pods that set "runtimeClassName: nvidia" use it (see the example pod below)
      value: "false"
    - name: CONTAINERD_RESTART_MODE
      value: systemd
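
With CONTAINERD_SET_AS_DEFAULT left at "false", only pods that explicitly request the nvidia RuntimeClass get the NVIDIA runtime. A minimal sketch of such a pod; the pod name and image are illustrative only, not taken from this thread:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test              # hypothetical name, for illustration only
spec:
  runtimeClassName: nvidia           # matches CONTAINERD_RUNTIME_CLASS above
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # example image, any CUDA image works
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1          # served by the device plugin the operator deploys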

I haven't installed the NVIDIA container toolkit manually; I assumed the gpu-operator would install both the driver and the toolkit itself.
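
Not in the original report, but to see exactly which operand is stuck and why, something like the following can help (a sketch; the label selector assumes the usual toolkit daemonset labels, which may differ by chart version):

kubectl get pods -n gpu-operator -o wide
kubectl describe pod -n gpu-operator -l app=nvidia-container-toolkit-daemonset
kubectl logs -n gpu-operator -l app=nvidia-container-toolkit-daemonset --all-containers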

yingding avatar Jul 17 '23 11:07 yingding

Seeing the same issue here.

OS: Ubuntu 22.04
GPU Operator: v24.6.1
Platform: arm64
K8s: K3s

No extra values config, just installing using:

helm install --wait gpu-operator \
    -n gpu-operator --create-namespace \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia \
    --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
    --set-string toolkit.env[3].value=true \
    nvidia/gpu-operator
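
If the toolkit pod does eventually start but GPU workloads still fail, it may be worth confirming that the toolkit actually patched the K3s containerd config given above and that the nvidia RuntimeClass was created (a sketch; the grep is only a rough check):

sudo grep -A3 nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
kubectl get runtimeclass nvidia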

stmcginnis avatar Aug 30 '24 05:08 stmcginnis

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] avatar Nov 05 '25 00:11 github-actions[bot]

This issue has been open for over 90 days without recent updates, and the context may now be outdated.

@yingding I don't see any nvidia-driver-daemonset pods running or scheduled, and they are required before the other operand pods can come up. The Node Feature Discovery (NFD) pods detect GPUs and apply GPU-specific labels to nodes. It is possible this cluster has no GPU nodes, so none of the nodes received the GPU label and the nvidia-driver daemonset had no node to run on.
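
A quick way to check whether NFD labeled any node as having an NVIDIA GPU (a sketch; the PCI label below is the standard NFD label for NVIDIA's vendor ID 10de, though exact label names can vary by operator version):

# nodes that NFD labeled with an NVIDIA PCI device
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true
# inspect all labels on a specific node
kubectl get node <node-name> --show-labels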

@stmcginnis gpu-operator has evolved a lot since then. If you are still having issues with the latest gpu-operator and k3s, please open a separate issue so that someone from the team can look into it.

Since this issue is very old, we are going to close it. If you hit the same issue with the latest version of gpu-operator, even after verifying that the nodes have GPUs, please feel free to re-open this issue or create a new one.

rahulait avatar Nov 13 '25 21:11 rahulait