
latest gpu operator container toolkit daemonset behavior catastrophically breaks clusters running k0s

Open doctorpangloss opened this issue 2 months ago • 7 comments

Describe the bug
https://github.com/k0sproject/k0s/issues/6547

The two-step import you introduced, /etc/k0s/containerd.d/nvidia.toml -> /etc/containerd/conf.d/99-nvidia.toml, breaks k0s clusters.

To Reproduce
Use gpu-operator on a k0s cluster.

Expected behavior
Don't be too exotic with how you put in these files.

Environment (please provide the following information):

  • GPU Operator Version: v25.10.0
  • OS: Ubuntu 24.04
  • Kernel Version: 6.14
  • Container Runtime Version: containerd 1.7.22
  • Kubernetes Distro and Version: k0s

Information to attach (optional if deemed irrelevant)

(see referenced issue https://github.com/k0sproject/k0s/issues/6547)

doctorpangloss avatar Oct 26 '25 17:10 doctorpangloss

---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: nvidia-gpu-operator
  namespace: gpu-operator
spec:
  ...
  values:
    ...
    toolkit:
      enabled: true
      env:
        # point the toolkit's runtime config rewrite at a throwaway file so it
        # never touches the real containerd configuration
        - name: RUNTIME_CONFIG
          value: "/tmp/nvidia-operator-dummy-import.toml"
        # k0s imports partial containerd configs from this directory
        - name: RUNTIME_DROP_IN_CONFIG
          value: "/etc/k0s/containerd.d/nvidia.toml"
        # k0s-managed containerd socket and binary
        - name: CONTAINERD_SOCKET
          value: "/run/k0s/containerd.sock"
        - name: RUNTIME_EXECUTABLE_PATH
          value: "/var/lib/k0s/bin/containerd"
        - name: CONTAINERD_RUNTIME_CLASS
          value: "nvidia"
        # do not make nvidia the default runtime
        - name: CONTAINERD_SET_AS_DEFAULT
          value: "false"

Since your cluster will be dead, you will need to kubectl edit the changes into the HelmRelease, or redeploy.
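For example, using the release name and namespace from the snippet above (assuming the Flux HelmRelease CRD is installed in your cluster):

kubectl -n gpu-operator edit helmrelease nvidia-gpu-operator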

Then clear the problematic file and restart k0s:

ssh someusername@target-host -- 'sudo rm -f /etc/k0s/containerd.d/nvidia.toml && sudo k0s stop && sudo systemctl reset-failed k0sworker.service && sudo k0s start'

This works around the issue for now.
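A quick smoke test to confirm the nvidia RuntimeClass still schedules GPU workloads after the restart (pod name and CUDA image tag below are placeholders, not from this thread):

kubectl get runtimeclass nvidia   # RuntimeClass created by the operator
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # placeholder name
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # any CUDA base image works
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
kubectl logs gpu-smoke-test       # should print the nvidia-smi table once the pod completes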

doctorpangloss avatar Oct 26 '25 17:10 doctorpangloss

Sorry for this disruption, @doctorpangloss.

GPU Operator 25.10.0 introduces a breaking change (which will be fully documented tomorrow). The Helm command to install the GPU Operator on k0s is:

helm install gpu-operator nvidia/gpu-operator -n nvidia-gpu-operator --create-namespace --version=v25.10.0 \
  --set toolkit.env[0].name=CONTAINERD_CONFIG \
  --set toolkit.env[0].value=/run/k0s/containerd-cri.toml \
  --set toolkit.env[1].name=RUNTIME_DROP_IN_CONFIG \
  --set toolkit.env[1].value=/etc/k0s/containerd.d/nvidia.toml \
  --set toolkit.env[2].name=CONTAINERD_SOCKET \
  --set toolkit.env[2].value=/run/k0s/containerd.sock \
  --set toolkit.env[3].name=CONTAINERD_RUNTIME_CLASS \
  --set toolkit.env[3].value=nvidia

Notice the additional parameter for RUNTIME_DROP_IN_CONFIG.
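The same settings expressed through a values file instead of --set flags would look roughly like this (the file name is arbitrary; untested sketch):

cat > gpu-operator-k0s-values.yaml <<'EOF'
toolkit:
  env:
    - name: CONTAINERD_CONFIG
      value: /run/k0s/containerd-cri.toml
    - name: RUNTIME_DROP_IN_CONFIG
      value: /etc/k0s/containerd.d/nvidia.toml
    - name: CONTAINERD_SOCKET
      value: /run/k0s/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
EOF
helm install gpu-operator nvidia/gpu-operator -n nvidia-gpu-operator --create-namespace \
  --version=v25.10.0 -f gpu-operator-k0s-values.yaml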

We are working with the Mirantis team to reflect the change in their documentation too.

francisguillier avatar Oct 26 '25 19:10 francisguillier

Mirantis has updated their docs accordingly: https://catalog.k0rdent.io/latest/apps/nvidia/#install

francisguillier avatar Oct 27 '25 20:10 francisguillier

Should we expect a similar breakdown in k3s?

armaneshaghi avatar Oct 31 '25 11:10 armaneshaghi

Helm command to install GPU operator for K0s is:

helm install gpu-operator nvidia/gpu-operator -n nvidia-gpu-operator --create-namespace --version=v25.10.0 --set toolkit.env[0].name=CONTAINERD_CONFIG --set toolkit.env[0].value=/run/k0s/containerd-cri.toml --set toolkit.env[1].name=RUNTIME_DROP_IN_CONFIG --set toolkit.env[1].value=/etc/k0s/containerd.d/nvidia.toml

This isn't correct: you don't want the NVIDIA Container Toolkit modifying the containerd config at all, only placing the drop-in file. This will still be broken. Use my workaround.

doctorpangloss avatar Nov 05 '25 00:11 doctorpangloss

Should we expect a similar breakdown in k3s?

Check whether it uses a drop-in configuration and merges it. If so, then yes; otherwise, you must test.
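A rough way to check on a k3s node (the paths below are k3s defaults as far as I know; verify against your install):

# k3s renders its containerd config from a template at startup
sudo ls -l /var/lib/rancher/k3s/agent/etc/containerd/
# see whether anything is importing extra drop-in files
sudo grep -n "imports" /var/lib/rancher/k3s/agent/etc/containerd/config.toml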

doctorpangloss avatar Nov 05 '25 00:11 doctorpangloss

I faced the same catastrophic behavior one month ago. I am trying to update a k0s cluster from gpu-operator v24.9.2 with the workaround stated here.

With @doctorpangloss's parameters, the default k0s runtime still gets wrecked and can no longer run containers. With @francisguillier's parameters, the deployment of gpu-operator doesn't break anything, but CDI device injection isn't working for the GPU via the nvidia RuntimeClass => probably running into https://github.com/NVIDIA/gpu-operator/issues/1876 at that point.
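For what it's worth, a rough way to check whether any CDI specs were generated at all on the affected node (assuming nvidia-ctk is available on the host and the default CDI spec directories are used):

sudo nvidia-ctk cdi list          # lists CDI devices known from the generated specs
ls /etc/cdi /var/run/cdi 2>/dev/null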

Archimonde666 avatar Nov 26 '25 14:11 Archimonde666