latest gpu operator container toolkit daemonset behavior catastrophically breaks clusters running k0s
Describe the bug
See https://github.com/k0sproject/k0s/issues/6547. The two-step import you introduced, /etc/k0s/containerd.d/nvidia.toml -> /etc/containerd/conf.d/99-nvidia.toml, breaks k0s clusters.
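For context, a quick way to see what the toolkit actually wrote on an affected node (the paths are the ones named in this report; the file contents vary per install):
# first-stage drop-in, picked up by k0s's containerd from /etc/k0s/containerd.d/
cat /etc/k0s/containerd.d/nvidia.toml
# second-stage file that the drop-in in turn imports
cat /etc/containerd/conf.d/99-nvidia.toml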
To Reproduce
Use gpu-operator on a k0s cluster.
Expected behavior
Don't be too exotic with how you put these files in place.
Environment (please provide the following information):
- GPU Operator Version: v25.10.0
- OS: Ubuntu 24.04
- Kernel Version: 6.14
- Container Runtime Version: containerd 1.7.22
- Kubernetes Distro and Version: k0s
Information to attach (optional if deemed irrelevant)
(see referenced issue https://github.com/k0sproject/k0s/issues/6547)
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: nvidia-gpu-operator
  namespace: gpu-operator
spec:
  ...
  values:
    ...
    toolkit:
      enabled: true
      env:
        - name: RUNTIME_CONFIG
          value: "/tmp/nvidia-operator-dummy-import.toml"
        - name: RUNTIME_DROP_IN_CONFIG
          value: "/etc/k0s/containerd.d/nvidia.toml"
        - name: CONTAINERD_SOCKET
          value: "/run/k0s/containerd.sock"
        - name: RUNTIME_EXECUTABLE_PATH
          value: "/var/lib/k0s/bin/containerd"
        - name: CONTAINERD_RUNTIME_CLASS
          value: "nvidia"
        - name: CONTAINERD_SET_AS_DEFAULT
          value: "false"
Since your cluster will be dead, you will need to kubectl edit the changes into the HelmRelease, or redeploy it.
Then clear the problematic file and restart k0s:
ssh someusername@target-host -- 'sudo rm -f /etc/k0s/containerd.d/nvidia.toml && sudo k0s stop && sudo systemctl reset-failed k0sworker.service && sudo k0s start'
This works around the issue for now.
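A quick smoke test once the node comes back, to confirm the nvidia RuntimeClass still works; the CUDA image tag is only an example, substitute whatever you have available:
# node should return to Ready once k0s's containerd starts cleanly
kubectl get nodes
# throwaway pod forced onto the nvidia runtime class
kubectl run nvidia-smoke --rm -it --restart=Never \
  --image=nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 \
  --overrides='{"apiVersion":"v1","spec":{"runtimeClassName":"nvidia"}}' \
  -- nvidia-smi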
Sorry for this disruption, @doctorpangloss.
GPU Operator 25.10.0 introduces a breaking change (which will be fully documented tomorrow). The Helm command to install the GPU Operator for k0s is:
helm install gpu-operator nvidia/gpu-operator -n nvidia-gpu-operator --create-namespace --version=v25.10.0 \
  --set toolkit.env[0].name=CONTAINERD_CONFIG \
  --set toolkit.env[0].value=/run/k0s/containerd-cri.toml \
  --set toolkit.env[1].name=RUNTIME_DROP_IN_CONFIG \
  --set toolkit.env[1].value=/etc/k0s/containerd.d/nvidia.toml \
  --set toolkit.env[2].name=CONTAINERD_SOCKET \
  --set toolkit.env[2].value=/run/k0s/containerd.sock \
  --set toolkit.env[3].name=CONTAINERD_RUNTIME_CLASS \
  --set toolkit.env[3].value=nvidia
Notice the additional parameter for RUNTIME_DROP_IN_CONFIG.
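To sanity-check that these values actually landed, dump the env of the toolkit daemonset; the daemonset name below is the one the operator usually creates and may differ in your install:
kubectl -n nvidia-gpu-operator get ds nvidia-container-toolkit-daemonset -o yaml \
  | grep -E -A1 'CONTAINERD_CONFIG|RUNTIME_DROP_IN_CONFIG|CONTAINERD_SOCKET|CONTAINERD_RUNTIME_CLASS'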
We are working with the Mirantis team to reflect the change in their documentation too.
Mirantis has updated their docs accordingly: https://catalog.k0rdent.io/latest/apps/nvidia/#install
Should we expect a similar breakdown in k3s?
Helm command to install GPU operator for K0s is:
helm install gpu-operator nvidia/gpu-operator -n nvidia-gpu-operator --create-namespace --version=v25.10.0 --set toolkit.env[0].name=CONTAINERD_CONFIG --set toolkit.env[0].value=/run/k0s/containerd-cri.toml --set toolkit.env[1].name=RUNTIME_DROP_IN_CONFIG --set toolkit.env[1].value=/etc/k0s/containerd.d/nvidia.toml
This isn't correct: you don't want the NVIDIA Container Toolkit modifying the containerd config at all, only placing the drop-in file. This will still be broken. Use my workaround.
Should we expect a similar breakdown in k3s?
Check whether it uses drop-in configuration and merges it; if so, then yes. Otherwise, you'll have to test.
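A rough way to check that on a k3s node; the paths below are k3s defaults, not something from this thread, so treat them as an assumption:
# look for an imports / drop-in stanza in the containerd config k3s generates
grep -n imports /var/lib/rancher/k3s/agent/etc/containerd/config.toml
# if a config.toml.tmpl is present, the generated config is rebuilt from it on restart
ls -l /var/lib/rancher/k3s/agent/etc/containerd/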
I faced the same catastrophic behavior one month ago. I'm trying to upgrade a k0s cluster from gpu-operator v24.9.2 using the workaround stated here.
With @doctorpangloss's parameters, the default k0s runtime still gets wrecked and can no longer run containers. With @francisguillier's parameters, the gpu-operator deployment doesn't break anything, but CDI device injection isn't working for GPUs via the nvidia RuntimeClass => probably running into https://github.com/NVIDIA/gpu-operator/issues/1876 at that point.
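For anyone chasing the same CDI symptom, two hedged checks on the node (these are the standard NVIDIA Container Toolkit locations and tooling, not something confirmed in this thread):
# CDI specs, if generated, normally live in one of these directories
ls -l /etc/cdi /var/run/cdi
# nvidia-ctk can list the CDI devices it knows about from those specs
nvidia-ctk cdi list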