no runtime for "nvidia" is configured
1. Issue or feature description
When following the quickstart, I end up with this error from k describe po -n gpu-operator gpu-feature-discovery-6tk4h:
Warning FailedCreatePodSandBox 0s (x5 over 49s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
2. Steps to reproduce the issue
#!/bin/bash
kind delete cluster --name bionic-gpt-cluster
kind create cluster --name bionic-gpt-cluster --config=kind-config.yaml
kind export kubeconfig --name bionic-gpt-cluster
# kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.3/nvidia-device-plugin.yml
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia || true
helm repo update
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=false \
--set toolkit.enabled=false
3. Information to attach (optional if deemed irrelevant)
With my kind-config.yaml:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  # If we don't do this, then we can't connect on linux
  apiServerAddress: "0.0.0.0"
kubeadmConfigPatchesJSON6902:
  - group: kubeadm.k8s.io
    version: v1beta3
    kind: ClusterConfiguration
    patch: |
      - op: add
        path: /apiServer/certSANs/-
        value: host.docker.internal
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "ingress-ready=true"
    extraPortMappings:
      - containerPort: 80
        hostPort: 80
        protocol: TCP
      - containerPort: 443
        hostPort: 443
        protocol: TCP
containerdConfigPatches:
  - |-
    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = "/etc/containerd/certs.d"
Common error checking:
- [x] The output of nvidia-smi -a on your host
and docker run --rm nvidia/cuda:12.3.1-devel-centos7 nvidia-smi:
==========
== CUDA ==
==========
CUDA Version 12.3.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
Sun Jan 21 20:24:49 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| 0% 41C P8 8W / 220W | 100MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
- [ ] Your docker configuration file (e.g. /etc/docker/daemon.json)
My /etc/docker/daemon.json:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
and /etc/containerd/config.toml
disabled_plugins = ["cri"]
version = 1
[plugins]
  [plugins.cri]
    [plugins.cri.containerd]
      default_runtime_name = "nvidia"
      [plugins.cri.containerd.runtimes]
        [plugins.cri.containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins.cri.containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            Runtime = "/usr/bin/nvidia-container-runtime"
- [ ] The k8s-device-plugin container logs
I0121 20:28:50.870066 1 main.go:154] Starting FS watcher.
I0121 20:28:50.870195 1 main.go:161] Starting OS watcher.
I0121 20:28:50.870674 1 main.go:176] Starting Plugins.
I0121 20:28:50.870703 1 main.go:234] Loading configuration.
I0121 20:28:50.870918 1 main.go:242] Updating config with default resource matching patterns.
I0121 20:28:50.871290 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0121 20:28:50.871307 1 main.go:256] Retreiving plugins.
W0121 20:28:50.871782 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0121 20:28:50.871846 1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0121 20:28:50.871896 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0121 20:28:50.871903 1 factory.go:115] Incompatible platform detected
E0121 20:28:50.871909 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0121 20:28:50.871914 1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0121 20:28:50.871920 1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0121 20:28:50.871925 1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0121 20:28:50.871934 1 main.go:287] No devices found. Waiting indefinitely.
- [ ] The kubelet logs on the node (e.g:
sudo journalctl -r -u kubelet)
sudo journalctl -r -u kubelet
-- No entries --
Additional information that might help better understand your environment and reproduce the bug:
- [ ] Docker version from
docker version
docker version
Client: Docker Engine - Community
Version: 25.0.0
API version: 1.44
Go version: go1.21.6
Git commit: e758fe5
Built: Thu Jan 18 17:09:59 2024
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 25.0.0
API version: 1.44 (minimum version 1.24)
Go version: go1.21.6
Git commit: 615dfdf
Built: Thu Jan 18 17:09:59 2024
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.27
GitCommit: a1496014c916f9e62104b33d1bb5bd03b0858e59
nvidia:
Version: 1.1.11
GitCommit: v1.1.11-0-g4bccb38
docker-init:
Version: 0.19.0
GitCommit: de40ad0
- [ ] Docker command, image and tag used
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.3/nvidia-device-plugin.yml
and the helm below fails as well:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia || true
helm repo update
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=false \
--set toolkit.enabled=false
- [ ] Kernel version from
uname -a
uname -a
Linux saruman 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux
- [ ] Any relevant kernel output lines from
dmesg
none that I see?
sudo dmesg |grep -i nvidia
[ 2.829492] nvidia: loading out-of-tree module taints kernel.
[ 2.829501] nvidia: module license 'NVIDIA' taints kernel.
[ 2.846803] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 2.961803] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[ 2.962598] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 3.011519] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 525.147.05 Wed Oct 25 20:27:35 UTC 2023
[ 3.017901] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input8
[ 3.139762] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 525.147.05 Wed Oct 25 20:21:31 UTC 2023
[ 3.246519] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 3.246521] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[ 3.288796] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input9
[ 3.288989] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input10
[ 3.328821] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input11
[ 4.018783] audit: type=1400 audit(1705866938.070:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=774 comm="apparmor_parser"
[ 4.019493] audit: type=1400 audit(1705866938.070:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=774 comm="apparmor_parser"
[ 1754.666104] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 1754.677753] nvidia-uvm: Loaded the UVM driver, major device number 237.
- [ ] NVIDIA packages version from
dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
dpkg -l |grep -i nvidia
ii firmware-nvidia-gsp 525.147.05-4~deb12u1 amd64 NVIDIA GSP firmware
ii glx-alternative-nvidia 1.2.2 amd64 allows the selection of NVIDIA as GLX provider
ii libcuda1:amd64 525.147.05-4~deb12u1 amd64 NVIDIA CUDA Driver Library
ii libegl-nvidia0:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary EGL library
ii libgl1-nvidia-glvnd-glx:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary OpenGL/GLX library (GLVND variant)
ii libgles-nvidia1:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary OpenGL|ES 1.x library
ii libgles-nvidia2:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary OpenGL|ES 2.x library
ii libglx-nvidia0:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary GLX library
ii libnvcuvid1:amd64 525.147.05-4~deb12u1 amd64 NVIDIA CUDA Video Decoder runtime library
ii libnvidia-allocator1:amd64 525.147.05-4~deb12u1 amd64 NVIDIA allocator runtime library
ii libnvidia-cfg1:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-container-tools 1.14.3-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.14.3-1 amd64 NVIDIA container runtime library
ii libnvidia-egl-gbm1:amd64 1.1.0-2 amd64 GBM EGL external platform library for NVIDIA
ii libnvidia-egl-wayland1:amd64 1:1.1.10-1 amd64 Wayland EGL External Platform library -- shared library
ii libnvidia-eglcore:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary EGL core libraries
ii libnvidia-encode1:amd64 525.147.05-4~deb12u1 amd64 NVENC Video Encoding runtime library
ii libnvidia-glcore:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary OpenGL/GLX core libraries
ii libnvidia-glvkspirv:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary Vulkan Spir-V compiler library
ii libnvidia-ml1:amd64 525.147.05-4~deb12u1 amd64 NVIDIA Management Library (NVML) runtime library
ii libnvidia-ptxjitcompiler1:amd64 525.147.05-4~deb12u1 amd64 NVIDIA PTX JIT Compiler library
ii libnvidia-rtcore:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary Vulkan ray tracing (rtcore) library
ii nvidia-alternative 525.147.05-4~deb12u1 amd64 allows the selection of NVIDIA as GLX provider
ii nvidia-container-toolkit 1.14.3-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.14.3-1 amd64 NVIDIA Container Toolkit Base
ii nvidia-driver 525.147.05-4~deb12u1 amd64 NVIDIA metapackage
ii nvidia-driver-bin 525.147.05-4~deb12u1 amd64 NVIDIA driver support binaries
ii nvidia-driver-libs:amd64 525.147.05-4~deb12u1 amd64 NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries)
ii nvidia-egl-common 525.147.05-4~deb12u1 amd64 NVIDIA binary EGL driver - common files
ii nvidia-egl-icd:amd64 525.147.05-4~deb12u1 amd64 NVIDIA EGL installable client driver (ICD)
ii nvidia-installer-cleanup 20220217+3~deb12u1 amd64 cleanup after driver installation with the nvidia-installer
ii nvidia-kernel-common 20220217+3~deb12u1 amd64 NVIDIA binary kernel module support files
ii nvidia-kernel-dkms 525.147.05-4~deb12u1 amd64 NVIDIA binary kernel module DKMS source
ii nvidia-kernel-support 525.147.05-4~deb12u1 amd64 NVIDIA binary kernel module support files
ii nvidia-legacy-check 525.147.05-4~deb12u1 amd64 check for NVIDIA GPUs requiring a legacy driver
ii nvidia-modprobe 535.54.03-1~deb12u1 amd64 utility to load NVIDIA kernel modules and create device nodes
ii nvidia-persistenced 525.85.05-1 amd64 daemon to maintain persistent software state in the NVIDIA driver
ii nvidia-settings 525.125.06-1~deb12u1 amd64 tool for configuring the NVIDIA graphics driver
ii nvidia-smi 525.147.05-4~deb12u1 amd64 NVIDIA System Management Interface
ii nvidia-support 20220217+3~deb12u1 amd64 NVIDIA binary graphics driver support files
ii nvidia-vdpau-driver:amd64 525.147.05-4~deb12u1 amd64 Video Decode and Presentation API for Unix - NVIDIA driver
ii nvidia-vulkan-common 525.147.05-4~deb12u1 amd64 NVIDIA Vulkan driver - common files
ii nvidia-vulkan-icd:amd64 525.147.05-4~deb12u1 amd64 NVIDIA Vulkan installable client driver (ICD)
ii xserver-xorg-video-nvidia 525.147.05-4~deb12u1 amd64 NVIDIA binary Xorg driver
- [ ] NVIDIA container library version from
nvidia-container-cli -V
nvidia-container-cli -V
cli-version: 1.14.3
lib-version: 1.14.3
build date: 2023-10-19T11:32+00:00
build revision: 1eb5a30a6ad0415550a9df632ac8832bf7e2bbba
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
- [ ] NVIDIA container library logs (see troubleshooting)
The above page no longer exists.
sudo journalctl -u nvidia-container-toolkit
-- No entries --
Of note, I have also tried without KinD and instead using k0s with the exact same result.
Could you confirm that you're able to run nvidia-smi in the Kind worker node?
I can confirm that it does not run inside kind:
on the bare metal:
nvidia-smi
Tue Jan 23 17:10:33 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| 0% 50C P8 12W / 220W | 260MiB / 8192MiB | 5% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3883 G /usr/lib/xorg/Xorg 146MiB |
| 0 N/A N/A 4041 G /usr/bin/gnome-shell 67MiB |
| 0 N/A N/A 6091 G /usr/bin/nautilus 16MiB |
| 0 N/A N/A 78264 G ...b/firefox-esr/firefox-esr 10MiB |
| 0 N/A N/A 702357 G vlc 6MiB |
+-----------------------------------------------------------------------------+
from a container inside of k0s:
k logs nv-5dc699dbc6-xwhwt
==========
== CUDA ==
==========
CUDA Version 12.3.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .
/opt/nvidia/nvidia_entrypoint.sh: line 67: exec: nvidia-smi: not found
and from inside kind:
k logs nv-5df8456f86-9gkwf
==========
== CUDA ==
==========
CUDA Version 12.3.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .
/opt/nvidia/nvidia_entrypoint.sh: line 67: exec: nvidia-smi: not found
with this as my deployment:
cat nv-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    kompose.cmd: ./kompose convert -f docker-compose.yml
    kompose.version: 1.22.0 (955b78124)
  labels:
    io.kompose.service: nv
  name: nv
spec:
  replicas: 1
  selector:
    matchLabels:
      io.kompose.service: nv
  template:
    metadata:
      annotations:
        kompose.cmd: ./kompose convert -f docker-compose.yml
        kompose.version: 1.22.0 (955b78124)
      labels:
        io.kompose.network/noworky-default: "true"
        io.kompose.service: nv
    spec:
      containers:
        - args:
            - nvidia-smi
          image: nvidia/cuda:12.3.1-devel-centos7
          name: nv
      restartPolicy: Always
What are you doing to inject GPU support into the docker container that kind starts to represent the k8s node?
Something like this is necessary: https://github.com/kubernetes-sigs/kind/pull/3257#issuecomment-1607287275
Example: https://github.com/NVIDIA/k8s-dra-driver/blob/main/demo/clusters/kind/scripts/kind-cluster-config.yaml#L52
Using the example config you supplied I get the same results:
==========
== CUDA ==
==========
CUDA Version 12.3.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .
/opt/nvidia/nvidia_entrypoint.sh: line 67: exec: nvidia-smi: not found
I forgot to include that config file:
/etc/nvidia-container-runtime/config.toml
accept-nvidia-visible-devices-as-volume-mounts = true
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"
[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false
[nvidia-ctk]
path = "nvidia-ctk"
I even gave that create-cluster.sh script a try:
+++ local 'value=VERSION ?= v0.1.0'
+++ echo v0.1.0
++ DRIVER_IMAGE_VERSION=v0.1.0
++ : k8s-dra-driver
++ : ubuntu20.04
++ : v0.1.0
++ : nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0
++ : v1.27.1
++ : k8s-dra-driver-cluster
++ : /home/thoth/k8s-dra-driver/demo/clusters/kind/scripts/kind-cluster-config.yaml
++ : v20230515-01914134-containerd_v1.7.1
++ : gcr.io/k8s-staging-kind/base:v20230515-01914134-containerd_v1.7.1
++ : kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1
+ kind create cluster --retain --name k8s-dra-driver-cluster --image kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1 --config /home/thoth/k8s-dra-driver/demo/clusters/kind/scripts/kind-cluster-config.yaml
Creating cluster "k8s-dra-driver-cluster" ...
✓ Ensuring node image (kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1) 🖼
✓ Preparing nodes 📦 📦
✓ Writing configuration 📜
✓ Starting control-plane 🕹️
✓ Installing CNI 🔌
✓ Installing StorageClass 💾
✓ Joining worker nodes 🚜
Set kubectl context to "kind-k8s-dra-driver-cluster"
You can now use your cluster with:
kubectl cluster-info --context kind-k8s-dra-driver-cluster
Thanks for using kind! 😊
+ docker exec -it k8s-dra-driver-cluster-worker umount -R /proc/driver/nvidia
++ docker images --filter reference=nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0 -q
+ EXISTING_IMAGE_ID=
+ '[' '' '!=' '' ']'
+ set +x
Cluster creation complete: k8s-dra-driver-cluster
Same results though.
Appears to be the same issue as here: https://github.com/NVIDIA/k8s-device-plugin/issues/478
Backing up … what about running with GPUs under docker in general (i.e. without kind)?
docker run -e NVIDIA_VISIBLE_DEVICES=all ubuntu:22.04 nvidia-smi
If things are not configured properly to have that work, then kind will not work either.
To be clear, that will work so long as accept-nvidia-visible-devices-as-volume-mounts = false
Once that is configured to true you would need to run:
docker run -v /dev/null:/var/run/nvidia-container-devices/all ubuntu:22.04 nvidia-smi
Both seem to work:
docker run -e NVIDIA_VISIBLE_DEVICES=all ubuntu:22.04 nvidia-smi
Wed Jan 24 04:09:07 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| 0% 49C P8 12W / 220W | 156MiB / 8192MiB | 7% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
docker run -v /dev/null:/var/run/nvidia-container-devices/all ubuntu:22.04 nvidia-smi
Wed Jan 24 04:09:15 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| 0% 49C P8 12W / 220W | 156MiB / 8192MiB | 5% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
grep accept-nvidia-visible-devices-as /etc/nvidia-container-runtime/config.toml
accept-nvidia-visible-devices-as-volume-mounts = true
OK. That’s encouraging.
So you’re saying that even with that configured properly, if you run the create-cluster.sh script from the k8s-dra-driver repo, docker exec into the worker node created by kind, and run nvidia-smi, it doesn’t work?
well at the moment ./create-cluster.sh ends with this error:
+ kind load docker-image --name k8s-dra-driver-cluster nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0
Image: "nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0" with ID "sha256:9c74ea73db6f97a5e7287e11888757504b1e5ecfde4d2e5aa8396a25749ae046" not yet present on node "k8s-dra-driver-cluster-control-plane", loading...
Image: "nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0" with ID "sha256:9c74ea73db6f97a5e7287e11888757504b1e5ecfde4d2e5aa8396a25749ae046" not yet present on node "k8s-dra-driver-cluster-worker", loading...
ERROR: failed to load image: command "docker exec --privileged -i k8s-dra-driver-cluster-control-plane ctr --namespace=k8s.io images import --all-platforms --digests --snapshotter=overlayfs -" failed with error: exit status 1
Command Output: unpacking nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0 (sha256:e9df1b5622ca4f042dcff02f580a0a18ecad4b740fe639df2349c55067ef35b7)...time="2024-01-24T04:21:59Z" level=info msg="apply failure, attempting cleanup" error="wrong diff id calculated on extraction \"sha256:f344b08ff6c5121d786112e0f588c627da349e4289e409d1fde1b3ad8845fa66\"" key="extract-191866144-_8aF sha256:6c3e7df31590f02f10cb71fc4eb27653e9b428df2e6e5421a455b062bd2e39f9"
ctr: wrong diff id calculated on extraction "sha256:f344b08ff6c5121d786112e0f588c627da349e4289e409d1fde1b3ad8845fa66"
and ./install-dra-driver.sh now fails with:
+ kubectl label node k8s-dra-driver-cluster-control-plane --overwrite nvidia.com/dra.controller=true
node/k8s-dra-driver-cluster-control-plane labeled
+ helm upgrade -i --create-namespace --namespace nvidia-dra-driver nvidia /home/thoth/k8s-dra-driver/deployments/helm/k8s-dra-driver --wait
Release "nvidia" does not exist. Installing it now.
Error: client rate limiter Wait returned an error: context deadline exceeded
The build is successful from ./build-dra-driver.sh, so I'm kind of confused about what is wrong.
I tried doing an equivalent ctr run with:
sudo ctr run --env NVIDIA_VISIBLE_DEVICES=all docker.io/library/ubuntu:22.04 nvidia-smi
but it is just hanging here with no output.
I figured out the equivalent ctr command (I was missing the container ID nvidiacontainer above):
sudo ctr run --env NVIDIA_VISIBLE_DEVICES=all docker.io/library/ubuntu:22.04 nvidiacontainer nvidia-smi
ctr: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "/usr/bin/nvidia-smi": stat /usr/bin/nvidia-smi: no such file or directory: unknown
in comparison to the docker:
docker run -v /dev/null:/var/run/nvidia-container-devices/all ubuntu:22.04 nvidia-smi
Wed Jan 24 05:23:32 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| 0% 50C P8 12W / 220W | 117MiB / 8192MiB | 6% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
I'm kind of uncertain why that file exists here but not in the ctr form:
docker run -v /dev/null:/var/run/nvidia-container-devices/all ubuntu:22.04 which nvidia-smi
/usr/bin/nvidia-smi
sudo ctr run --env NVIDIA_VISIBLE_DEVICES=all docker.io/library/ubuntu:22.04 nvidia-smi500 which nvidia-smi
sudo ctr run --env NVIDIA_VISIBLE_DEVICES=all docker.io/library/ubuntu:22.04 nvidia-smi501 ls /usr/bin/nvidia-smi
ls: cannot access '/usr/bin/nvidia-smi': No such file or directory
probably some magic I'm unaware of.
ctr does not use the nvidia-container-runtime even if you have configured the CRI plugin in the containerd config to use it. The ctr command does not go through CRI, so it would need to be configured elsewhere to use the nvidia runtime (although that wouldn't help with your current problem of getting k8s to work, since k8s does communicate with containerd over CRI).
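For completeness, here is a rough sketch of pointing ctr directly at the NVIDIA runtime binary, outside of CRI. This assumes ctr's --runc-binary flag is available in your containerd version and that the toolkit installed the runtime at /usr/bin/nvidia-container-runtime; it only exercises the runtime itself, not the CRI path that kubelet uses.
# pull the image, then run it with the NVIDIA runtime binary as the OCI runtime
sudo ctr image pull docker.io/library/ubuntu:22.04
sudo ctr run --rm -t \
  --runc-binary /usr/bin/nvidia-container-runtime \
  --env NVIDIA_VISIBLE_DEVICES=all \
  docker.io/library/ubuntu:22.04 nvidia-smi-test \
  nvidia-smi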
Since I don't have k0s experience, let's start out assuming that your goal is to install the GPU Operator in a Kind cluster with GPU support. This involves two stages:
- Starting a kind cluster with GPUs and the driver injected
- Installing the GPU Operator in this cluster.
I've tried to provide more details for each of the stages below. In order to get to the bottom of this issue we would need to identify which of these is not working as expected. Once we've run through the steps for kind it may be possible to map the steps to something like k0s.
Note that as prerequisites:
- The CUDA driver needs to be installed on the host. Since you're able to run nvidia-smi there, that seems to already be the case.
- The NVIDIA Container Toolkit needs to be installed on the host. The latest release (v1.14.4) is recommended (a minimal install sketch follows this list).
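For reference, a minimal sketch of installing the toolkit on a Debian-based host from the stable apt repository. The repository URLs mirror the ones used later in this thread; the exact steps are an assumption and may differ for your distribution.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit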
Starting a kind cluster with GPUs and drivers injected.
This needs to be set up as described in https://github.com/kubernetes-sigs/kind/pull/3257#issuecomment-1607287275
This means that we need to do the following:
- [ ] The nvidia runtime is configured as the default runtime in the docker daemon config. (Note that the daemon needs to be restarted to apply this config.)
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker
- [ ] The NVIDIA Container Runtime Hook is configured to allow device selection by mounts (accept-nvidia-visible-devices-as-volume-mounts = true is set in /etc/nvidia-container-runtime/config.toml).
- [ ] The applicable volume mount (/dev/null:/var/run/nvidia-container-devices/all) is added to the Kind config for the nodes that require GPU access. A sketch of the last two items follows this list.
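A rough sketch of the last two checklist items, assuming the stock config location and the kind-config.yaml shown earlier in this issue:
# enable device selection via volume mounts (assumes the key already appears in the file)
sudo sed -i \
  's/^#\?accept-nvidia-visible-devices-as-volume-mounts.*/accept-nvidia-visible-devices-as-volume-mounts = true/' \
  /etc/nvidia-container-runtime/config.toml

# the corresponding mount in the kind config (already present in the kind-config.yaml above):
#   extraMounts:
#     - hostPath: /dev/null
#       containerPath: /var/run/nvidia-container-devices/all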
In order to verify that the nodes have the GPU devices and Driver installed correctly one can exec into the Kind worker node and run nvidia-smi:
docker exec -ti <node-cluster> nvidia-smi -L
This should give the same output as on the host. I noted in your example that you are starting a single node Kind cluster. This should not affect the behaviour, but is a difference between our cluster definitions and the ones that you use.
Installing the GPU Operator on the Kind cluster
At this point, the Kind cluster represents a k8s cluster with only the GPU Driver installed. Even though the NVIDIA Container Toolkit is installed on the host, it has not been injected into the nodes.
This means that we should do one of the following:
- Explicitly install the NVIDIA Container Toolkit on each of the nodes
- Ensure that --set toolkit.enabled=true (the default) is specified when installing the GPU Operator. (Note that your description mentions that --set toolkit.enabled=false was specified.) A sketch of this option follows this list.
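As a sketch of the second option, keeping the preinstalled host driver but letting the operator deploy the toolkit (the same flags used elsewhere in this thread, with toolkit.enabled left at its default of true):
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=true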
For the Kind demo included in this repo, we don't use the GPU operator and as such we install the container toolkit when creating the cluster: https://github.com/NVIDIA/k8s-device-plugin/blob/2bef25804caf5924f35a164158f097f954fe4c74/demo/clusters/kind/scripts/create-kind-cluster.sh#L38-L47
Note that the Kind nodes themselves are effectively Debian nodes and are not officially supported. Most of this might be due to driver container limitations and may not be applicable in this case, since we are dealing with a preinstalled driver.
on the host:
nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.14.4
commit: d167812ce3a55ec04ae2582eff1654ec812f42e1
cat /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
cat /etc/nvidia-container-runtime/config.toml
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"
[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false
[nvidia-ctk]
path = "nvidia-ctk"
docker exec -it 3251f /bin/bash
root@k8s-dra-driver-cluster-worker:/# nvidia-smi
Wed Jan 24 15:10:56 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| 0% 47C P8 12W / 220W | 169MiB / 8192MiB | 5% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@k8s-dra-driver-cluster-worker:/# nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3070 (UUID: GPU-b83f1b66-74d7-a38e-932e-ef815cb45105)
However I seem to be stuck on the install inside the worker:
root@k8s-dra-driver-cluster-worker:/# apt-get install -y nvidia-container-toolkit
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
nvidia-container-toolkit is already the newest version (1.15.0~rc.1-1).
You might want to run 'apt --fix-broken install' to correct these.
The following packages have unmet dependencies:
nvidia-container-toolkit : Depends: nvidia-container-toolkit-base (= 1.15.0~rc.1-1) but it is not going to be installed
E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).
root@k8s-dra-driver-cluster-worker:/# apt-get install -y nvidia-container-toolkit-base
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
nvidia-container-toolkit-base
0 upgraded, 1 newly installed, 0 to remove and 26 not upgraded.
3 not fully installed or removed.
Need to get 2361 kB of archives.
After this operation, 10.8 MB of additional disk space will be used.
Get:1 https://nvidia.github.io/libnvidia-container/experimental/deb/amd64 nvidia-container-toolkit-base 1.15.0~rc.1-1 [2361 kB]
Fetched 2361 kB in 0s (10.6 MB/s)
debconf: delaying package configuration, since apt-utils is not installed
(Reading database ... 11315 files and directories currently installed.)
Preparing to unpack .../nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb ...
Unpacking nvidia-container-toolkit-base (1.15.0~rc.1-1) ...
dpkg: error processing archive /var/cache/apt/archives/nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb (--unpack):
unable to make backup link of './usr/bin/nvidia-ctk' before installing new version: Invalid cross-device link
Errors were encountered while processing:
/var/cache/apt/archives/nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
root@k8s-dra-driver-cluster-worker:/# apt --fix-broken install
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Correcting dependencies... Done
The following additional packages will be installed:
nvidia-container-toolkit-base
The following NEW packages will be installed:
nvidia-container-toolkit-base
0 upgraded, 1 newly installed, 0 to remove and 26 not upgraded.
3 not fully installed or removed.
Need to get 2361 kB of archives.
After this operation, 10.8 MB of additional disk space will be used.
Do you want to continue? [Y/n]
Get:1 https://nvidia.github.io/libnvidia-container/experimental/deb/amd64 nvidia-container-toolkit-base 1.15.0~rc.1-1 [2361 kB]
Fetched 2361 kB in 0s (11.9 MB/s)
debconf: delaying package configuration, since apt-utils is not installed
(Reading database ... 11315 files and directories currently installed.)
Preparing to unpack .../nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb ...
Unpacking nvidia-container-toolkit-base (1.15.0~rc.1-1) ...
dpkg: error processing archive /var/cache/apt/archives/nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb (--unpack):
unable to make backup link of './usr/bin/nvidia-ctk' before installing new version: Invalid cross-device link
Errors were encountered while processing:
/var/cache/apt/archives/nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
Of note, I am using the kind cluster config from this repo:
https://github.com/NVIDIA/k8s-dra-driver/blob/main/demo/clusters/kind/scripts/kind-cluster-config.yaml#L52
so it is no longer single-node.
For "reasons" we were injecting the /usr/bin/nvidia-ctk binary from the host into the container for the k8s-dra-driver. This is what is causing:
dpkg: error processing archive /var/cache/apt/archives/nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb (--unpack):
unable to make backup link of './usr/bin/nvidia-ctk' before installing new version: Invalid cross-device link
Remove the lines here in the kind cluster config. (Or unmount /usr/bin/nvidia-ctk before trying to install the toolkit).
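For example, something along these lines (node name taken from the cluster used above):
# unmount the injected binary inside the kind worker node, then install the toolkit there as usual
docker exec -it k8s-dra-driver-cluster-worker umount /usr/bin/nvidia-ctk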
I have an open action item to improve the installation of the toolkit in the DRA driver repo, but have not gotten around to it.
So unmounting /usr/bin/nvidia-ctk fixed the apt issues, and I can install nvidia-container-toolkit just fine, but that doesn't solve the problem: the nvidia-device-plugin-daemonset still seems unable to see the GPU:
k logs -n kube-system nvidia-device-plugin-daemonset-d82pg
I0125 03:54:44.043725 1 main.go:154] Starting FS watcher.
I0125 03:54:44.043771 1 main.go:161] Starting OS watcher.
I0125 03:54:44.043840 1 main.go:176] Starting Plugins.
I0125 03:54:44.043849 1 main.go:234] Loading configuration.
I0125 03:54:44.043895 1 main.go:242] Updating config with default resource matching patterns.
I0125 03:54:44.043975 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0125 03:54:44.043979 1 main.go:256] Retreiving plugins.
W0125 03:54:44.044136 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0125 03:54:44.044156 1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0125 03:54:44.044172 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0125 03:54:44.044174 1 factory.go:115] Incompatible platform detected
E0125 03:54:44.044176 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0125 03:54:44.044178 1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0125 03:54:44.044179 1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0125 03:54:44.044181 1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0125 03:54:44.044185 1 main.go:287] No devices found. Waiting indefinitely.
@joshuacox is containerd in the Kind node configured to use the nvidia runtime? In addition, if you don't set it as the default, you will have to add a RuntimeClass and specify it when installing the plugin.
See https://github.com/NVIDIA/k8s-device-plugin/blob/2bef25804caf5924f35a164158f097f954fe4c74/demo/clusters/kind/scripts/create-kind-cluster.sh#L50-L55 where we do this for the device plugin.
If you're installing the GPU Operator with --set toolkit.enabled=true this should be taken care of for you.
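If nvidia is not the default runtime in the node's containerd config, a rough sketch of the RuntimeClass route would look like this. The handler name must match the runtime name configured in containerd; the Helm repo alias and the runtimeClassName chart value are assumptions here, not confirmed by this thread.
kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

# then point the device plugin at it (assumes the chart repo was added as "nvdp")
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  -n nvidia-device-plugin --create-namespace \
  --set runtimeClassName=nvidia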
I am just fine with setting toolkit.enabled=true or any other flags; I just want it to work.
Seems to be getting closer. Do I need to umount another symlink here?
k logs -ngpu-operator nvidia-operator-validator-j6hfp -c driver-validation
time="2024-01-25T10:46:28Z" level=info msg="version: 8072420d"
time="2024-01-25T10:46:28Z" level=info msg="Detected pre-installed driver on the host"
running command chroot with args [/host nvidia-smi]
Thu Jan 25 10:46:28 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| 0% 49C P8 11W / 220W | 152MiB / 8192MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
time="2024-01-25T10:46:28Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices"
time="2024-01-25T10:46:28Z" level=info msg="Error: error validating driver installation: error creating symlink creator: failed to load NVIDIA kernel modules: failed to load module nvidia: exit status 1; output=modprobe: FATAL: Module nvidia not found in directory /lib/modules/6.1.0-17-amd64\n\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n validator:\n driver:\n env:\n - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n value: \"true\""
that was from a ./create-cluster.sh (in /k8s-dra-driver/demo/clusters/kind)
with this afterwards:
#!/bin/bash
#
export KIND_CLUSTER_NAME=k8s-dra-driver-cluster
docker exec -it "${KIND_CLUSTER_NAME}-worker" bash -c "umount /usr/bin/nvidia-ctk && apt-get update && apt-get install -y gpg && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/experimental/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list && apt-get update && apt-get install -y nvidia-container-toolkit && nvidia-ctk config --set nvidia-container-runtime.modes.cdi.annotation-prefixes=nvidia.cdi.k8s.io/ && nvidia-ctk runtime configure --runtime=containerd --cdi.enabled && systemctl restart containerd"
helm install \
--wait \
--generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=true \
--set toolkit.enabled=true
This issue is probably due to the symlink creation not working under kind. Please update the environment for the validator in the ClusterPolicy to disable the creation of symlinks as described in the error message.
See also https://github.com/NVIDIA/gpu-operator/issues/567
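If the operator is already installed, a hedged sketch of setting that env var on the existing ClusterPolicy (assuming the resource is named cluster-policy, as the owner reference later in this thread suggests):
kubectl patch clusterpolicies.nvidia.com/cluster-policy --type merge -p \
  '{"spec":{"validator":{"driver":{"env":[{"name":"DISABLE_DEV_CHAR_SYMLINK_CREATION","value":"true"}]}}}}'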
Environment for the validator in ClusterPolicy?
I only see a tiny section of the daemonset output that references a ClusterPolicy:
k get daemonset -n gpu-operator nvidia-operator-validator -o yaml|grep -C10 -i clusterpolicy
manager: kube-controller-manager
operation: Update
subresource: status
time: "2024-01-25T15:25:42Z"
name: nvidia-operator-validator
namespace: gpu-operator
ownerReferences:
- apiVersion: nvidia.com/v1
blockOwnerDeletion: true
controller: true
kind: ClusterPolicy
name: cluster-policy
uid: 1c2e2c3d-b21e-4767-8dd7-18c1535552de
resourceVersion: "23601"
uid: 30f847a6-654e-4136-b362-f912eb344d4c
spec:
revisionHistoryLimit: 10
selector:
matchLabels:
app: nvidia-operator-validator
app.kubernetes.io/part-of: gpu-operator
All of this seems way beyond the documentation. @elezar, is this because, as you said, the Kind nodes "are effectively Debian nodes and are not officially supported"? If so, what nodes are supported? On this page:
https://nvidia.github.io/libnvidia-container/stable/deb/
it says:
ubuntu18.04, ubuntu20.04, ubuntu22.04, debian10, debian11
So is this all because my host OS is Debian 12?
It just means that when you start the operator, you additionally pass:
--set validator.driver.env[0].name="DISABLE_DEV_CHAR_SYMLINK_CREATION"
--set validator.driver.env[0].value="true"
Error: INSTALLATION FAILED: 1 error occurred:
* ClusterPolicy.nvidia.com "cluster-policy" is invalid: spec.validator.driver.env[0].value: Invalid value: "boolean": spec.validator.driver.env[0].value in body must be of type string: "boolean"
I also tried removing the quotes around true to match my other set lines, and got the exact same results.
#!/bin/bash
#
export KIND_CLUSTER_NAME=k8s-dra-driver-cluster
docker exec -it "${KIND_CLUSTER_NAME}-worker" bash -c "umount /usr/bin/nvidia-ctk && apt-get update && apt-get install -y gpg && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/experimental/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list && apt-get update && apt-get install -y nvidia-container-toolkit && nvidia-ctk config --set nvidia-container-runtime.modes.cdi.annotation-prefixes=nvidia.cdi.k8s.io/ && nvidia-ctk runtime configure --runtime=containerd --cdi.enabled && systemctl restart containerd"
helm install \
--wait \
--generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set validator.driver.env[0].name="DISABLE_DEV_CHAR_SYMLINK_CREATION" \
--set validator.driver.env[0].value=true
I am also not seeing a validator section in the values.yaml:
https://github.com/NVIDIA/k8s-device-plugin/blob/v0.14.3/deployments/helm/nvidia-device-plugin/values.yaml
Am I looking in the wrong place?
Use --set-string.
Not all possible values are shown in the top-level values.yaml.
omg @klueska that one works!
kgp -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-jkfwl 1/1 Running 0 3m59s
gpu-operator-1706209589-node-feature-discovery-gc-7ccd95f7qcvpg 1/1 Running 0 4m12s
gpu-operator-1706209589-node-feature-discovery-master-7cdfmh5zt 1/1 Running 0 4m12s
gpu-operator-1706209589-node-feature-discovery-worker-wcwsp 1/1 Running 0 4m12s
gpu-operator-1706209589-node-feature-discovery-worker-xdcxd 1/1 Running 0 4m12s
gpu-operator-c4fd7b4b7-rv28r 1/1 Running 0 4m12s
nvidia-container-toolkit-daemonset-n994z 1/1 Running 0 3m59s
nvidia-cuda-validator-76zm5 0/1 Completed 0 3m42s
nvidia-dcgm-exporter-b6cs5 1/1 Running 0 3m59s
nvidia-device-plugin-daemonset-4mbb2 1/1 Running 0 3m59s
nvidia-operator-validator-z26kp 1/1 Running 0 3m59s
And to be clear, for any of you stumbling in from the internet, here are my complete additional steps beyond ./create-cluster.sh:
#!/bin/bash
#
export KIND_CLUSTER_NAME=k8s-dra-driver-cluster
docker exec -it "${KIND_CLUSTER_NAME}-worker" bash -c "umount /usr/bin/nvidia-ctk && apt-get update && apt-get install -y gpg && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/experimental/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list && apt-get update && apt-get install -y nvidia-container-toolkit && nvidia-ctk config --set nvidia-container-runtime.modes.cdi.annotation-prefixes=nvidia.cdi.k8s.io/ && nvidia-ctk runtime configure --runtime=containerd --cdi.enabled && systemctl restart containerd"
helm install \
--wait \
--generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set validator.driver.env[0].name="DISABLE_DEV_CHAR_SYMLINK_CREATION" \
--set-string validator.driver.env[0].value="true"
Now then, why did I have to do all this extra work over and above the documentation? Is it just because I'm on Debian 12? (I started on Arch Linux before opening this issue; I decided Debian might be more stable.) If this is the expected behavior I'll gladly make a PR documenting all this, but somehow I feel this is not the case. I am installing Ubuntu 22.04 (jammy) to a partition to test some more.
You're probably the first to run the operator under kind.
Hmmm, now I am going to have to give this another shot using another method. As I said, I've tried k0s above and will give that a second try now that I have a working sanity check. I am familiar with bootstrapping a cluster using both kubeadm and kubespray; I even scripted it all out with another project, kubash.
Are there any other setups that anyone has tried? What is 'supported'?
I've transferred this issue to the gpu-operator repo (since that's what the issue was really related to). I'll let the operator devs answer your last question.