K3s in Docker (K3D) - `nvml error: insufficient permissions`
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04):
Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy
- Kernel Version:
Linux 6.8.0-76060800daily20240311-generic x86_64
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker):
Client: Docker Engine - Community
Version: 26.0.0
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.13.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.25.0
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 2
Running: 2
Paused: 0
Stopped: 0
Images: 3
Server Version: 26.0.0
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: nvidia runc io.containerd.runc.v2
Default Runtime: nvidia
Init Binary: docker-init
containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
runc version: v1.1.12-0-g51d5e94
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 6.8.0-76060800daily20240311-generic
Operating System: Ubuntu 22.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 62.64GiB
Name: law-laptop
ID: 01ca24d8-09fa-4fe9-a828-f61b9d53ef7c
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS):
- Rancher, K3s v1.27.4-k3s1 with nvidia/cuda:12.2.0-base-ubuntu22.04 using K3d v5.6.0
- Rancher, K3s v1.28.8-k3s1 with nvidia/cuda:12.4.1-base-ubuntu22.04 using K3d v5.6.0
- Rancher, K3s v1.27.4-k3s1 with nvidia/cuda:12.4.1-base-ubuntu22.04 using K3d v5.6.0
2. Issue or feature description
I have been unable to run the nvidia-device-plugin since updating my NVIDIA drivers to 550.x. Previously, GPU pass-through to my K3d containers through Docker worked without any issues using this K3s-Cuda support image. With the new NVIDIA drivers, I cannot access GPUs in Docker containers unless I force Docker to use CDI mode. Please see NVIDIA's documentation on enabling CDI for pass-through container access on Docker.
`nvidia-smi` works both within a Docker container and on my host system, as root and as a non-root user. I am also able to run a gpu-support-test both inside and outside my Docker containers.
The K3d extra args that I pass have varied as I turned CDI mode on and off. In CDI mode, I cannot pass `--gpus all`, because CDI mode disables Docker's ability to recognize GPU devices; instead I pass `--env NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all@server:0`. I also pass in the K3s-Cuda support image mentioned above, which includes the nvidia-device-plugin manifest and follows the install documentation closely. I have also tried `kubectl apply -f [...]` after the K3d cluster spins up correctly.
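To make the two argument sets described above easier to compare, here is a minimal Python sketch that assembles the `k3d cluster create` command line for each mode. The function name `k3d_create_args`, the cluster name `uds` (guessed from the node name `k3d-uds-server-0` in the events below), and the image reference `example/k3s-cuda:tag` are placeholders for illustration, not the exact values used in this report.

```python
import shlex
from typing import List


def k3d_create_args(cluster: str, image: str, cdi_mode: bool) -> List[str]:
    """Build the k3d command line for legacy or CDI GPU pass-through."""
    args = ["k3d", "cluster", "create", cluster, "--image", image]
    if cdi_mode:
        # CDI mode: request the GPUs through an env var on the server node.
        args += ["--env", "NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all@server:0"]
    else:
        # Legacy mode: let Docker hand the GPUs to the node container.
        args += ["--gpus", "all"]
    return args


if __name__ == "__main__":
    # Placeholder cluster name and image; substitute your own.
    for cdi in (False, True):
        print(shlex.join(k3d_create_args("uds", "example/k3s-cuda:tag", cdi)))
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without K3d installed.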
3. Information to attach (optional if deemed irrelevant)
Common error checking:
- [X] The output of `nvidia-smi -a` on your host
Thu Apr 11 16:45:08 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 ... Off | 00000000:01:00.0 Off | N/A |
| N/A 50C P0 N/A / 115W | 8MiB / 8188MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3003 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+
- [X] Your docker configuration file (e.g: `/etc/docker/daemon.json`)
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
- [X] The k8s-device-plugin container logs
│ Events: │
│ Type Reason Age From Message │
│ ---- ------ ---- ---- ------- │
│ Normal Scheduled 53s default-scheduler Successfully assigned kube-system/nvidia-device-plugin-daemonset-45djh to k3d-uds-server-0 │
│ Normal Pulling 51s kubelet Pulling image "nvcr.io/nvidia/k8s-device-plugin:v0.14.5" │
│ Normal Pulled 33s kubelet Successfully pulled image "nvcr.io/nvidia/k8s-device-plugin:v0.14.5" in 17.37s (17.37s including waiting) │
│ Normal Pulled 16s (x2 over 32s) kubelet Container image "nvcr.io/nvidia/k8s-device-plugin:v0.14.5" already present on machine │
│ Normal Created 16s (x3 over 33s) kubelet Created container nvidia-device-plugin-ctr │
│ Warning Failed 16s (x3 over 33s) kubelet Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' │
│ nvidia-container-cli: initialization error: nvml error: insufficient permissions: unknown │
│ Warning BackOff 4s (x4 over 31s) kubelet Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-daemonset-45djh_kube-system(58c77b98-2569-4543-ab2c-daf47e113e1a)
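The `Failed` event above carries two useful facts in one long message: the mode the NVIDIA hook auto-detected and the underlying `nvidia-container-cli` error. Below is a small Python sketch for pulling both out when comparing runs. The message is copied from the event with the wrapped lines joined; `parse_hook_failure` and the two regular expressions are illustrative helpers, not part of any NVIDIA tooling.

```python
import re
from typing import Dict, Optional

# Message from the "Failed" event above, with the wrapped lines joined.
EVENT = (
    "Error: failed to create containerd task: failed to create shim task: "
    "OCI runtime create failed: runc create failed: unable to start container process: "
    "error during container init: error running hook #0: error running hook: "
    "exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' "
    "nvidia-container-cli: initialization error: nvml error: insufficient permissions: unknown"
)

MODE_RE = re.compile(r"Auto-detected mode as '([^']+)'")
# Capture everything after the CLI prefix, dropping a trailing ": unknown".
CLI_RE = re.compile(r"nvidia-container-cli: (.+?)(?:: unknown)?$")


def parse_hook_failure(message: str) -> Dict[str, Optional[str]]:
    """Return the auto-detected hook mode and the nvidia-container-cli error."""
    mode = MODE_RE.search(message)
    cli = CLI_RE.search(message.strip())
    return {
        "mode": mode.group(1) if mode else None,
        "cli_error": cli.group(1) if cli else None,
    }


if __name__ == "__main__":
    print(parse_hook_failure(EVENT))
```

For the event above this reports the mode as `legacy`, which is worth checking against whichever mode was intended for the run.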
Additional information that might help better understand your environment and reproduce the bug:
- [X] Docker version from `docker version`:
Client: Docker Engine - Community
Version: 26.0.0
API version: 1.45
Go version: go1.21.8
Git commit: 2ae903e
Built: Wed Mar 20 15:17:48 2024
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 26.0.0
API version: 1.45 (minimum version 1.24)
Go version: go1.21.8
Git commit: 8b79278
Built: Wed Mar 20 15:17:48 2024
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.28
GitCommit: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
nvidia:
Version: 1.1.12
GitCommit: v1.1.12-0-g51d5e94
docker-init:
Version: 0.19.0
GitCommit: de40ad0
- [X] Docker command, image and tag used: nvidia-device-plugin manifest, and also see Issue/Feature description for details
- [X] Kernel version from `uname -a`: Linux 6.8.0-76060800daily20240311-generic x86_64
- [X] Any relevant kernel output lines from `dmesg`: N/A, let me know what to `grep` for first.
- [X] NVIDIA packages version from `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'`:
||/ Name Version Architecture Description
+++-=====================================-============================================-============-=========================================================
un libgldispatch0-nvidia <none> <none> (no description available)
ii libnvidia-cfg1-550:amd64 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 NVIDIA binary OpenGL/GLX configuration library
un libnvidia-cfg1-any <none> <none> (no description available)
un libnvidia-common <none> <none> (no description available)
ii libnvidia-common-550 550.54.14-1pop0~1709151545~22.04~c91e06a~dev all Shared files used by the NVIDIA libraries
un libnvidia-compute <none> <none> (no description available)
rc libnvidia-compute-525:amd64 545.29.06-1pop0~1701107297~22.04~7642405~dev amd64 Transitional package for libnvidia-compute-545
rc libnvidia-compute-535-server:amd64 535.161.07-0ubuntu0.22.04.1 amd64 NVIDIA libcompute package
rc libnvidia-compute-545:amd64 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 Transitional package for libnvidia-compute-550
ii libnvidia-compute-550:amd64 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 NVIDIA libcompute package
ii libnvidia-container-tools 1.14.6-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.14.6-1 amd64 NVIDIA container runtime library
un libnvidia-decode <none> <none> (no description available)
ii libnvidia-decode-550:amd64 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-egl-wayland1:amd64 1:1.1.9-1.1 amd64 Wayland EGL External Platform library -- shared library
un libnvidia-encode <none> <none> (no description available)
ii libnvidia-encode-550:amd64 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 NVENC Video Encoding runtime library
un libnvidia-encode1 <none> <none> (no description available)
un libnvidia-extra <none> <none> (no description available)
ii libnvidia-extra-550:amd64 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 Extra libraries for the NVIDIA driver
un libnvidia-fbc1 <none> <none> (no description available)
ii libnvidia-fbc1-550:amd64 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
un libnvidia-gl <none> <none> (no description available)
un libnvidia-gl-390 <none> <none> (no description available)
un libnvidia-gl-410 <none> <none> (no description available)
ii libnvidia-gl-550:amd64 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
un libnvidia-legacy-390xx-egl-wayland1 <none> <none> (no description available)
un libnvidia-ml.so.1 <none> <none> (no description available)
un nvidia-384 <none> <none> (no description available)
un nvidia-390 <none> <none> (no description available)
un nvidia-common <none> <none> (no description available)
un nvidia-compute-utils <none> <none> (no description available)
rc nvidia-compute-utils-535-server 535.161.07-0ubuntu0.22.04.1 amd64 NVIDIA compute utilities
ii nvidia-compute-utils-550 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 NVIDIA compute utilities
un nvidia-container-runtime <none> <none> (no description available)
un nvidia-container-runtime-hook <none> <none> (no description available)
ii nvidia-container-toolkit 1.14.6-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.14.6-1 amd64 NVIDIA Container Toolkit Base
un nvidia-cuda-dev <none> <none> (no description available)
un nvidia-cuda-doc <none> <none> (no description available)
un nvidia-cuda-gdb <none> <none> (no description available)
rc nvidia-cuda-toolkit 11.5.1-1ubuntu1 amd64 NVIDIA CUDA development toolkit
un nvidia-cuda-toolkit-doc <none> <none> (no description available)
rc nvidia-dkms-535-server 535.161.07-0ubuntu0.22.04.1 amd64 NVIDIA DKMS package
ii nvidia-dkms-550 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 NVIDIA DKMS package
rc nvidia-dkms-550-open 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 NVIDIA DKMS package (open kernel module)
un nvidia-dkms-kernel <none> <none> (no description available)
ii nvidia-driver-550 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 NVIDIA driver metapackage
un nvidia-driver-binary <none> <none> (no description available)
un nvidia-egl-wayland-common <none> <none> (no description available)
un nvidia-firmware-535-server-535.161.07 <none> <none> (no description available)
ii nvidia-firmware-550-550.54.14 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 Firmware files used by the kernel module
un nvidia-kernel-common <none> <none> (no description available)
rc nvidia-kernel-common-525:amd64 545.29.06-1pop0~1701107297~22.04~7642405~dev amd64 Transitional package for nvidia-kernel-common-545
rc nvidia-kernel-common-535-server 535.161.07-0ubuntu0.22.04.1 amd64 Shared files used with the kernel module
rc nvidia-kernel-common-545:amd64 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 Transitional package for nvidia-kernel-common-550
ii nvidia-kernel-common-550 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 Shared files used with the kernel module
un nvidia-kernel-source <none> <none> (no description available)
un nvidia-kernel-source-535-server <none> <none> (no description available)
ii nvidia-kernel-source-550 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 NVIDIA kernel source package
un nvidia-kernel-source-550-open <none> <none> (no description available)
un nvidia-libopencl1-dev <none> <none> (no description available)
un nvidia-opencl-dev <none> <none> (no description available)
un nvidia-opencl-icd <none> <none> (no description available)
un nvidia-persistenced <none> <none> (no description available)
un nvidia-prime <none> <none> (no description available)
un nvidia-profiler <none> <none> (no description available)
ii nvidia-settings 550.54.15-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
un nvidia-settings-binary <none> <none> (no description available)
un nvidia-smi <none> <none> (no description available)
un nvidia-utils <none> <none> (no description available)
ii nvidia-utils-535:amd64 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 Transitional package for nvidia-utils-550
ii nvidia-utils-550 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 NVIDIA driver support binaries
un nvidia-visual-profiler <none> <none> (no description available)
ii xserver-xorg-video-nvidia-550 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 NVIDIA binary Xorg driver
- [X] NVIDIA container library version from `nvidia-container-cli -V`:
cli-version: 1.14.6
lib-version: 1.14.6
build date: 2024-02-27T20:51+00:00
build revision: d2eb0afe86f0b643e33624ee64f065dd60e952d4
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
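The `dpkg -l` listing above mixes installed rows (`ii`), removed packages with leftover config (`rc`), and never-installed names (`un`), which makes version drift hard to spot by eye. The following Python sketch filters such a listing down to installed packages and strips the packaging suffix from each version. `SAMPLE` holds a few rows copied from the listing; `installed_packages` and `upstream_version` are illustrative helper names, and splitting on the first hyphen is a simplification of Debian version parsing.

```python
from typing import Dict

# A few rows copied from the dpkg listing above.
SAMPLE = """\
ii libnvidia-container1:amd64 1.14.6-1 amd64 NVIDIA container runtime library
rc nvidia-dkms-550-open 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 NVIDIA DKMS package (open kernel module)
ii nvidia-driver-550 550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64 NVIDIA driver metapackage
ii nvidia-settings 550.54.15-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
un nvidia-smi <none> <none> (no description available)
"""


def installed_packages(dpkg_output: str) -> Dict[str, str]:
    """Map package name -> version for rows whose status is 'ii'."""
    packages: Dict[str, str] = {}
    for line in dpkg_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) >= 3 and fields[0] == "ii":
            name = fields[1].split(":", 1)[0]  # drop an ':amd64' qualifier
            packages[name] = fields[2]
    return packages


def upstream_version(version: str) -> str:
    """Strip epoch and revision: '550.54.14-1pop0~...' -> '550.54.14'."""
    return version.split(":", 1)[-1].split("-", 1)[0]


if __name__ == "__main__":
    for name, version in sorted(installed_packages(SAMPLE).items()):
        print(f"{name}: {upstream_version(version)}")
```

Run against the full listing, this would give a short table of what is actually installed, for example showing `nvidia-settings` at 550.54.15 next to the 550.54.14 driver packages.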