
K3s in Docker (K3D) - `nvml error: insufficient permissions`

Open justinthelaw opened this issue 1 year ago • 0 comments

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04):
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy
  • Kernel Version: Linux 6.8.0-76060800daily20240311-generic x86_64

  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker):

Client: Docker Engine - Community
 Version:    26.0.0
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.13.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.25.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 26.0.0
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: nvidia runc io.containerd.runc.v2
 Default Runtime: nvidia
 Init Binary: docker-init
 containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.8.0-76060800daily20240311-generic
 Operating System: Ubuntu 22.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 62.64GiB
 Name: law-laptop
 ID: 01ca24d8-09fa-4fe9-a828-f61b9d53ef7c
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS):
    1. Rancher, K3s v1.27.4-k3s1 with nvidia/cuda:12.2.0-base-ubuntu22.04 using K3d v5.6.0
    2. Rancher, K3s v1.28.8-k3s1 with nvidia/cuda:12.4.1-base-ubuntu22.04 using K3d v5.6.0
    3. Rancher, K3s v1.27.4-k3s1 with nvidia/cuda:12.4.1-base-ubuntu22.04 using K3d v5.6.0

2. Issue or feature description

I have been unable to run the nvidia-device-plugin since updating my NVIDIA drivers to 550.x. Previously, GPU pass-through into my K3d containers worked through Docker without issues using this K3s-Cuda support image. With the new drivers, I cannot access GPUs in Docker containers unless I force Docker to use CDI mode. Please see NVIDIA's documentation on enabling CDI for pass-through container access on Docker.
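
For anyone reproducing this, the CDI setup referred to above can be sketched roughly as follows. This is based on NVIDIA's documented `nvidia-ctk` workflow and is not necessarily the exact sequence used on this machine. The `/etc/cdi/nvidia.yaml` output path and the `ubuntu` test image are assumptions, and `run` is a hypothetical wrapper that only prints each command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical helper: print the command by default, execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Generate a CDI specification describing the GPUs on the host.
# The output path is an assumption (a commonly used CDI spec directory).
generate_cdi_spec() {
  run sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
}

# Switch the NVIDIA container runtime from its auto-detected mode to CDI mode.
enable_cdi_mode() {
  run sudo nvidia-ctk config --in-place --set nvidia-container-runtime.mode=cdi
}

# Request all GPUs by their CDI name instead of using --gpus all.
cdi_smoke_test() {
  run docker run --rm --runtime=nvidia \
    -e NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all \
    ubuntu nvidia-smi -L
}

generate_cdi_spec
enable_cdi_mode
cdi_smoke_test
```

Running the script as-is only prints the three commands; `DRY_RUN=0 sh script.sh` would execute them and change the host configuration.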

nvidia-smi works both within a Docker container and on my host system as root and as a non-root user. I am also able to run a gpu-support-test both in and out of my Docker containers.

The K3d extra args that I pass have varied as I turned CDI mode on and off. In CDI mode, I cannot pass `--gpus all`, because CDI mode disables Docker's ability to recognize GPU devices; instead I pass `--env NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all@server:0`. I also pass the K3s-Cuda support image mentioned above, which includes the nvidia-device-plugin manifest and follows the install documentation closely. I have also tried `kubectl apply -f [...]` after the K3d cluster spins up correctly.
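
The two K3d invocations described above differ only in how the GPU is requested. The following is a minimal sketch, not the exact command line used. The cluster name `uds` is inferred from the node name `k3d-uds-server-0` in the pod events, `K3S_CUDA_IMAGE` is a placeholder because the image reference is not given in this report, and `k3d_gpu_args` is a hypothetical helper. The script prints the commands rather than running them.

```shell
#!/bin/sh
# Placeholder: the K3s-Cuda support image is not named in this report.
K3S_CUDA_IMAGE="${K3S_CUDA_IMAGE:-<k3s-cuda-support-image>}"

# Print the GPU-related k3d arguments for a given mode, one per line.
k3d_gpu_args() {
  case "$1" in
    legacy)
      # Legacy mode: let Docker hand all GPUs to the node containers.
      printf '%s\n' --gpus all
      ;;
    cdi)
      # CDI mode: request the GPUs by CDI name on the first server node.
      printf '%s\n' --env 'NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all@server:0'
      ;;
    *)
      echo "unknown mode: $1" >&2
      return 1
      ;;
  esac
}

# Print the full command for each mode instead of running it.
for mode in legacy cdi; do
  echo k3d cluster create uds --image "$K3S_CUDA_IMAGE" $(k3d_gpu_args "$mode")
done
```

Keeping the mode-specific arguments in one function makes it harder to mix `--gpus all` with the CDI environment variable, which is the combination that does not work here.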

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • [X] The output of nvidia-smi -a on your host
Thu Apr 11 16:45:08 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   50C    P0             N/A /  115W |       8MiB /   8188MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3003      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+
  • [X] Your docker configuration file (e.g: /etc/docker/daemon.json)
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
  • [X] The k8s-device-plugin container logs
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  53s                default-scheduler  Successfully assigned kube-system/nvidia-device-plugin-daemonset-45djh to k3d-uds-server-0
  Normal   Pulling    51s                kubelet            Pulling image "nvcr.io/nvidia/k8s-device-plugin:v0.14.5"
  Normal   Pulled     33s                kubelet            Successfully pulled image "nvcr.io/nvidia/k8s-device-plugin:v0.14.5" in 17.37s (17.37s including waiting)
  Normal   Pulled     16s (x2 over 32s)  kubelet            Container image "nvcr.io/nvidia/k8s-device-plugin:v0.14.5" already present on machine
  Normal   Created    16s (x3 over 33s)  kubelet            Created container nvidia-device-plugin-ctr
  Warning  Failed     16s (x3 over 33s)  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: insufficient permissions: unknown
  Warning  BackOff    4s (x4 over 31s)   kubelet            Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-daemonset-45djh_kube-system(58c77b98-2569-4543-ab2c-daf47e113e1a)

Additional information that might help better understand your environment and reproduce the bug:

  • [X] Docker version from docker version:
Client: Docker Engine - Community
 Version:           26.0.0
 API version:       1.45
 Go version:        go1.21.8
 Git commit:        2ae903e
 Built:             Wed Mar 20 15:17:48 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          26.0.0
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.21.8
  Git commit:       8b79278
  Built:            Wed Mar 20 15:17:48 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.28
  GitCommit:        ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 nvidia:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
  • [X] Docker command, image and tag used: nvidia-device-plugin manifest, and also see Issue/Feature description for details
  • [X] Kernel version from uname -a: Linux 6.8.0-76060800daily20240311-generic x86_64
  • [X] Any relevant kernel output lines from dmesg: N/A, let me know what to grep for first.
  • [X] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*':
||/ Name                                  Version                                      Architecture Description
+++-=====================================-============================================-============-=========================================================
un  libgldispatch0-nvidia                 <none>                                       <none>       (no description available)
ii  libnvidia-cfg1-550:amd64              550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64        NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any                    <none>                                       <none>       (no description available)
un  libnvidia-common                      <none>                                       <none>       (no description available)
ii  libnvidia-common-550                  550.54.14-1pop0~1709151545~22.04~c91e06a~dev all          Shared files used by the NVIDIA libraries
un  libnvidia-compute                     <none>                                       <none>       (no description available)
rc  libnvidia-compute-525:amd64           545.29.06-1pop0~1701107297~22.04~7642405~dev amd64        Transitional package for libnvidia-compute-545
rc  libnvidia-compute-535-server:amd64    535.161.07-0ubuntu0.22.04.1                  amd64        NVIDIA libcompute package
rc  libnvidia-compute-545:amd64           550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64        Transitional package for libnvidia-compute-550
ii  libnvidia-compute-550:amd64           550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64        NVIDIA libcompute package
ii  libnvidia-container-tools             1.14.6-1                                     amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64            1.14.6-1                                     amd64        NVIDIA container runtime library
un  libnvidia-decode                      <none>                                       <none>       (no description available)
ii  libnvidia-decode-550:amd64            550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-egl-wayland1:amd64          1:1.1.9-1.1                                  amd64        Wayland EGL External Platform library -- shared library
un  libnvidia-encode                      <none>                                       <none>       (no description available)
ii  libnvidia-encode-550:amd64            550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64        NVENC Video Encoding runtime library
un  libnvidia-encode1                     <none>                                       <none>       (no description available)
un  libnvidia-extra                       <none>                                       <none>       (no description available)
ii  libnvidia-extra-550:amd64             550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64        Extra libraries for the NVIDIA driver
un  libnvidia-fbc1                        <none>                                       <none>       (no description available)
ii  libnvidia-fbc1-550:amd64              550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
un  libnvidia-gl                          <none>                                       <none>       (no description available)
un  libnvidia-gl-390                      <none>                                       <none>       (no description available)
un  libnvidia-gl-410                      <none>                                       <none>       (no description available)
ii  libnvidia-gl-550:amd64                550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
un  libnvidia-legacy-390xx-egl-wayland1   <none>                                       <none>       (no description available)
un  libnvidia-ml.so.1                     <none>                                       <none>       (no description available)
un  nvidia-384                            <none>                                       <none>       (no description available)
un  nvidia-390                            <none>                                       <none>       (no description available)
un  nvidia-common                         <none>                                       <none>       (no description available)
un  nvidia-compute-utils                  <none>                                       <none>       (no description available)
rc  nvidia-compute-utils-535-server       535.161.07-0ubuntu0.22.04.1                  amd64        NVIDIA compute utilities
ii  nvidia-compute-utils-550              550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64        NVIDIA compute utilities
un  nvidia-container-runtime              <none>                                       <none>       (no description available)
un  nvidia-container-runtime-hook         <none>                                       <none>       (no description available)
ii  nvidia-container-toolkit              1.14.6-1                                     amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base         1.14.6-1                                     amd64        NVIDIA Container Toolkit Base
un  nvidia-cuda-dev                       <none>                                       <none>       (no description available)
un  nvidia-cuda-doc                       <none>                                       <none>       (no description available)
un  nvidia-cuda-gdb                       <none>                                       <none>       (no description available)
rc  nvidia-cuda-toolkit                   11.5.1-1ubuntu1                              amd64        NVIDIA CUDA development toolkit
un  nvidia-cuda-toolkit-doc               <none>                                       <none>       (no description available)
rc  nvidia-dkms-535-server                535.161.07-0ubuntu0.22.04.1                  amd64        NVIDIA DKMS package
ii  nvidia-dkms-550                       550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64        NVIDIA DKMS package
rc  nvidia-dkms-550-open                  550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64        NVIDIA DKMS package (open kernel module)
un  nvidia-dkms-kernel                    <none>                                       <none>       (no description available)
ii  nvidia-driver-550                     550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64        NVIDIA driver metapackage
un  nvidia-driver-binary                  <none>                                       <none>       (no description available)
un  nvidia-egl-wayland-common             <none>                                       <none>       (no description available)
un  nvidia-firmware-535-server-535.161.07 <none>                                       <none>       (no description available)
ii  nvidia-firmware-550-550.54.14         550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64        Firmware files used by the kernel module
un  nvidia-kernel-common                  <none>                                       <none>       (no description available)
rc  nvidia-kernel-common-525:amd64        545.29.06-1pop0~1701107297~22.04~7642405~dev amd64        Transitional package for nvidia-kernel-common-545
rc  nvidia-kernel-common-535-server       535.161.07-0ubuntu0.22.04.1                  amd64        Shared files used with the kernel module
rc  nvidia-kernel-common-545:amd64        550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64        Transitional package for nvidia-kernel-common-550
ii  nvidia-kernel-common-550              550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64        Shared files used with the kernel module
un  nvidia-kernel-source                  <none>                                       <none>       (no description available)
un  nvidia-kernel-source-535-server       <none>                                       <none>       (no description available)
ii  nvidia-kernel-source-550              550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64        NVIDIA kernel source package
un  nvidia-kernel-source-550-open         <none>                                       <none>       (no description available)
un  nvidia-libopencl1-dev                 <none>                                       <none>       (no description available)
un  nvidia-opencl-dev                     <none>                                       <none>       (no description available)
un  nvidia-opencl-icd                     <none>                                       <none>       (no description available)
un  nvidia-persistenced                   <none>                                       <none>       (no description available)
un  nvidia-prime                          <none>                                       <none>       (no description available)
un  nvidia-profiler                       <none>                                       <none>       (no description available)
ii  nvidia-settings                       550.54.15-0ubuntu1                           amd64        Tool for configuring the NVIDIA graphics driver
un  nvidia-settings-binary                <none>                                       <none>       (no description available)
un  nvidia-smi                            <none>                                       <none>       (no description available)
un  nvidia-utils                          <none>                                       <none>       (no description available)
ii  nvidia-utils-535:amd64                550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64        Transitional package for nvidia-utils-550
ii  nvidia-utils-550                      550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64        NVIDIA driver support binaries
un  nvidia-visual-profiler                <none>                                       <none>       (no description available)
ii  xserver-xorg-video-nvidia-550         550.54.14-1pop0~1709151545~22.04~c91e06a~dev amd64        NVIDIA binary Xorg driver
  • [X] NVIDIA container library version from nvidia-container-cli -V:
cli-version: 1.14.6
lib-version: 1.14.6
build date: 2024-02-27T20:51+00:00
build revision: d2eb0afe86f0b643e33624ee64f065dd60e952d4
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

justinthelaw · Apr 11 '24 21:04