
"nvidia-smi": executable file not found in $PATH: unknown

Open devriesewouter89 opened this issue 3 years ago • 5 comments

1. Issue or feature description

When booting a container on k8s (via k3s), I notice the container doesn't contain "nvidia-smi" in /usr/bin or anywhere else. When I launch the same image/container outside of k8s, I do get the command.

vex@vex-slave4:~$ kubectl exec -it gpu -- nvidia-smi
error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "447b3dd0509b66403603e0c66fa7c524259d111afc3db4c41ce59498d58bb8c6": OCI runtime exec failed: exec failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown

2. Steps to reproduce the issue

my pod yaml:

apiVersion: v1
kind: Pod
metadata:
  name: gpu
spec:
  restartPolicy: Never
  containers:
    - name: gpu
      image: "nvidia/cuda:11.4.1-base-ubuntu20.04"
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      resources:
        limits:
          nvidia.com/gpu: 1

Using different base images doesn't change the issue. Yet the weird thing is, if I run the same base image directly via ctr, the nvidia-smi command is recognized: sudo ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.4.1-base-ubuntu18.04 cuda-11.4.1-base-ubuntu18.04 nvidia-smi returns

Tue Nov 22 15:10:28 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro K2200        On   | 00000000:01:00.0 Off |                  N/A |
| 42%   42C    P8     1W /  39W |      1MiB /  4043MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+                                                                         
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • [x] The output of nvidia-smi -a on your host
Tue Nov 22 15:10:28 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro K2200        On   | 00000000:01:00.0 Off |                  N/A |
| 42%   42C    P8     1W /  39W |      1MiB /  4043MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |

  • [ ] Your docker configuration file (e.g: /etc/docker/daemon.json)
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
  • [ ] The k8s-device-plugin container logs

  • [ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

  • [ ] pod description

Name:             gpu
Namespace:        default
Priority:         0
Service Account:  default
Node:             
Start Time:       Tue, 22 Nov 2022 15:28:31 +0100
Labels:           <none>
Annotations:      <none>
Status:           Running
IP:               
IPs:
  IP:  
Containers:
  gpu:
    Container ID:  containerd://68707cec263eb1bfaec27357d9f6c07b2545278183fe875dd5f43ea5de77c1b3
    Image:         nvidia/cuda:11.4.1-base-ubuntu20.04
    Image ID:      docker.io/nvidia/cuda@sha256:a838c93bcb191de297b04a04b6dc8a7c50983243562201a8d057f3ccdb1e7276
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      --
    Args:
      while true; do sleep 30; done;
    State:          Running
      Started:      Tue, 22 Nov 2022 15:28:35 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wtv6z (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kube-api-access-wtv6z:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  47m   default-scheduler  Successfully assigned default/gpu to vex-slave5
  Normal  Pulling    47m   kubelet            Pulling image "nvidia/cuda:11.4.1-base-ubuntu20.04"
  Normal  Pulled     47m   kubelet            Successfully pulled image "nvidia/cuda:11.4.1-base-ubuntu20.04" in 3.297786146s
  Normal  Created    47m   kubelet            Created container gpu
  Normal  Started    47m   kubelet            Started container gpu

Additional information that might help better understand your environment and reproduce the bug:

  • [ ] Docker version from docker version
Client:
 Version:           20.10.5+dfsg1
 API version:       1.41
 Go version:        go1.15.15
 Git commit:        55c4c88
 Built:             Mon May 30 18:34:49 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
  • [ ] Docker command, image and tag used
  • [ ] Kernel version from uname -a Linux vex-slave4 5.10.0-19-amd64 #1 SMP Debian 5.10.149-2 (2022-10-21) x86_64 GNU/Linux
  • [ ] Any relevant kernel output lines from dmesg
[ 1954.607181] cni0: port 1(vethe2dc367a) entered disabled state
[ 1954.608114] device vethe2dc367a left promiscuous mode
[ 1954.608118] cni0: port 1(vethe2dc367a) entered disabled state
[ 1957.373344] cni0: port 1(vethf4f0a873) entered blocking state
[ 1957.373346] cni0: port 1(vethf4f0a873) entered disabled state
[ 1957.374365] device vethf4f0a873 entered promiscuous mode
[ 1957.375452] cni0: port 2(veth01e926e2) entered blocking state
[ 1957.375454] cni0: port 2(veth01e926e2) entered disabled state
[ 1957.376797] device veth01e926e2 entered promiscuous mode
[ 1957.381302] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 1957.381634] IPv6: ADDRCONF(NETDEV_CHANGE): vethf4f0a873: link becomes ready
[ 1957.381705] cni0: port 1(vethf4f0a873) entered blocking state
[ 1957.381706] cni0: port 1(vethf4f0a873) entered forwarding state
[ 1957.383274] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 1957.383580] IPv6: ADDRCONF(NETDEV_CHANGE): veth01e926e2: link becomes ready
[ 1957.383648] cni0: port 2(veth01e926e2) entered blocking state
[ 1957.383650] cni0: port 2(veth01e926e2) entered forwarding state
[ 1957.570109] cni0: port 1(vethf4f0a873) entered disabled state
[ 1957.570963] device vethf4f0a873 left promiscuous mode
[ 1957.570966] cni0: port 1(vethf4f0a873) entered disabled state
[ 1957.602816] cni0: port 2(veth01e926e2) entered disabled state
[ 1957.603670] device veth01e926e2 left promiscuous mode
[ 1957.603672] cni0: port 2(veth01e926e2) entered disabled state
  • [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                   Version              Architecture Description
+++-======================================-====================-============-=================================================================
un  bumblebee-nvidia                       <none>               <none>       (no description available)
un  firmware-nvidia-gsp                    <none>               <none>       (no description available)
un  firmware-nvidia-gsp-470.141.03         <none>               <none>       (no description available)
ii  glx-alternative-nvidia                 1.2.1~deb11u1        amd64        allows the selection of NVIDIA as GLX provider
un  libegl-nvidia-legacy-390xx0            <none>               <none>       (no description available)
un  libegl-nvidia-tesla-418-0              <none>               <none>       (no description available)
un  libegl-nvidia-tesla-450-0              <none>               <none>       (no description available)
un  libegl-nvidia-tesla-470-0              <none>               <none>       (no description available)
ii  libegl-nvidia0:amd64                   470.141.03-1~deb11u1 amd64        NVIDIA binary EGL library
un  libegl1-glvnd-nvidia                   <none>               <none>       (no description available)
un  libegl1-nvidia                         <none>               <none>       (no description available)
un  libgl1-glvnd-nvidia-glx                <none>               <none>       (no description available)
ii  libgl1-nvidia-glvnd-glx:amd64          470.141.03-1~deb11u1 amd64        NVIDIA binary OpenGL/GLX library (GLVND variant)
un  libgl1-nvidia-glx                      <none>               <none>       (no description available)
un  libgl1-nvidia-glx-any                  <none>               <none>       (no description available)
un  libgl1-nvidia-glx-i386                 <none>               <none>       (no description available)
un  libgl1-nvidia-legacy-390xx-glx         <none>               <none>       (no description available)
un  libgl1-nvidia-tesla-418-glx            <none>               <none>       (no description available)
un  libgldispatch0-nvidia                  <none>               <none>       (no description available)
ii  libgles-nvidia1:amd64                  470.141.03-1~deb11u1 amd64        NVIDIA binary OpenGL|ES 1.x library
ii  libgles-nvidia2:amd64                  470.141.03-1~deb11u1 amd64        NVIDIA binary OpenGL|ES 2.x library
un  libgles1-glvnd-nvidia                  <none>               <none>       (no description available)
un  libgles2-glvnd-nvidia                  <none>               <none>       (no description available)
un  libglvnd0-nvidia                       <none>               <none>       (no description available)
ii  libglx-nvidia0:amd64                   470.141.03-1~deb11u1 amd64        NVIDIA binary GLX library
un  libglx0-glvnd-nvidia                   <none>               <none>       (no description available)
ii  libnvidia-cbl:amd64                    470.141.03-1~deb11u1 amd64        NVIDIA binary Vulkan ray tracing (cbl) library
un  libnvidia-cbl-470.141.03               <none>               <none>       (no description available)
un  libnvidia-cfg.so.1                     <none>               <none>       (no description available)
ii  libnvidia-cfg1:amd64                   470.141.03-1~deb11u1 amd64        NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any                     <none>               <none>       (no description available)
ii  libnvidia-container-tools              1.11.0-1             amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64             1.11.0-1             amd64        NVIDIA container runtime library
ii  libnvidia-egl-wayland1:amd64           1:1.1.5-1            amd64        Wayland EGL External Platform library -- shared library
ii  libnvidia-eglcore:amd64                470.141.03-1~deb11u1 amd64        NVIDIA binary EGL core libraries
un  libnvidia-eglcore-470.141.03           <none>               <none>       (no description available)
ii  libnvidia-encode1:amd64                470.141.03-1~deb11u1 amd64        NVENC Video Encoding runtime library
un  libnvidia-gl-390                       <none>               <none>       (no description available)
un  libnvidia-gl-410                       <none>               <none>       (no description available)
ii  libnvidia-glcore:amd64                 470.141.03-1~deb11u1 amd64        NVIDIA binary OpenGL/GLX core libraries
un  libnvidia-glcore-470.141.03            <none>               <none>       (no description available)
ii  libnvidia-glvkspirv:amd64              470.141.03-1~deb11u1 amd64        NVIDIA binary Vulkan Spir-V compiler library
un  libnvidia-glvkspirv-470.141.03         <none>               <none>       (no description available)
un  libnvidia-legacy-340xx-cfg1            <none>               <none>       (no description available)
un  libnvidia-legacy-390xx-cfg1            <none>               <none>       (no description available)
un  libnvidia-legacy-390xx-egl-wayland1    <none>               <none>       (no description available)
un  libnvidia-ml.so.1                      <none>               <none>       (no description available)
ii  libnvidia-ml1:amd64                    470.141.03-1~deb11u1 amd64        NVIDIA Management Library (NVML) runtime library
ii  libnvidia-ptxjitcompiler1:amd64        470.141.03-1~deb11u1 amd64        NVIDIA PTX JIT Compiler library
ii  libnvidia-rtcore:amd64                 470.141.03-1~deb11u1 amd64        NVIDIA binary Vulkan ray tracing (rtcore) library
un  libnvidia-rtcore-470.141.03            <none>               <none>       (no description available)
un  libnvidia-tesla-418-cfg1               <none>               <none>       (no description available)
un  libnvidia-tesla-450-cfg1               <none>               <none>       (no description available)
un  libnvidia-tesla-470-cfg1               <none>               <none>       (no description available)
un  libnvidia-tesla-510-cfg1               <none>               <none>       (no description available)
un  libopengl0-glvnd-nvidia                <none>               <none>       (no description available)
ii  nvidia-alternative                     470.141.03-1~deb11u1 amd64        allows the selection of NVIDIA as GLX provider
un  nvidia-alternative--kmod-alias         <none>               <none>       (no description available)
un  nvidia-alternative-any                 <none>               <none>       (no description available)
un  nvidia-alternative-legacy-173xx        <none>               <none>       (no description available)
un  nvidia-alternative-legacy-71xx         <none>               <none>       (no description available)
un  nvidia-alternative-legacy-96xx         <none>               <none>       (no description available)
un  nvidia-container-runtime               <none>               <none>       (no description available)
un  nvidia-container-runtime-hook          <none>               <none>       (no description available)
ii  nvidia-container-toolkit               1.11.0-1             amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base          1.11.0-1             amd64        NVIDIA Container Toolkit Base
un  nvidia-cuda-mps                        <none>               <none>       (no description available)
un  nvidia-current                         <none>               <none>       (no description available)
un  nvidia-current-updates                 <none>               <none>       (no description available)
ii  nvidia-detect                          470.141.03-1~deb11u1 amd64        NVIDIA GPU detection utility
un  nvidia-docker                          <none>               <none>       (no description available)
ii  nvidia-docker2                         2.11.0-1             all          nvidia-docker CLI wrapper
ii  nvidia-driver                          470.141.03-1~deb11u1 amd64        NVIDIA metapackage
un  nvidia-driver-any                      <none>               <none>       (no description available)
ii  nvidia-driver-bin                      470.141.03-1~deb11u1 amd64        NVIDIA driver support binaries
un  nvidia-driver-bin-470.141.03           <none>               <none>       (no description available)
un  nvidia-driver-binary                   <none>               <none>       (no description available)
ii  nvidia-driver-libs:amd64               470.141.03-1~deb11u1 amd64        NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries)
un  nvidia-driver-libs-any                 <none>               <none>       (no description available)
un  nvidia-driver-libs-nonglvnd            <none>               <none>       (no description available)
ii  nvidia-egl-common                      470.141.03-1~deb11u1 amd64        NVIDIA binary EGL driver - common files
ii  nvidia-egl-icd:amd64                   470.141.03-1~deb11u1 amd64        NVIDIA EGL installable client driver (ICD)
un  nvidia-egl-wayland-common              <none>               <none>       (no description available)
un  nvidia-glx-any                         <none>               <none>       (no description available)
ii  nvidia-installer-cleanup               20151021+13          amd64        cleanup after driver installation with the nvidia-installer
un  nvidia-kernel-470.141.03               <none>               <none>       (no description available)
ii  nvidia-kernel-common                   20151021+13          amd64        NVIDIA binary kernel module support files
ii  nvidia-kernel-dkms                     470.141.03-1~deb11u1 amd64        NVIDIA binary kernel module DKMS source
un  nvidia-kernel-source                   <none>               <none>       (no description available)
ii  nvidia-kernel-support                  470.141.03-1~deb11u1 amd64        NVIDIA binary kernel module support files
un  nvidia-kernel-support--v1              <none>               <none>       (no description available)
un  nvidia-kernel-support-any              <none>               <none>       (no description available)
un  nvidia-legacy-304xx-alternative        <none>               <none>       (no description available)
un  nvidia-legacy-304xx-driver             <none>               <none>       (no description available)
un  nvidia-legacy-340xx-alternative        <none>               <none>       (no description available)
un  nvidia-legacy-390xx-vulkan-icd         <none>               <none>       (no description available)
ii  nvidia-legacy-check                    470.141.03-1~deb11u1 amd64        check for NVIDIA GPUs requiring a legacy driver
ii  nvidia-modprobe                        470.103.01-1~deb11u1 amd64        utility to load NVIDIA kernel modules and create device nodes
un  nvidia-nonglvnd-vulkan-common          <none>               <none>       (no description available)
un  nvidia-nonglvnd-vulkan-icd             <none>               <none>       (no description available)
ii  nvidia-persistenced                    470.103.01-2~deb11u1 amd64        daemon to maintain persistent software state in the NVIDIA driver
ii  nvidia-settings                        470.141.03-1~deb11u1 amd64        tool for configuring the NVIDIA graphics driver
un  nvidia-settings-gtk-470.141.03         <none>               <none>       (no description available)
ii  nvidia-smi                             470.141.03-1~deb11u1 amd64        NVIDIA System Management Interface
ii  nvidia-support                         20151021+13          amd64        NVIDIA binary graphics driver support files
un  nvidia-tesla-418-vulkan-icd            <none>               <none>       (no description available)
un  nvidia-tesla-450-vulkan-icd            <none>               <none>       (no description available)
un  nvidia-tesla-470-vulkan-icd            <none>               <none>       (no description available)
un  nvidia-tesla-alternative               <none>               <none>       (no description available)
ii  nvidia-vdpau-driver:amd64              470.141.03-1~deb11u1 amd64        Video Decode and Presentation API for Unix - NVIDIA driver
ii  nvidia-vulkan-common                   470.141.03-1~deb11u1 amd64        NVIDIA Vulkan driver - common files
ii  nvidia-vulkan-icd:amd64                470.141.03-1~deb11u1 amd64        NVIDIA Vulkan installable client driver (ICD)
un  nvidia-vulkan-icd-any                  <none>               <none>       (no description available)
ii  xserver-xorg-video-nvidia              470.141.03-1~deb11u1 amd64        NVIDIA binary Xorg driver
un  xserver-xorg-video-nvidia-any          <none>               <none>       (no description available)
un  xserver-xorg-video-nvidia-legacy-304xx <none>               <none>       (no description available)
  • [ ] NVIDIA container library version from nvidia-container-cli -V
cli-version: 1.11.0
lib-version: 1.11.0
build date: 2022-09-06T09:21+00:00
build revision: c8f267be0bac1c654d59ad4ea5df907141149977
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

devriesewouter89 avatar Nov 22 '22 15:11 devriesewouter89

I'm assuming you are using containerd, not docker, as the runtime you have configured for kubernetes (that has been the default since v1.20).

Do you have nvidia set up as your default runtime for containerd, as described here? https://github.com/NVIDIA/k8s-device-plugin#configure-containerd

The path used by ctr and the way kubernetes hooks into containerd are different, so if it works under ctr that doesn't mean it will work under k8s. You need to have containerd's cri plugin configured to use the nvidia runtime by default, as described in the link above.
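
For reference, the linked section boils down to making nvidia the default runtime in /etc/containerd/config.toml. A minimal sketch of what that looks like (assuming containerd config version 2 and the standard install path for nvidia-container-runtime; check the linked docs for your containerd version):

version = 2
[plugins."io.containerd.grpc.v1.cri".containerd]
  # make the NVIDIA runtime the default so every container gets the injection hook
  default_runtime_name = "nvidia"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"

Restart containerd afterwards (e.g. sudo systemctl restart containerd) so the change takes effect.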

klueska avatar Nov 22 '22 15:11 klueska

@devriesewouter89 note that k3s uses a specific containerd config template and configures the NVIDIA Container Runtime at startup if it is installed on the system. Note that this doesn't set the default runtime. One option is to use a RuntimeClass when launching pods that are supposed to have access to GPUs.
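
To see whether k3s actually picked the runtime up, you can inspect the containerd config it generates at startup. A quick check (the path assumes a default k3s install; the file is regenerated each time the k3s service starts, so restart k3s after installing the toolkit):

# look for the nvidia runtime entry in the k3s-managed containerd config
grep -i -A3 nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml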

elezar avatar Nov 30 '22 14:11 elezar

See also https://github.com/NVIDIA/k8s-device-plugin/issues/306

elezar avatar Nov 30 '22 14:11 elezar

Throwing in my comment: exact same use case, and I've been in the OP's exact shoes. If anyone stumbles on this, make sure you follow the docs and create a RuntimeClass for NVIDIA. @elezar is right: K3s does most of the hookups for you, so you no longer need to modify TOMLs or templates for nvidia, but you do need to create a RuntimeClass like:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia

Then reference that class in the pod spec for your containers:

spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
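
For completeness, the OP's pod from this issue with the runtime class added would look roughly like this (assuming the RuntimeClass above has already been applied to the cluster):

apiVersion: v1
kind: Pod
metadata:
  name: gpu
spec:
  restartPolicy: Never
  runtimeClassName: nvidia   # run this pod with the nvidia runtime handler
  containers:
    - name: gpu
      image: "nvidia/cuda:11.4.1-base-ubuntu20.04"
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      resources:
        limits:
          nvidia.com/gpu: 1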

AgentScrubbles avatar Dec 24 '23 17:12 AgentScrubbles

Thank you so much! You just ended my hours-long search. I appreciate you taking the time to help us newbies out.

Lodeon avatar May 01 '24 23:05 Lodeon

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Feb 11 '25 04:02 github-actions[bot]