"nvidia-smi": executable file not found in $PATH: unknown
1. Issue or feature description
When booting a container on k8s (via k3s) I notice the container doesn't contain nvidia-smi in /usr/bin or anywhere else. When I launch the same image/container outside of k8s, the command is present.
vex@vex-slave4:~$ kubectl exec -it gpu -- nvidia-smi
error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "447b3dd0509b66403603e0c66fa7c524259d111afc3db4c41ce59498d58bb8c6": OCI runtime exec failed: exec failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown
2. Steps to reproduce the issue
my pod manifest:
apiVersion: v1
kind: Pod
metadata:
  name: gpu
spec:
  restartPolicy: Never
  containers:
    - name: gpu
      image: "nvidia/cuda:11.4.1-base-ubuntu20.04"
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      resources:
        limits:
          nvidia.com/gpu: 1
Using different base images doesn't change the issue. The weird thing is that if I run the same base image directly via ctr, the nvidia-smi command is recognized:
sudo ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.4.1-base-ubuntu18.04 cuda-11.4.1-base-ubuntu18.04 nvidia-smi
returns:
Tue Nov 22 15:10:28 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro K2200 On | 00000000:01:00.0 Off | N/A |
| 42% 42C P8 1W / 39W | 1MiB / 4043MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
3. Information to attach (optional if deemed irrelevant)
Common error checking:
- [x] The output of nvidia-smi -a on your host
Tue Nov 22 15:10:28 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro K2200 On | 00000000:01:00.0 Off | N/A |
| 42% 42C P8 1W / 39W | 1MiB / 4043MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
- [ ] Your docker configuration file (e.g: /etc/docker/daemon.json)
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
- [ ] The k8s-device-plugin container logs
- [ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
- [ ] pod description
Name: gpu
Namespace: default
Priority: 0
Service Account: default
Node:
Start Time: Tue, 22 Nov 2022 15:28:31 +0100
Labels: <none>
Annotations: <none>
Status: Running
IP:
IPs:
IP:
Containers:
gpu:
Container ID: containerd://68707cec263eb1bfaec27357d9f6c07b2545278183fe875dd5f43ea5de77c1b3
Image: nvidia/cuda:11.4.1-base-ubuntu20.04
Image ID: docker.io/nvidia/cuda@sha256:a838c93bcb191de297b04a04b6dc8a7c50983243562201a8d057f3ccdb1e7276
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
--
Args:
while true; do sleep 30; done;
State: Running
Started: Tue, 22 Nov 2022 15:28:35 +0100
Ready: True
Restart Count: 0
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wtv6z (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-wtv6z:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 47m default-scheduler Successfully assigned default/gpu to vex-slave5
Normal Pulling 47m kubelet Pulling image "nvidia/cuda:11.4.1-base-ubuntu20.04"
Normal Pulled 47m kubelet Successfully pulled image "nvidia/cuda:11.4.1-base-ubuntu20.04" in 3.297786146s
Normal Created 47m kubelet Created container gpu
Normal Started 47m kubelet Started container gpu
Additional information that might help better understand your environment and reproduce the bug:
- [ ] Docker version from docker version
Client:
Version: 20.10.5+dfsg1
API version: 1.41
Go version: go1.15.15
Git commit: 55c4c88
Built: Mon May 30 18:34:49 2022
OS/Arch: linux/amd64
Context: default
Experimental: true
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
- [ ] Docker command, image and tag used
- [ ] Kernel version from uname -a
Linux vex-slave4 5.10.0-19-amd64 #1 SMP Debian 5.10.149-2 (2022-10-21) x86_64 GNU/Linux
- [ ] Any relevant kernel output lines from dmesg
[ 1954.607181] cni0: port 1(vethe2dc367a) entered disabled state
[ 1954.608114] device vethe2dc367a left promiscuous mode
[ 1954.608118] cni0: port 1(vethe2dc367a) entered disabled state
[ 1957.373344] cni0: port 1(vethf4f0a873) entered blocking state
[ 1957.373346] cni0: port 1(vethf4f0a873) entered disabled state
[ 1957.374365] device vethf4f0a873 entered promiscuous mode
[ 1957.375452] cni0: port 2(veth01e926e2) entered blocking state
[ 1957.375454] cni0: port 2(veth01e926e2) entered disabled state
[ 1957.376797] device veth01e926e2 entered promiscuous mode
[ 1957.381302] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 1957.381634] IPv6: ADDRCONF(NETDEV_CHANGE): vethf4f0a873: link becomes ready
[ 1957.381705] cni0: port 1(vethf4f0a873) entered blocking state
[ 1957.381706] cni0: port 1(vethf4f0a873) entered forwarding state
[ 1957.383274] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 1957.383580] IPv6: ADDRCONF(NETDEV_CHANGE): veth01e926e2: link becomes ready
[ 1957.383648] cni0: port 2(veth01e926e2) entered blocking state
[ 1957.383650] cni0: port 2(veth01e926e2) entered forwarding state
[ 1957.570109] cni0: port 1(vethf4f0a873) entered disabled state
[ 1957.570963] device vethf4f0a873 left promiscuous mode
[ 1957.570966] cni0: port 1(vethf4f0a873) entered disabled state
[ 1957.602816] cni0: port 2(veth01e926e2) entered disabled state
[ 1957.603670] device veth01e926e2 left promiscuous mode
[ 1957.603672] cni0: port 2(veth01e926e2) entered disabled state
- [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-======================================-====================-============-=================================================================
un bumblebee-nvidia <none> <none> (no description available)
un firmware-nvidia-gsp <none> <none> (no description available)
un firmware-nvidia-gsp-470.141.03 <none> <none> (no description available)
ii glx-alternative-nvidia 1.2.1~deb11u1 amd64 allows the selection of NVIDIA as GLX provider
un libegl-nvidia-legacy-390xx0 <none> <none> (no description available)
un libegl-nvidia-tesla-418-0 <none> <none> (no description available)
un libegl-nvidia-tesla-450-0 <none> <none> (no description available)
un libegl-nvidia-tesla-470-0 <none> <none> (no description available)
ii libegl-nvidia0:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary EGL library
un libegl1-glvnd-nvidia <none> <none> (no description available)
un libegl1-nvidia <none> <none> (no description available)
un libgl1-glvnd-nvidia-glx <none> <none> (no description available)
ii libgl1-nvidia-glvnd-glx:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary OpenGL/GLX library (GLVND variant)
un libgl1-nvidia-glx <none> <none> (no description available)
un libgl1-nvidia-glx-any <none> <none> (no description available)
un libgl1-nvidia-glx-i386 <none> <none> (no description available)
un libgl1-nvidia-legacy-390xx-glx <none> <none> (no description available)
un libgl1-nvidia-tesla-418-glx <none> <none> (no description available)
un libgldispatch0-nvidia <none> <none> (no description available)
ii libgles-nvidia1:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary OpenGL|ES 1.x library
ii libgles-nvidia2:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary OpenGL|ES 2.x library
un libgles1-glvnd-nvidia <none> <none> (no description available)
un libgles2-glvnd-nvidia <none> <none> (no description available)
un libglvnd0-nvidia <none> <none> (no description available)
ii libglx-nvidia0:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary GLX library
un libglx0-glvnd-nvidia <none> <none> (no description available)
ii libnvidia-cbl:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary Vulkan ray tracing (cbl) library
un libnvidia-cbl-470.141.03 <none> <none> (no description available)
un libnvidia-cfg.so.1 <none> <none> (no description available)
ii libnvidia-cfg1:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary OpenGL/GLX configuration library
un libnvidia-cfg1-any <none> <none> (no description available)
ii libnvidia-container-tools 1.11.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.11.0-1 amd64 NVIDIA container runtime library
ii libnvidia-egl-wayland1:amd64 1:1.1.5-1 amd64 Wayland EGL External Platform library -- shared library
ii libnvidia-eglcore:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary EGL core libraries
un libnvidia-eglcore-470.141.03 <none> <none> (no description available)
ii libnvidia-encode1:amd64 470.141.03-1~deb11u1 amd64 NVENC Video Encoding runtime library
un libnvidia-gl-390 <none> <none> (no description available)
un libnvidia-gl-410 <none> <none> (no description available)
ii libnvidia-glcore:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary OpenGL/GLX core libraries
un libnvidia-glcore-470.141.03 <none> <none> (no description available)
ii libnvidia-glvkspirv:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary Vulkan Spir-V compiler library
un libnvidia-glvkspirv-470.141.03 <none> <none> (no description available)
un libnvidia-legacy-340xx-cfg1 <none> <none> (no description available)
un libnvidia-legacy-390xx-cfg1 <none> <none> (no description available)
un libnvidia-legacy-390xx-egl-wayland1 <none> <none> (no description available)
un libnvidia-ml.so.1 <none> <none> (no description available)
ii libnvidia-ml1:amd64 470.141.03-1~deb11u1 amd64 NVIDIA Management Library (NVML) runtime library
ii libnvidia-ptxjitcompiler1:amd64 470.141.03-1~deb11u1 amd64 NVIDIA PTX JIT Compiler library
ii libnvidia-rtcore:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary Vulkan ray tracing (rtcore) library
un libnvidia-rtcore-470.141.03 <none> <none> (no description available)
un libnvidia-tesla-418-cfg1 <none> <none> (no description available)
un libnvidia-tesla-450-cfg1 <none> <none> (no description available)
un libnvidia-tesla-470-cfg1 <none> <none> (no description available)
un libnvidia-tesla-510-cfg1 <none> <none> (no description available)
un libopengl0-glvnd-nvidia <none> <none> (no description available)
ii nvidia-alternative 470.141.03-1~deb11u1 amd64 allows the selection of NVIDIA as GLX provider
un nvidia-alternative--kmod-alias <none> <none> (no description available)
un nvidia-alternative-any <none> <none> (no description available)
un nvidia-alternative-legacy-173xx <none> <none> (no description available)
un nvidia-alternative-legacy-71xx <none> <none> (no description available)
un nvidia-alternative-legacy-96xx <none> <none> (no description available)
un nvidia-container-runtime <none> <none> (no description available)
un nvidia-container-runtime-hook <none> <none> (no description available)
ii nvidia-container-toolkit 1.11.0-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.11.0-1 amd64 NVIDIA Container Toolkit Base
un nvidia-cuda-mps <none> <none> (no description available)
un nvidia-current <none> <none> (no description available)
un nvidia-current-updates <none> <none> (no description available)
ii nvidia-detect 470.141.03-1~deb11u1 amd64 NVIDIA GPU detection utility
un nvidia-docker <none> <none> (no description available)
ii nvidia-docker2 2.11.0-1 all nvidia-docker CLI wrapper
ii nvidia-driver 470.141.03-1~deb11u1 amd64 NVIDIA metapackage
un nvidia-driver-any <none> <none> (no description available)
ii nvidia-driver-bin 470.141.03-1~deb11u1 amd64 NVIDIA driver support binaries
un nvidia-driver-bin-470.141.03 <none> <none> (no description available)
un nvidia-driver-binary <none> <none> (no description available)
ii nvidia-driver-libs:amd64 470.141.03-1~deb11u1 amd64 NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries)
un nvidia-driver-libs-any <none> <none> (no description available)
un nvidia-driver-libs-nonglvnd <none> <none> (no description available)
ii nvidia-egl-common 470.141.03-1~deb11u1 amd64 NVIDIA binary EGL driver - common files
ii nvidia-egl-icd:amd64 470.141.03-1~deb11u1 amd64 NVIDIA EGL installable client driver (ICD)
un nvidia-egl-wayland-common <none> <none> (no description available)
un nvidia-glx-any <none> <none> (no description available)
ii nvidia-installer-cleanup 20151021+13 amd64 cleanup after driver installation with the nvidia-installer
un nvidia-kernel-470.141.03 <none> <none> (no description available)
ii nvidia-kernel-common 20151021+13 amd64 NVIDIA binary kernel module support files
ii nvidia-kernel-dkms 470.141.03-1~deb11u1 amd64 NVIDIA binary kernel module DKMS source
un nvidia-kernel-source <none> <none> (no description available)
ii nvidia-kernel-support 470.141.03-1~deb11u1 amd64 NVIDIA binary kernel module support files
un nvidia-kernel-support--v1 <none> <none> (no description available)
un nvidia-kernel-support-any <none> <none> (no description available)
un nvidia-legacy-304xx-alternative <none> <none> (no description available)
un nvidia-legacy-304xx-driver <none> <none> (no description available)
un nvidia-legacy-340xx-alternative <none> <none> (no description available)
un nvidia-legacy-390xx-vulkan-icd <none> <none> (no description available)
ii nvidia-legacy-check 470.141.03-1~deb11u1 amd64 check for NVIDIA GPUs requiring a legacy driver
ii nvidia-modprobe 470.103.01-1~deb11u1 amd64 utility to load NVIDIA kernel modules and create device nodes
un nvidia-nonglvnd-vulkan-common <none> <none> (no description available)
un nvidia-nonglvnd-vulkan-icd <none> <none> (no description available)
ii nvidia-persistenced 470.103.01-2~deb11u1 amd64 daemon to maintain persistent software state in the NVIDIA driver
ii nvidia-settings 470.141.03-1~deb11u1 amd64 tool for configuring the NVIDIA graphics driver
un nvidia-settings-gtk-470.141.03 <none> <none> (no description available)
ii nvidia-smi 470.141.03-1~deb11u1 amd64 NVIDIA System Management Interface
ii nvidia-support 20151021+13 amd64 NVIDIA binary graphics driver support files
un nvidia-tesla-418-vulkan-icd <none> <none> (no description available)
un nvidia-tesla-450-vulkan-icd <none> <none> (no description available)
un nvidia-tesla-470-vulkan-icd <none> <none> (no description available)
un nvidia-tesla-alternative <none> <none> (no description available)
ii nvidia-vdpau-driver:amd64 470.141.03-1~deb11u1 amd64 Video Decode and Presentation API for Unix - NVIDIA driver
ii nvidia-vulkan-common 470.141.03-1~deb11u1 amd64 NVIDIA Vulkan driver - common files
ii nvidia-vulkan-icd:amd64 470.141.03-1~deb11u1 amd64 NVIDIA Vulkan installable client driver (ICD)
un nvidia-vulkan-icd-any <none> <none> (no description available)
ii xserver-xorg-video-nvidia 470.141.03-1~deb11u1 amd64 NVIDIA binary Xorg driver
un xserver-xorg-video-nvidia-any <none> <none> (no description available)
un xserver-xorg-video-nvidia-legacy-304xx <none> <none> (no description available)
- [ ] NVIDIA container library version from nvidia-container-cli -V
cli-version: 1.11.0
lib-version: 1.11.0
build date: 2022-09-06T09:21+00:00
build revision: c8f267be0bac1c654d59ad4ea5df907141149977
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
- [ ] NVIDIA container library logs (see troubleshooting)
I'm assuming you are using containerd, not docker, as the runtime configured for kubernetes (that has been the default since v1.20).
Do you have nvidia set up as your default runtime for containerd as described here:
https://github.com/NVIDIA/k8s-device-plugin#configure-containerd
The path used by ctr and the way kubernetes hooks into containerd are different, so if it works under ctr that doesn't mean it will work under k8s. You need to have containerd's cri plugin configured to use the nvidia runtime by default, as described in the link above.
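For a stock containerd install, the linked README boils down to roughly the following in /etc/containerd/config.toml (a sketch based on those docs; the BinaryName path and config location may differ on your system, and as the next comments note, k3s generates its own containerd config, so the RuntimeClass approach below applies there instead):
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd]
  # make the nvidia runtime the default for pods launched via the CRI plugin
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
After editing, restart containerd (e.g. sudo systemctl restart containerd) so the CRI plugin picks up the new default runtime.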
@devriesewouter89 note that k3s uses a specific containerd config template and configures the NVIDIA Container Runtime at startup if it is installed on the system. Note that this doesn't set the default runtime. One option is to use a RuntimeClass when launching pods that are supposed to have access to GPUs.
See also https://github.com/NVIDIA/k8s-device-plugin/issues/306
Throwing in my comment: exact same use case, and I've been in the OP's exact shoes. If anyone stumbles on this, make sure you follow the docs and create a RuntimeClass for NVIDIA. @elezar is right: K3s will do most of the hookup for you, so you no longer need to modify TOMLs or templates for NVIDIA, but you do need to create a RuntimeClass like:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
Then for your containers you can use that class specifically as:
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
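For reference, here is a minimal sketch of the OP's original pod from section 2 with that RuntimeClass applied, assuming the nvidia RuntimeClass above has been created and the k8s-device-plugin is advertising nvidia.com/gpu on the node:
apiVersion: v1
kind: Pod
metadata:
  name: gpu
spec:
  restartPolicy: Never
  runtimeClassName: nvidia   # run this pod with the NVIDIA container runtime
  containers:
    - name: gpu
      image: "nvidia/cuda:11.4.1-base-ubuntu20.04"
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      resources:
        limits:
          nvidia.com/gpu: 1
With this in place, kubectl exec -it gpu -- nvidia-smi should find the binary, since the NVIDIA runtime mounts nvidia-smi and the driver libraries from the host into the container.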
Thank you so much! You just ended my hours-long search. I appreciate you taking the time to help us newbies out.