k8s-device-plugin
Plugin does not detect Tegra device (Jetson Nano)
1. Issue or feature description
Guys, I'm losing my mind. I have a k3s cluster running on 3x Raspberry Pi CM4 and one Jetson Nano.
The NVIDIA runtime environment was detected when I installed K3s, and it was added to /var/lib/rancher/k3s/agent/etc/containerd/config.toml:
version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/k3s/agent/containerd"

[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = false
  enable_unprivileged_icmp = false
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/var/lib/rancher/k3s/data/9088e57b1ba3c37820aaba60202af921dbc01b77ec0ad1e08be86b5c7bc9b8c1/bin"
  conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
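A quick sanity check at this point is to confirm that the binary referenced by BinaryName is actually present on the Jetson node (both paths and commands here are the ones already shown elsewhere in this report):

ls -l /usr/bin/nvidia-container-runtime
nvidia-container-cli -V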
And I can run and detect the GPU in docker and containerd just fine:
docker run --rm --runtime nvidia xift/jetson_devicequery:r32.5.0

or

ctr i pull docker.io/xift/jetson_devicequery:r32.5.0
ctr run --rm --gpus 0 --tty docker.io/xift/jetson_devicequery:r32.5.0 deviceQuery
Returns:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA Tegra X1"
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Capability Major/Minor version number: 5.3
Total amount of global memory: 3963 MBytes (4155203584 bytes)
( 1) Multiprocessors, (128) CUDA Cores/MP: 128 CUDA Cores
GPU Max Clock rate: 922 MHz (0.92 GHz)
Memory Clock rate: 13 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS
When I then install nvidia-device-plugin with:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml
the DaemonSet pod does not detect the GPU on the fourth (Jetson) node:
2023/02/03 10:16:37 Starting FS watcher.
2023/02/03 10:16:37 Starting OS watcher.
2023/02/03 10:16:37 Starting Plugins.
2023/02/03 10:16:37 Loading configuration.
2023/02/03 10:16:37 Updating config with default resource matching patterns.
2023/02/03 10:16:37
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "uuid"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
2023/02/03 10:16:37 Retreiving plugins.
2023/02/03 10:16:37 Detected non-NVML platform: could not load NVML: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
2023/02/03 10:16:37 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2023/02/03 10:16:37 Incompatible platform detected
2023/02/03 10:16:37 If this is a GPU node, did you configure the NVIDIA Container Toolkit?
2023/02/03 10:16:37 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2023/02/03 10:16:37 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2023/02/03 10:16:37 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
2023/02/03 10:16:37 No devices found. Waiting indefinitely.
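With the plugin in this state the node advertises no nvidia.com/gpu capacity; a quick check (assuming cube04 is the Jetson node, as in the output later in this thread) is:

kubectl describe node cube04 | grep -i nvidia.com/gpu

which comes back empty until the plugin successfully registers a device.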
I feel I'm close, but for the life of me I can't get this to work :(
I have tried to deploy the same image that worked locally, pinned to the Jetson node, but it fails:
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-query
spec:
  restartPolicy: OnFailure
  nodeSelector:
    node-type: jetson
  containers:
    - name: nvidia-query
      image: xift/jetson_devicequery:r32.5.0
      command: [ "./deviceQuery" ]
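For the nodeSelector above to match, the Jetson node has to carry the corresponding label; a hypothetical example of how it would have been applied (assuming cube04 is the Jetson node, as the describe output later in this thread shows):

kubectl label node cube04 node-type=jetson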
root@cube01:~# kubectl logs nvidia-query
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL
It almost feels like there are two containerd instances: one that works when I use ctr on that node, and a separate one for k3s, or something... I can't explain why the same containerd engine produces two different results.
3. Information to attach (optional if deemed irrelevant)
vladoportos@cube04:/sys/devices/gpu.0$ dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-==========================================================-==================================-==================================-=========================================================================================================================
un libgldispatch0-nvidia <none> <none> (no description available)
ii libnvidia-container-tools 1.7.0-1 arm64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container0:arm64 0.10.0+jetpack arm64 NVIDIA container runtime library
ii libnvidia-container1:arm64 1.7.0-1 arm64 NVIDIA container runtime library
un nvidia-304 <none> <none> (no description available)
un nvidia-340 <none> <none> (no description available)
un nvidia-384 <none> <none> (no description available)
un nvidia-common <none> <none> (no description available)
ii nvidia-container-csv-cuda 10.2.460-1 arm64 Jetpack CUDA CSV file
ii nvidia-container-csv-cudnn 8.2.1.32-1+cuda10.2 arm64 Jetpack CUDNN CSV file
ii nvidia-container-csv-tensorrt 8.2 arm64 Jetpack TensorRT CSV file
ii nvidia-container-runtime 3.7.0-1 all NVIDIA container runtime
un nvidia-container-runtime-hook <none> <none> (no description available)
ii nvidia-container-toolkit 1.7.0-1 arm64 NVIDIA container runtime hook
un nvidia-docker <none> <none> (no description available)
ii nvidia-docker2 2.8.0-1 all nvidia-docker CLI wrapper
ii nvidia-l4t-3d-core 32.7.3-20221122092935 arm64 NVIDIA GL EGL Package
ii nvidia-l4t-apt-source 32.7.3-20221122092935 arm64 NVIDIA L4T apt source list debian package
ii nvidia-l4t-bootloader 32.7.3-20221122092935 arm64 NVIDIA Bootloader Package
ii nvidia-l4t-camera 32.7.3-20221122092935 arm64 NVIDIA Camera Package
un nvidia-l4t-ccp-t210ref <none> <none> (no description available)
ii nvidia-l4t-configs 32.7.3-20221122092935 arm64 NVIDIA configs debian package
ii nvidia-l4t-core 32.7.3-20221122092935 arm64 NVIDIA Core Package
ii nvidia-l4t-cuda 32.7.3-20221122092935 arm64 NVIDIA CUDA Package
ii nvidia-l4t-firmware 32.7.3-20221122092935 arm64 NVIDIA Firmware Package
ii nvidia-l4t-gputools 32.7.3-20221122092935 arm64 NVIDIA dgpu helper Package
ii nvidia-l4t-graphics-demos 32.7.3-20221122092935 arm64 NVIDIA graphics demo applications
ii nvidia-l4t-gstreamer 32.7.3-20221122092935 arm64 NVIDIA GST Application files
ii nvidia-l4t-init 32.7.3-20221122092935 arm64 NVIDIA Init debian package
ii nvidia-l4t-initrd 32.7.3-20221122092935 arm64 NVIDIA initrd debian package
ii nvidia-l4t-jetson-io 32.7.3-20221122092935 arm64 NVIDIA Jetson.IO debian package
ii nvidia-l4t-jetson-multimedia-api 32.7.3-20221122092935 arm64 NVIDIA Jetson Multimedia API is a collection of lower-level APIs that support flexible application development.
ii nvidia-l4t-kernel 4.9.299-tegra-32.7.3-2022112209293 arm64 NVIDIA Kernel Package
ii nvidia-l4t-kernel-dtbs 4.9.299-tegra-32.7.3-2022112209293 arm64 NVIDIA Kernel DTB Package
ii nvidia-l4t-kernel-headers 4.9.299-tegra-32.7.3-2022112209293 arm64 NVIDIA Linux Tegra Kernel Headers Package
ii nvidia-l4t-libvulkan 32.7.3-20221122092935 arm64 NVIDIA Vulkan Loader Package
ii nvidia-l4t-multimedia 32.7.3-20221122092935 arm64 NVIDIA Multimedia Package
ii nvidia-l4t-multimedia-utils 32.7.3-20221122092935 arm64 NVIDIA Multimedia Package
ii nvidia-l4t-oem-config 32.7.3-20221122092935 arm64 NVIDIA OEM-Config Package
ii nvidia-l4t-tools 32.7.3-20221122092935 arm64 NVIDIA Public Test Tools Package
ii nvidia-l4t-wayland 32.7.3-20221122092935 arm64 NVIDIA Wayland Package
ii nvidia-l4t-weston 32.7.3-20221122092935 arm64 NVIDIA Weston Package
ii nvidia-l4t-x11 32.7.3-20221122092935 arm64 NVIDIA X11 Package
ii nvidia-l4t-xusb-firmware 32.7.3-20221122092935 arm64 NVIDIA USB Firmware Package
un nvidia-libopencl1-dev <none> <none> (no description available)
un nvidia-prime <none> <none> (no description available)
vladoportos@cube04:/sys/devices/gpu.0$ nvidia-container-cli -V
cli-version: 1.7.0
lib-version: 0.10.0+jetpack
build date: 2021-11-30T19:53+00:00
build revision: f37bb387ad05f6e501069d99e4135a97289faf1f
build compiler: aarch64-linux-gnu-gcc-7 7.5.0
build platform: aarch64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
root@cube04:~# systemctl status k3s-agent
● k3s-agent.service - Lightweight Kubernetes
Loaded: loaded (/etc/systemd/system/k3s-agent.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2023-02-03 10:08:05 CET; 1h 16min ago
Docs: https://k3s.io
Process: 5445 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Process: 5440 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
Process: 5418 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
Main PID: 5446 (k3s-agent)
Tasks: 116
CGroup: /system.slice/k3s-agent.service
├─ 5446 /usr/local/bin/k3s agent
├─ 5478 containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/k3s/agent/containerd
├─ 7283 /containerd/var-k3s-rancher/k3s/data/9088e57b1ba3c37820aaba60202af921dbc01b77ec0ad1e08be86b5c7bc9b8c1/bin/containerd-shim-runc-v2 -namespace k8s.io -id dedf850196ac01cba261dd25152e1ec1081487e0027c9cd7335280b9046cb754 -address /run/k3s/co
├─ 7427 /containerd/var-k3s-rancher/k3s/data/9088e57b1ba3c37820aaba60202af921dbc01b77ec0ad1e08be86b5c7bc9b8c1/bin/containerd-shim-runc-v2 -namespace k8s.io -id 8ccc0ae4a3d12bd447314bcdddc78524412c67d12d4d21dabc4b43fc6c4e5557 -address /run/k3s/co
├─ 8325 /containerd/var-k3s-rancher/k3s/data/9088e57b1ba3c37820aaba60202af921dbc01b77ec0ad1e08be86b5c7bc9b8c1/bin/containerd-shim-runc-v2 -namespace k8s.io -id f4e8054e372b87f4799fad81b0aa15c187dc2abc1e64161c66950679048d3219 -address /run/k3s/co
├─ 9034 /containerd/var-k3s-rancher/k3s/data/9088e57b1ba3c37820aaba60202af921dbc01b77ec0ad1e08be86b5c7bc9b8c1/bin/containerd-shim-runc-v2 -namespace k8s.io -id caa99803264ae2123a9dcd0b0600dc314dfbe052ae21ffad53d0b91826379d3c -address /run/k3s/co
├─ 9223 /containerd/var-k3s-rancher/k3s/data/9088e57b1ba3c37820aaba60202af921dbc01b77ec0ad1e08be86b5c7bc9b8c1/bin/containerd-shim-runc-v2 -namespace k8s.io -id 207f091f8498b63b7f7e14d6533216c75728eb57ecae1d1dc358e7b8bcf9ad76 -address /run/k3s/co
└─16746 /containerd/var-k3s-rancher/k3s/data/9088e57b1ba3c37820aaba60202af921dbc01b77ec0ad1e08be86b5c7bc9b8c1/bin/containerd-shim-runc-v2 -namespace k8s.io -id ebecb9818d2a4356fe7d60e601f45b6794477c7eb27c9455440691ce5a9d64ad -address /run/k3s/co
Feb 03 11:16:36 cube04 k3s[5446]: I0203 11:16:36.065894 5446 memory_manager.go:345] "RemoveStaleState removing state" podUID="fd3b03ff-05a8-4347-bd4e-e3b950a96921" containerName="nvidia-query"
Feb 03 11:16:36 cube04 k3s[5446]: I0203 11:16:36.065973 5446 memory_manager.go:345] "RemoveStaleState removing state" podUID="fd3b03ff-05a8-4347-bd4e-e3b950a96921" containerName="nvidia-query"
Feb 03 11:16:36 cube04 k3s[5446]: I0203 11:16:36.066050 5446 memory_manager.go:345] "RemoveStaleState removing state" podUID="fd3b03ff-05a8-4347-bd4e-e3b950a96921" containerName="nvidia-query"
Feb 03 11:16:36 cube04 k3s[5446]: I0203 11:16:36.066124 5446 memory_manager.go:345] "RemoveStaleState removing state" podUID="fd3b03ff-05a8-4347-bd4e-e3b950a96921" containerName="nvidia-query"
Feb 03 11:16:36 cube04 k3s[5446]: I0203 11:16:36.239639 5446 reconciler.go:357] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-kkbwv\" (UniqueName: \"kubernetes.io/projected/ac4fe6a7-039b-44db-b02f-b992f3463d0d-ku
Feb 03 11:16:36 cube04 k3s[5446]: I0203 11:16:36.241210 5446 reconciler.go:357] "operationExecutor.VerifyControllerAttachedVolume started for volume \"device-plugin\" (UniqueName: \"kubernetes.io/host-path/ac4fe6a7-039b-44db-b02f-b992f3463d0d-device-plu
Feb 03 11:18:05 cube04 k3s[5446]: W0203 11:18:05.787683 5446 sysinfo.go:203] Nodes topology is not available, providing CPU topology
Feb 03 11:18:05 cube04 k3s[5446]: W0203 11:18:05.790051 5446 machine.go:65] Cannot read vendor id correctly, set empty.
Feb 03 11:23:05 cube04 k3s[5446]: W0203 11:23:05.786897 5446 sysinfo.go:203] Nodes topology is not available, providing CPU topology
Feb 03 11:23:05 cube04 k3s[5446]: W0203 11:23:05.788848 5446 machine.go:65] Cannot read vendor id correctly, set empty.
Your containerd config is not setting nvidia as the default runtime. The only reason ctr works is that it goes through a different path (i.e., not the CRI plugin that Kubernetes uses) and does not require nvidia to be set as the default runtime (it keys off the fact that you passed --gpus to know what to do with the NVIDIA tooling).
@klueska Ah, OK. I edited the containerd config to use nvidia as the default runtime, which moved me forward a bit, but it still fails:
[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true
  default_runtime_name = "nvidia"
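Note that k3s regenerates /var/lib/rancher/k3s/agent/etc/containerd/config.toml on every start, so a manual edit like the one above can be overwritten. A rough sketch of the usual way to make the change persistent (assuming the stock k3s layout) is to keep a config.toml.tmpl next to it and restart the agent:

cp /var/lib/rancher/k3s/agent/etc/containerd/config.toml /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
# edit config.toml.tmpl to add: default_runtime_name = "nvidia"
systemctl restart k3s-agent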
Name: nvidia-query
Namespace: default
Priority: 0
Service Account: default
Node: cube04/10.0.0.63
Start Time: Fri, 03 Feb 2023 11:52:26 +0100
Labels: <none>
Annotations: <none>
Status: Running
IP: 10.42.1.13
IPs:
IP: 10.42.1.13
Containers:
nvidia-query:
Container ID: containerd://a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07
Image: xift/jetson_devicequery:r32.5.0
Image ID: docker.io/xift/jetson_devicequery@sha256:8a4db3a25008e9ae2ce265b70389b53110b7625eaef101794af05433024c47ee
Port: <none>
Host Port: <none>
Command:
./deviceQuery
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: StartError
Message: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout:
src: /etc/vulkan/icd.d/nvidia_icd.json, src_lnk: /usr/lib/aarch64-linux-gnu/tegra/nvidia_icd.json, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/etc/vulkan/icd.d/nvidia_icd.json, dst_lnk: /usr/lib/aarch64-linux-gnu/tegra/nvidia_icd.json
src: /usr/lib/aarch64-linux-gnu/libcuda.so, src_lnk: tegra/libcuda.so, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libcuda.so, dst_lnk: tegra/libcuda.so
src: /usr/lib/aarch64-linux-gnu/libdrm_nvdc.so, src_lnk: tegra/libdrm.so.2, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libdrm_nvdc.so, dst_lnk: tegra/libdrm.so.2
src: /usr/lib/aarch64-linux-gnu/libv4l2.so.0.0.999999, src_lnk: tegra/libnvv4l2.so, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libv4l2.so.0.0.999999, dst_lnk: tegra/libnvv4l2.so
src: /usr/lib/aarch64-linux-gnu/libv4lconvert.so.0.0.999999, src_lnk: tegra/libnvv4lconvert.so, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libv4lconvert.so.0.0.999999, dst_lnk: tegra/libnvv4lconvert.so
src: /usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvargus.so, src_lnk: ../../../tegra/libv4l2_nvargus.so, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvargus.so, dst_lnk: ../../../tegra/libv4l2_nvargus.so
src: /usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvvidconv.so, src_lnk: ../../../tegra/libv4l2_nvvidconv.so, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvvidconv.so, dst_lnk: ../../../tegra/libv4l2_nvvidconv.so
src: /usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvvideocodec.so, src_lnk: ../../../tegra/libv4l2_nvvideocodec.so, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvvideocodec.so, dst_lnk: ../../../tegra/libv4l2_nvvideocodec.so
src: /usr/lib/aarch64-linux-gnu/libvulkan.so.1.2.141, src_lnk: tegra/libvulkan.so.1.2.141, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libvulkan.so.1.2.141, dst_lnk: tegra/libvulkan.so.1.2.141
src: /usr/lib/aarch64-linux-gnu/tegra/libcuda.so, src_lnk: libcuda.so.1.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/tegra/libcuda.so, dst_lnk: libcuda.so.1.1
And the DaemonSet fails with:
src: /usr/lib/aarch64-linux-gnu/libcudnn_static.a, src_lnk: /etc/alternatives/libcudnn_stlib, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libcudnn_static.a, dst_lnk: /etc/alternatives/libcudnn_stlib
src: /usr/lib/aarch64-linux-gnu/libnvinfer.so.8, src_lnk: libnvinfer.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvinfer.so.8, dst_lnk: libnvinfer.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.8, src_lnk: libnvinfer_plugin.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.8, dst_lnk: libnvinfer_plugin.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvparsers.so.8, src_lnk: libnvparsers.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvparsers.so.8, dst_lnk: libnvparsers.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvonnxparser.so.8, src_lnk: libnvonnxparser.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvonnxparser.so.8, dst_lnk: libnvonnxparser.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvinfer.so, src_lnk: libnvinfer.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvinfer.so, dst_lnk: libnvinfer.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so, src_lnk: libnvinfer_plugin.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so, dst_lnk: libnvinfer_plugin.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvparsers.so, src_lnk: libnvparsers.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvparsers.so, dst_lnk: libnvparsers.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvonnxparser.so, src_lnk: libnvonnxparser.so.8, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvonnxparser.so, dst_lnk: libnvonnxparser.so.8
, stderr: nvidia-container-cli: mount error: open failed: /sys/fs/cgroup/devices/system.slice/k3s-agent.service/kubepods-besteffort-pod541c5001_1e8f_4e6a_9976_ffd80e364373.slice/devices.allow: no such file or directory: unknown
Warning BackOff 8s (x8 over 107s) kubelet Back-off restarting failed container
@elezar is this most recent error fixed by the new toolkit?
@VladoPortos while we wait for Evan to confirm, can you try installing the latest RC of the nvidia-container-toolkit (I believe it’s 1.12-rc.5) to see if this resolves your issue.
Holy cow, it worked! Thanks so much.
I can confirm, updating the repo to experimental and installing:
nvidia-container-toolkit (1.12.0~rc.5-1)
nvidia-container-runtime (3.11.0-1)
nvidia-docker2 (2.11.0-1)
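For reference, a minimal sketch of pulling those exact versions once the experimental apt repository from https://nvidia.github.io/libnvidia-container has been enabled (the repository setup itself follows NVIDIA's install instructions and is assumed here, not shown):

sudo apt update
sudo apt install nvidia-container-toolkit=1.12.0~rc.5-1 nvidia-container-runtime=3.11.0-1 nvidia-docker2=2.11.0-1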
Now the container in k3s works and returns:
root@cube01:~# kubectl logs nvidia-query
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA Tegra X1"
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Capability Major/Minor version number: 5.3
Total amount of global memory: 3963 MBytes (4155203584 bytes)
( 1) Multiprocessors, (128) CUDA Cores/MP: 128 CUDA Cores
GPU Max Clock rate: 922 MHz (0.92 GHz)
Memory Clock rate: 13 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS
Same for the nvidia plugin:
root@cube01:~# kubectl logs nvidia-device-plugin-daemonset-d7zj6 -n kube-system
2023/02/03 11:21:41 Starting FS watcher.
2023/02/03 11:21:41 Starting OS watcher.
2023/02/03 11:21:41 Starting Plugins.
2023/02/03 11:21:41 Loading configuration.
2023/02/03 11:21:41 Updating config with default resource matching patterns.
2023/02/03 11:21:41
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "uuid"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
2023/02/03 11:21:41 Retreiving plugins.
2023/02/03 11:21:41 Detected non-NVML platform: could not load NVML: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
2023/02/03 11:21:41 Detected Tegra platform: /etc/nv_tegra_release found
2023/02/03 11:21:41 Starting GRPC server for 'nvidia.com/gpu'
2023/02/03 11:21:41 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2023/02/03 11:21:41 Registered device plugin for 'nvidia.com/gpu' with Kubelet
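With the plugin registered, a pod can now request the Jetson's GPU through the advertised nvidia.com/gpu resource instead of relying on runtime defaults alone; a minimal sketch reusing the same image (the pod name here is just an example):

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-query-gpu
spec:
  restartPolicy: OnFailure
  containers:
    - name: nvidia-query
      image: xift/jetson_devicequery:r32.5.0
      command: [ "./deviceQuery" ]
      resources:
        limits:
          nvidia.com/gpu: 1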
Great to hear. We should actually be pushing the GA release of 1.12 later today, so you don’t have to run off the RC for long.
Any timeline on when the release of 1.12 will happen? I don't see it when I do an apt update.
It was released last Friday.
Note that, looking at the initial logs you provided, you may have been using v1.7.0 of the NVIDIA Container Toolkit. This is quite an old version, and we greatly improved our support for Tegra-based systems with the v1.10.0 release. It should also be noted that in order to use the GPU device plugin on Tegra-based systems (specifically targeting the integrated GPUs), at least v1.11.0 of the NVIDIA Container Toolkit is required.
There are no Tegra-specific changes in the v1.12.0 release, so using the v1.11.0 release should be sufficient in this case.
It appears that I have v1.7.0 of the NVIDIA Container Toolkit and when I do an "apt upgrade" I'm not seeing any newer versions of the NVIDIA Container Toolkit. How does one get one of the newer versions of the NVIDIA Container Toolkit that will allow this to work with Jetson Nano?
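A quick way to see which toolkit versions the node's configured apt sources actually offer is:

apt-cache policy nvidia-container-toolkit

If only 1.7.0 shows up, the package is most likely still coming from the stock JetPack/L4T sources, and NVIDIA's libnvidia-container apt repository (https://nvidia.github.io/libnvidia-container) still needs to be added, as was done with the experimental repo earlier in this thread.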
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.