k8s-device-plugin
Docker image for Nvidia Jetson Nano
Hi all, I have a feature request for using this plugin on the Jetson Nano. I hope you can help, and I'd be happy to test!
1. Issue or feature description
I'd like to use Kubernetes on NVIDIA Jetson Nano boards. With the July 2019 release of JetPack, nvidia-docker (in beta) works fine and is set as my default Docker runtime. However, the nvidia/k8s-device-plugin container image is NOT distributed for the ARM64/v8 architecture, neither for the version specified in nvidia-device-plugin.yaml nor for any version on Docker Hub (only x86_64 and ppc64le): https://hub.docker.com/r/nvidia/k8s-device-plugin
2. Steps to reproduce the issue
On a Jetson Nano (with the docker daemon started and the current user in the docker group):
vdups@jetsonk3sgpu01:~$ grep image: /tmp/nvidia-device-plugin.yaml
        - image: nvidia/k8s-device-plugin:1.0.0-beta
vdups@jetsonk3sgpu01:~$ docker pull nvidia/k8s-device-plugin:1.0.0-beta
1.0.0-beta: Pulling from nvidia/k8s-device-plugin
**no matching manifest for linux/arm64/v8 in the manifest list entries**
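To double-check which platforms a tag is actually published for, the manifest list can be inspected directly. This is a hedged sketch (it needs the experimental Docker CLI, and the output layout may vary between Docker versions):

```
# Enable the experimental CLI so that `docker manifest` is available
export DOCKER_CLI_EXPERIMENTAL=enabled

# List the platform entries in the manifest list for this tag;
# based on the pull error above, only amd64 and ppc64le are expected, no arm64.
docker manifest inspect nvidia/k8s-device-plugin:1.0.0-beta | grep -A 3 '"platform"'
```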
3. Information to attach (optional if deemed irrelevant)
Jetson Nano base image used: r32.2, downloaded in July 2019 from https://developer.nvidia.com/jetson-nano-sd-card-image-r322
I hope you can provide some help. I did not find similar open issues on GitHub.
Common error checking:
- [ ] The output of `nvidia-smi -a` on your host
- [ ] Your docker configuration file (e.g. `/etc/docker/daemon.json`; a sample is shown right after this list)
- [ ] The k8s-device-plugin container logs
- [ ] The kubelet logs on the node (e.g. `sudo journalctl -r -u kubelet`)
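For completeness, here is the kind of daemon.json the plugin's prerequisites expect, with nvidia set as the default runtime. This is a hedged example rather than a copy of my file, and the runtime path is an assumption that may differ between JetPack releases:

```
$ cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```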
Additional information that might help better understand your environment and reproduce the bug:
- [ ] Docker version from `docker version`
- [ ] Docker command, image and tag used
- [x] Kernel version from `uname -a`
vdups@jetsonk3sgpu01:~$ uname -a
Linux jetsonk3sgpu01 4.9.140-tegra #1 SMP PREEMPT Tue Jul 16 17:04:49 PDT 2019 aarch64 aarch64 aarch64 GNU/Linux
- [ ] Any relevant kernel output lines from `dmesg`
- [x] NVIDIA packages version from `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'`
vdups@jetsonk3sgpu01:~$ dpkg -l *nvidia*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                   Version          Architecture     Description
+++-======================-================-================-=================================================
un  libgldispatch0-nvidia  <none>           <none>           (no description available)
ii  libnvidia-container-to 0.9.0~beta.1     arm64            NVIDIA container runtime library (command-line to
ii  libnvidia-container0:a 0.9.0~beta.1     arm64            NVIDIA container runtime library
un  nvidia-304             <none>           <none>           (no description available)
un  nvidia-340             <none>           <none>           (no description available)
un  nvidia-384             <none>           <none>           (no description available)
un  nvidia-common          <none>           <none>           (no description available)
ii  nvidia-container-runti 0.9.0~beta.1+doc arm64            NVIDIA container runtime
ii  nvidia-container-runti 0.9.0~beta.1-1   arm64            NVIDIA container runtime hook
un  nvidia-cuda-dev        <none>           <none>           (no description available)
un  nvidia-docker          <none>           <none>           (no description available)
rc  nvidia-docker2         0.9.0~beta.1+doc all              nvidia-docker CLI wrapper
ii  nvidia-l4t-3d-core     32.2.0-201907161 arm64            NVIDIA GL EGL Package
ii  nvidia-l4t-apt-source  32.2.0-201907161 arm64            NVIDIA L4T apt source list debian package
ii  nvidia-l4t-bootloader  32.2.0-201907161 arm64            NVIDIA Bootloader Package
ii  nvidia-l4t-camera      32.2.0-201907161 arm64            NVIDIA Camera Package
ii  nvidia-l4t-ccp-t210ref 32.2.0-201907161 arm64            NVIDIA Compatibility Checking Package
ii  nvidia-l4t-configs     32.2.0-201907161 arm64            NVIDIA configs debian package
ii  nvidia-l4t-core        32.2.0-201907161 arm64            NVIDIA Core Package
ii  nvidia-l4t-cuda        32.2.0-201907161 arm64            NVIDIA CUDA Package
ii  nvidia-l4t-firmware    32.2.0-201907161 arm64            NVIDIA Firmware Package
ii  nvidia-l4t-graphics-de 32.2.0-201907161 arm64            NVIDIA graphics demo applications
ii  nvidia-l4t-gstreamer   32.2.0-201907161 arm64            NVIDIA GST Application files
ii  nvidia-l4t-init        32.2.0-201907161 arm64            NVIDIA Init debian package
ii  nvidia-l4t-kernel      4.9.140-tegra-32 arm64            NVIDIA Kernel Package
ii  nvidia-l4t-kernel-dtbs 4.9.140-tegra-32 arm64            NVIDIA Kernel DTB Package
ii  nvidia-l4t-kernel-head 4.9.140-tegra-32 arm64            NVIDIA Linux Tegra Kernel Headers Package
ii  nvidia-l4t-multimedia  32.2.0-201907161 arm64            NVIDIA Multimedia Package
ii  nvidia-l4t-multimedia- 32.2.0-201907161 arm64            NVIDIA Multimedia Package
ii  nvidia-l4t-oem-config  32.2.0-201907161 arm64            NVIDIA OEM-Config Package
ii  nvidia-l4t-tools       32.2.0-201907161 arm64            NVIDIA Public Test Tools Package
ii  nvidia-l4t-wayland     32.2.0-201907161 arm64            NVIDIA Wayland Package
ii  nvidia-l4t-weston      32.2.0-201907161 arm64            NVIDIA Weston Package
ii  nvidia-l4t-x11         32.2.0-201907161 arm64            NVIDIA X11 Package
ii  nvidia-l4t-xusb-firmwa 32.2.0-201907161 arm64            NVIDIA USB Firmware Package
un  nvidia-libopencl1-dev  <none>           <none>           (no description available)
un  nvidia-prime           <none>           <none>           (no description available)
vdups@jetsonk3sgpu01:~$
- [ ] NVIDIA container library version from `nvidia-container-cli -V`
- [ ] NVIDIA container library logs (see troubleshooting)
Adapting the amd64 Dockerfile to pull arm64 golang dependencies gets past the docker build phase...
vdups@jetsonk3sgpu01:~$ git clone https://github.com/NVIDIA/k8s-device-plugin.git && cd k8s-device-plugin
Cloning into 'k8s-device-plugin'...
remote: Enumerating objects: 1272, done.
remote: Total 1272 (delta 0), reused 0 (delta 0), pack-reused 1272
Receiving objects: 100% (1272/1272), 2.11 MiB | 1.59 MiB/s, done.
Resolving deltas: 100% (465/465), done.
vdups@jetsonk3sgpu01:~/k8s-device-plugin$ git checkout 1.0.0-beta
Note: checking out '1.0.0-beta'.
You are in 'detached HEAD' state. You can look around, make experimental changes and commit them, and you can discard any commits you make in this state without impacting any branches by performing another checkout.
If you want to create a new branch to retain commits you create, you may do so (now or later) by using -b with the checkout command again. Example:
git checkout -b
HEAD is now at 9fa83cd Reversion the project
vdups@jetsonk3sgpu01:~/k8s-device-plugin$ ls -l Dockerfile
lrwxrwxrwx 1 vdups vdups 35 sept. 4 21:52 Dockerfile -> docker/ubuntu16.04/amd64/Dockerfile
vdups@jetsonk3sgpu01:~/k8s-device-plugin$ cp -pr docker/ubuntu16.04/{amd64,arm64}
vdups@jetsonk3sgpu01:~/k8s-device-plugin$ sed -i 's;amd64;arm64;g' docker/ubuntu16.04/arm64/Dockerfile
vdups@jetsonk3sgpu01:~/k8s-device-plugin$ vim !$
vdups@jetsonk3sgpu01:~/k8s-device-plugin$ rm Dockerfile
vdups@jetsonk3sgpu01:~/k8s-device-plugin$ ln -s docker/ubuntu16.04/arm64/Dockerfile Dockerfile
vdups@jetsonk3sgpu01:~/k8s-device-plugin$ docker build -t nvidia/k8s-device-plugin:1.0.0-beta .
Sending build context to Docker daemon 14.27MB
Step 1/14 : FROM ubuntu:16.04 as build
---> b63658c0b8e9
Step 2/14 : RUN apt-get update && apt-get install -y --no-install-recommends g++ ca-certificates wget && rm -rf /var/lib/apt/lists/*
---> Using cache
---> 0c6b75184c36
Step 3/14 : ENV GOLANG_VERSION 1.10.3
---> Using cache
---> 2960d5abc56c
Step 4/14 : RUN wget -nv -O - https://storage.googleapis.com/golang/go${GOLANG_VERSION}.linux-arm64.tar.gz | tar -C /usr/local -xz
---> Running in ad5690dc041a
2019-09-04 19:55:49 URL:https://storage.googleapis.com/golang/go1.10.3.linux-arm64.tar.gz [115054972/115054972] -> "-" [1]
Removing intermediate container ad5690dc041a
---> 28db46d7b229
Step 5/14 : ENV GOPATH /go
---> Running in 9930b131f814
Removing intermediate container 9930b131f814
---> 7e4217da1590
Step 6/14 : ENV PATH $GOPATH/bin:/usr/local/go/bin:$PATH
---> Running in 14103ad5544d
Removing intermediate container 14103ad5544d
---> 9570d3b2caa8
Step 7/14 : WORKDIR /go/src/nvidia-device-plugin
---> Running in 854d4b9a30c8
Removing intermediate container 854d4b9a30c8
---> a19ee8056aef
Step 8/14 : COPY . .
---> 924dadf0cd6c
Step 9/14 : RUN export CGO_LDFLAGS_ALLOW='-Wl,--unresolved-symbols=ignore-in-object-files' && go install -ldflags="-s -w" -v nvidia-device-plugin
---> Running in 47c71461a642
nvidia-device-plugin/vendor/google.golang.org/grpc/resolver
nvidia-device-plugin/vendor/google.golang.org/grpc/internal
nvidia-device-plugin/vendor/golang.org/x/text/transform
nvidia-device-plugin/vendor/github.com/gogo/protobuf/sortkeys
nvidia-device-plugin/vendor/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml
nvidia-device-plugin/vendor/golang.org/x/sys/unix
nvidia-device-plugin/vendor/golang.org/x/net/context
nvidia-device-plugin/vendor/github.com/golang/protobuf/proto
nvidia-device-plugin/vendor/golang.org/x/net/http2/hpack
nvidia-device-plugin/vendor/golang.org/x/text/unicode/bidi
nvidia-device-plugin/vendor/github.com/fsnotify/fsnotify
nvidia-device-plugin/vendor/golang.org/x/text/secure/bidirule
nvidia-device-plugin/vendor/golang.org/x/text/unicode/norm
nvidia-device-plugin/vendor/golang.org/x/net/internal/timeseries
nvidia-device-plugin/vendor/golang.org/x/net/trace
nvidia-device-plugin/vendor/google.golang.org/grpc/grpclog
nvidia-device-plugin/vendor/google.golang.org/grpc/connectivity
nvidia-device-plugin/vendor/google.golang.org/grpc/credentials
nvidia-device-plugin/vendor/google.golang.org/grpc/balancer
nvidia-device-plugin/vendor/google.golang.org/grpc/codes
nvidia-device-plugin/vendor/google.golang.org/grpc/grpclb/grpc_lb_v1/messages
nvidia-device-plugin/vendor/golang.org/x/net/idna
nvidia-device-plugin/vendor/google.golang.org/grpc/keepalive
nvidia-device-plugin/vendor/google.golang.org/grpc/metadata
nvidia-device-plugin/vendor/google.golang.org/grpc/naming
nvidia-device-plugin/vendor/google.golang.org/grpc/peer
nvidia-device-plugin/vendor/google.golang.org/grpc/stats
nvidia-device-plugin/vendor/github.com/golang/protobuf/ptypes/any
nvidia-device-plugin/vendor/github.com/golang/protobuf/ptypes/duration
nvidia-device-plugin/vendor/github.com/golang/protobuf/ptypes/timestamp
nvidia-device-plugin/vendor/google.golang.org/genproto/googleapis/rpc/status
nvidia-device-plugin/vendor/google.golang.org/grpc/tap
nvidia-device-plugin/vendor/github.com/golang/protobuf/ptypes
nvidia-device-plugin/vendor/github.com/gogo/protobuf/proto
nvidia-device-plugin/vendor/golang.org/x/net/lex/httplex
nvidia-device-plugin/vendor/google.golang.org/grpc/status
nvidia-device-plugin/vendor/golang.org/x/net/http2
nvidia-device-plugin/vendor/github.com/gogo/protobuf/protoc-gen-gogo/descriptor
nvidia-device-plugin/vendor/google.golang.org/grpc/transport
nvidia-device-plugin/vendor/github.com/gogo/protobuf/gogoproto
nvidia-device-plugin/vendor/google.golang.org/grpc
nvidia-device-plugin/vendor/k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1
nvidia-device-plugin
Removing intermediate container 47c71461a642
---> ffea00c1e115
Step 10/14 : FROM debian:stretch-slim
stretch-slim: Pulling from library/debian
466df22dd688: Pull complete
Digest: sha256:21bdee09aac385973b3568feaf91c12bac8a9852caa04067ba3707dcd68b70e6
Status: Downloaded newer image for debian:stretch-slim
---> bbb1aa3b5816
Step 11/14 : ENV NVIDIA_VISIBLE_DEVICES=all
---> Running in 4259d03a0f5c
Removing intermediate container 4259d03a0f5c
---> 9b217ff5c577
Step 12/14 : ENV NVIDIA_DRIVER_CAPABILITIES=utility
---> Running in 30fea7d6032c
Removing intermediate container 30fea7d6032c
---> b6463120ce9d
Step 13/14 : COPY --from=build /go/bin/nvidia-device-plugin /usr/bin/nvidia-device-plugin
---> f7adcc4ae781
Step 14/14 : CMD ["nvidia-device-plugin"]
---> Running in 625841bea609
Removing intermediate container 625841bea609
---> 601e2a73f23e
Successfully built 601e2a73f23e
Successfully tagged nvidia/k8s-device-plugin:1.0.0-beta
Testing the image built in the previous comment: the NVML library fails to load when applying nvidia-device-plugin.yaml, the Jetson Nano GPU is not detected, and the node capacity does not include "nvidia.com/gpu: 1".
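To check whether the node picked up a GPU resource after the daemonset is (re)applied, I use something along these lines; a minimal sketch, assuming the manifest path and node name from earlier in this report and that kubectl (or the k3s-wrapped kubectl used above) points at the cluster:

```
# Re-apply the device plugin daemonset
kubectl apply -f /tmp/nvidia-device-plugin.yaml

# The node capacity should list an nvidia.com/gpu entry if the plugin registered correctly
kubectl describe node jetsonk3sgpu01 | grep -A 8 Capacity
```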
Any idea how to resolve this?
vdups@jetsonk3sgpu01:~$ sudo ${k3skubectl} logs -n kube-system pod/nvidia-device-plugin-daemonset-4gzgr
2019/09/04 20:04:18 Loading NVML
2019/09/04 20:04:18 Failed to initialize NVML: could not load NVML library.
2019/09/04 20:04:18 If this is a GPU node, did you set the docker default runtime to nvidia?
2019/09/04 20:04:18 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2019/09/04 20:04:18 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
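Since the log asks about the default runtime, this is a quick way to verify it on the node; a hedged sketch (the exact `docker info` wording varies between Docker versions):

```
# Show the runtimes Docker knows about and which one is the default
docker info 2>/dev/null | grep -i runtime

# The daemon config should contain "default-runtime": "nvidia" (see the daemon.json example above)
grep -i default-runtime /etc/docker/daemon.json
```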
I'm hitting the exact same issue, and I wish NVIDIA would add this feature soon.
Hey, thank you for your report. We will work on it, but it is not a priority for the team currently.
Thanks for your feedback @Ethyling. Do you think it would be possible to find a workaround without NVIDIA's "internal knowledge"? (I'm not a Go developer and did not review your code.)
The first step could be to check that nvml is working on the host. You can use this to test it: https://github.com/NVIDIA/gpu-monitoring-tools/tree/master/bindings/go/samples/nvml.
Thanks!
deviceInfo fails (as do the other samples from this repo):
vdups@jetsonk3sgpu01:~/go/src/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/samples/nvml/deviceInfo$ ./deviceInfo
./deviceInfo: symbol lookup error: ./deviceInfo: undefined symbol: nvmlDeviceGetCount_v2
Header issue?
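One way to narrow this down is to check where the dynamic loader finds libnvidia-ml, and whether that library actually exports the missing symbol; a hedged sketch, using the CUDA stub path that shows up in the package listing below:

```
# Where, if anywhere, does the loader currently resolve libnvidia-ml?
ldconfig -p | grep nvidia-ml

# Does the CUDA stub export the symbol the sample fails on?
nm -D /usr/local/cuda-10.0/targets/aarch64-linux/lib/stubs/libnvidia-ml.so | grep nvmlDeviceGetCount
```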
I have an nvml Deb package on my Jetson Nano
vdups@jetsonk3sgpu01:~$ dpkg -l | grep nvml
ii  cuda-nvml-dev-10-0    10.0.326-1    arm64    NVML native dev links, headers
Including some headers
vdups@jetsonk3sgpu01:~$ sudo dpkg -L cuda-nvml-dev-10-0
[sudo] password for vdups:
/.
/usr
/usr/local
/usr/local/cuda-10.0
/usr/local/cuda-10.0/nvml
/usr/local/cuda-10.0/nvml/example
/usr/local/cuda-10.0/nvml/example/example.c
/usr/local/cuda-10.0/nvml/example/supportedVgpus.c
/usr/local/cuda-10.0/nvml/example/Makefile
/usr/local/cuda-10.0/nvml/example/README.txt
/usr/local/cuda-10.0/targets
/usr/local/cuda-10.0/targets/aarch64-linux
/usr/local/cuda-10.0/targets/aarch64-linux/lib
/usr/local/cuda-10.0/targets/aarch64-linux/lib/stubs
/usr/local/cuda-10.0/targets/aarch64-linux/lib/stubs/libnvidia-ml.so
/usr/local/cuda-10.0/targets/aarch64-linux/include
/usr/local/cuda-10.0/targets/aarch64-linux/include/nvml.h
/usr/share
/usr/share/doc
/usr/share/doc/cuda-nvml-dev-10-0
/usr/share/doc/cuda-nvml-dev-10-0/changelog.Debian.gz
/usr/share/doc/cuda-nvml-dev-10-0/copyright
/usr/local/cuda-10.0/lib64
This header file declares the nvmlDeviceGetCount_v2 symbol that the deviceInfo sample program needs.
And the header from the Deb package is identical to the header from gpu-monitoring-tools:
vdups@jetsonk3sgpu01:~$ diff ./go/src/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml/nvml.h /usr/local/cuda-10.0/targets/aarch64-linux/include/nvml.h
vdups@jetsonk3sgpu01:~$ echo $?
0
I've tried copying the sample program from the Deb package and building it, after patching the include path:
vdups@jetsonk3sgpu01:~/cuda-10.0/nvml/example$ grep include example.c
 |* include, in the user documentation and internal comments to the code, *|
#include <stdio.h>
#include "/usr/local/cuda-10.0/targets/aarch64-linux/include/nvml.h"
make raises some warnings (nvidia-smi is used in the Makefile but is unavailable on the Nano) and fails.
vdups@jetsonk3sgpu01:~$ head -n 10 /usr/local/cuda-10.0/nvml/example/Makefile
ARCH := $(shell getconf LONG_BIT)
OS := $(shell cat /etc/issue)
RHEL_OS := $(shell cat /etc/redhat-release)

# Gets Driver Branch
DRIVER_BRANCH := $(shell nvidia-smi | grep Driver | cut -f 3 -d' ' | cut -f 1 -d '.')

# Location of the CUDA Toolkit
CUDA_PATH ?= "/usr/local/cuda-8.0"
vdups@jetsonk3sgpu01:~/cuda-10.0/nvml/example$ make
cat: /etc/redhat-release: No such file or directory
/bin/sh: 1: nvidia-smi: not found
cc example.o -I ../../include -I ../include -lnvidia-ml -L /usr/lib/nvidia- -L ../lib/ -o example
/usr/bin/ld: cannot find -lnvidia-ml
collect2: error: ld returned 1 exit status
Makefile:77: recipe for target 'example' failed
make: *** [example] Error 1
Missing nvidia-ml library due to weird library paths?
The /usr/lib/nvidia- path looks unfinished (and the closest alternative contains no libraries on my Nano):
vdups@jetsonk3sgpu01:~$ tree -f /usr/lib/nvidia
/usr/lib/nvidia
├── /usr/lib/nvidia/license
│   ├── /usr/lib/nvidia/license/nvlicense
│   └── /usr/lib/nvidia/license/nvlicense-templates.sh
└── /usr/lib/nvidia/pre-install

1 directory, 3 files
The lib dir does not exist in the Deb package. There is one libnvidia-ml here: /usr/local/cuda-10.0/lib64/stubs/libnvidia-ml.so. Creating a dummy ../lib/ directory and pushing that file into it, plus hardcoding the include in supportedVgpus.c, allowed me to pass the build phase. Now I'm stuck because it is looking for a libnvidia-ml.so.1 file at runtime.
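Instead of patching the Makefile, another option is to point the compiler straight at the CUDA stub directory; a minimal sketch, assuming the stub path from the dpkg -L listing above is correct (the stub only satisfies the linker, it is not a working NVML at runtime):

```
# Compile the example against the packaged nvml.h
gcc -c example.c -I /usr/local/cuda-10.0/targets/aarch64-linux/include -o example.o

# Link against the stub libnvidia-ml (resolves -lnvidia-ml at build time only)
gcc example.o -L /usr/local/cuda-10.0/targets/aarch64-linux/lib/stubs -lnvidia-ml -o example
```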
vdups@jetsonk3sgpu01:~/cuda-10.0/nvml/example$ ./supportedVgpus
./supportedVgpus: error while loading shared libraries: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I think I'm doing something wrong when I try to register the .so file manually with ldconfig.
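For what it's worth, the usual way to let the loader find the library under the .so.1 name would be something like the sketch below; whether this is even meaningful here is doubtful, since the stub is not a functional NVML implementation on Tegra:

```
# Give the stub the libnvidia-ml.so.1 name the binary asks for (illustration only)
mkdir -p ~/nvml-stub
ln -s /usr/local/cuda-10.0/targets/aarch64-linux/lib/stubs/libnvidia-ml.so ~/nvml-stub/libnvidia-ml.so.1

# Point the loader at it for this run only
LD_LIBRARY_PATH=~/nvml-stub ./supportedVgpus
```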
Thanks for your time & help @Ethyling
Oh, as I remember now, NVML is not supported on Jetson...
We (@adaptant-labs) developed a devicetree node labeller that is capable of exposing basic GPU information as node labels on the Jetson Nano. Nowhere near the level of detail that the GPU device plugin can provide, but it may be a suitable workaround for some of you: https://github.com/adaptant-labs/k8s-dt-node-labeller
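Once the labeller has run, the labels can be inspected and used for scheduling in the usual way; a hedged sketch (the actual label keys come from the labeller's documentation, the ones below are placeholders):

```
# Show the labels the node carries after running the labeller
kubectl get node jetsonk3sgpu01 --show-labels

# List only the nodes carrying a given label (key/value are placeholders)
kubectl get nodes -l example.com/gpu-model=nano
```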
Hi Folks, I got this working for Jetson boards: https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/20
> Hi Folks, I got this working for Jetson boards: https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/20
How did you do it? Please share some details!
Hi @BokyLiu, I will publish an article about it this Wednesday, stay tuned!
@BokyLiu , Take a look: https://blogs.windriver.com/wind_river_blog/2020/06/nvidia-k8s-device-plugin-for-wind-river-linux/
> @BokyLiu , Take a look: https://blogs.windriver.com/wind_river_blog/2020/06/nvidia-k8s-device-plugin-for-wind-river-linux/
Hi @paroque28, thanks a lot, I just worked it out! And I starred your repo.
> Hi @BokyLiu, I will publish an article about it this Wednesday, stay tuned!

Hi @paroque28: does your Wind River version support only the Nano, or also the TX2, Xavier, and NX?
> @BokyLiu , Take a look: https://blogs.windriver.com/wind_river_blog/2020/06/nvidia-k8s-device-plugin-for-wind-river-linux/

Hi @paroque28: does your Wind River version support only the Nano, or also the TX2, Xavier, and NX?
> @BokyLiu , Take a look: https://blogs.windriver.com/wind_river_blog/2020/06/nvidia-k8s-device-plugin-for-wind-river-linux/
Thank you, it works for me. My machine is a Nano, kernel version 4.9.140-tegra, Docker version 19.03.6, Kubernetes version 1.18.6.
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.