
Docker image for Nvidia Jetson Nano

Open vdups opened this issue 5 years ago • 18 comments

Hi all, I have a feature request for using this plugin on Jetson Nano. I hope you could help, and I'd be happy to test !

1. Issue or feature description

I'd like to use Kubernetes on NVIDIA Jetson Nano boards. With the July 2019 release of JetPack, nvidia-docker (in beta) works fine and is set as my default docker runtime. But it seems that the nvidia/k8s-device-plugin container image is NOT distributed for the ARM64/v8 architecture (neither the version specified in nvidia-device-plugin.yaml nor any version on Docker Hub: only x86_64 & ppc64le): https://hub.docker.com/r/nvidia/k8s-device-plugin
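For reference, this is the mechanism behind the "no matching manifest" error: a multi-arch tag points at a manifest list, and the daemon picks the entry matching its own platform. The sketch below uses a trimmed, made-up manifest list with placeholder digests (on a real host you would fetch it with `docker manifest inspect`, which requires the experimental CLI):

```shell
#!/bin/sh
# Illustrative manifest list: digests are placeholders, not real image digests.
cat > /tmp/manifest-list.json <<'EOF'
{
  "manifests": [
    { "digest": "sha256:aaa...", "platform": { "architecture": "amd64",   "os": "linux" } },
    { "digest": "sha256:bbb...", "platform": { "architecture": "ppc64le", "os": "linux" } }
  ]
}
EOF

# List the architectures present in the manifest list:
grep -o '"architecture": *"[a-z0-9]*"' /tmp/manifest-list.json

# arm64 is absent, which is exactly why the pull fails on a Jetson:
if grep -q '"architecture": *"arm64"' /tmp/manifest-list.json; then
  echo "arm64 manifest present"
else
  echo "no matching manifest for linux/arm64"
fi
```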

2. Steps to reproduce the issue

On a Jetson Nano (with the docker daemon started and the current user in the docker group):

```console
vdups@jetsonk3sgpu01:~$ grep image: /tmp/nvidia-device-plugin.yaml
        - image: nvidia/k8s-device-plugin:1.0.0-beta
vdups@jetsonk3sgpu01:~$ docker pull nvidia/k8s-device-plugin:1.0.0-beta
1.0.0-beta: Pulling from nvidia/k8s-device-plugin
no matching manifest for linux/arm64/v8 in the manifest list entries
```

3. Information to attach (optional if deemed irrelevant)

Jetson Nano base image used: r322, downloaded in July 2019 from https://developer.nvidia.com/jetson-nano-sd-card-image-r322

I hope you can provide some help. I did not find similar open issues on GitHub.

Common error checking:

  • [ ] The output of `nvidia-smi -a` on your host
  • [ ] Your docker configuration file (e.g.: `/etc/docker/daemon.json`)
  • [ ] The k8s-device-plugin container logs
  • [ ] The kubelet logs on the node (e.g.: `sudo journalctl -r -u kubelet`)

Additional information that might help better understand your environment and reproduce the bug:

  • [ ] Docker version from `docker version`

  • [ ] Docker command, image and tag used

  • [x] Kernel version from `uname -a`

```console
vdups@jetsonk3sgpu01:~$ uname -a
Linux jetsonk3sgpu01 4.9.140-tegra #1 SMP PREEMPT Tue Jul 16 17:04:49 PDT 2019 aarch64 aarch64 aarch64 GNU/Linux
```

  • [ ] Any relevant kernel output lines from `dmesg`

  • [x] NVIDIA packages version from `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'`

```console
vdups@jetsonk3sgpu01:~$ dpkg -l *nvidia*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                   Version          Architecture     Description
+++-======================-================-================-=================================================
un  libgldispatch0-nvidia  <none>           <none>           (no description available)
ii  libnvidia-container-to 0.9.0~beta.1     arm64            NVIDIA container runtime library (command-line to
ii  libnvidia-container0:a 0.9.0~beta.1     arm64            NVIDIA container runtime library
un  nvidia-304             <none>           <none>           (no description available)
un  nvidia-340             <none>           <none>           (no description available)
un  nvidia-384             <none>           <none>           (no description available)
un  nvidia-common          <none>           <none>           (no description available)
ii  nvidia-container-runti 0.9.0~beta.1+doc arm64            NVIDIA container runtime
ii  nvidia-container-runti 0.9.0~beta.1-1   arm64            NVIDIA container runtime hook
un  nvidia-cuda-dev        <none>           <none>           (no description available)
un  nvidia-docker          <none>           <none>           (no description available)
rc  nvidia-docker2         0.9.0~beta.1+doc all              nvidia-docker CLI wrapper
ii  nvidia-l4t-3d-core     32.2.0-201907161 arm64            NVIDIA GL EGL Package
ii  nvidia-l4t-apt-source  32.2.0-201907161 arm64            NVIDIA L4T apt source list debian package
ii  nvidia-l4t-bootloader  32.2.0-201907161 arm64            NVIDIA Bootloader Package
ii  nvidia-l4t-camera      32.2.0-201907161 arm64            NVIDIA Camera Package
ii  nvidia-l4t-ccp-t210ref 32.2.0-201907161 arm64            NVIDIA Compatibility Checking Package
ii  nvidia-l4t-configs     32.2.0-201907161 arm64            NVIDIA configs debian package
ii  nvidia-l4t-core        32.2.0-201907161 arm64            NVIDIA Core Package
ii  nvidia-l4t-cuda        32.2.0-201907161 arm64            NVIDIA CUDA Package
ii  nvidia-l4t-firmware    32.2.0-201907161 arm64            NVIDIA Firmware Package
ii  nvidia-l4t-graphics-de 32.2.0-201907161 arm64            NVIDIA graphics demo applications
ii  nvidia-l4t-gstreamer   32.2.0-201907161 arm64            NVIDIA GST Application files
ii  nvidia-l4t-init        32.2.0-201907161 arm64            NVIDIA Init debian package
ii  nvidia-l4t-kernel      4.9.140-tegra-32 arm64            NVIDIA Kernel Package
ii  nvidia-l4t-kernel-dtbs 4.9.140-tegra-32 arm64            NVIDIA Kernel DTB Package
ii  nvidia-l4t-kernel-head 4.9.140-tegra-32 arm64            NVIDIA Linux Tegra Kernel Headers Package
ii  nvidia-l4t-multimedia  32.2.0-201907161 arm64            NVIDIA Multimedia Package
ii  nvidia-l4t-multimedia- 32.2.0-201907161 arm64            NVIDIA Multimedia Package
ii  nvidia-l4t-oem-config  32.2.0-201907161 arm64            NVIDIA OEM-Config Package
ii  nvidia-l4t-tools       32.2.0-201907161 arm64            NVIDIA Public Test Tools Package
ii  nvidia-l4t-wayland     32.2.0-201907161 arm64            NVIDIA Wayland Package
ii  nvidia-l4t-weston      32.2.0-201907161 arm64            NVIDIA Weston Package
ii  nvidia-l4t-x11         32.2.0-201907161 arm64            NVIDIA X11 Package
ii  nvidia-l4t-xusb-firmwa 32.2.0-201907161 arm64            NVIDIA USB Firmware Package
un  nvidia-libopencl1-dev  <none>           <none>           (no description available)
un  nvidia-prime           <none>           <none>           (no description available)
vdups@jetsonk3sgpu01:~$
```

  • [ ] NVIDIA container library version from `nvidia-container-cli -V`

  • [ ] NVIDIA container library logs (see troubleshooting)

vdups avatar Sep 04 '19 19:09 vdups

Creating a Dockerfile with arm64 golang dependencies passes the docker build phase...

```console
vdups@jetsonk3sgpu01:~$ git clone https://github.com/NVIDIA/k8s-device-plugin.git && cd k8s-device-plugin
Cloning into 'k8s-device-plugin'...
remote: Enumerating objects: 1272, done.
remote: Total 1272 (delta 0), reused 0 (delta 0), pack-reused 1272
Receiving objects: 100% (1272/1272), 2.11 MiB | 1.59 MiB/s, done.
Resolving deltas: 100% (465/465), done.
vdups@jetsonk3sgpu01:~/k8s-device-plugin$ git checkout 1.0.0-beta
Note: checking out '1.0.0-beta'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>
```

```console
HEAD is now at 9fa83cd Reversion the project
vdups@jetsonk3sgpu01:~/k8s-device-plugin$ ls -l Dockerfile
lrwxrwxrwx 1 vdups vdups 35 Sep  4 21:52 Dockerfile -> docker/ubuntu16.04/amd64/Dockerfile
vdups@jetsonk3sgpu01:~/k8s-device-plugin$ cp -pr docker/ubuntu16.04/{amd64,arm64}
vdups@jetsonk3sgpu01:~/k8s-device-plugin$ sed -i 's;amd64;arm64;g' docker/ubuntu16.04/arm64/Dockerfile
vdups@jetsonk3sgpu01:~/k8s-device-plugin$ vim !$
vdups@jetsonk3sgpu01:~/k8s-device-plugin$ rm Dockerfile
vdups@jetsonk3sgpu01:~/k8s-device-plugin$ ln -s docker/ubuntu16.04/arm64/Dockerfile Dockerfile
vdups@jetsonk3sgpu01:~/k8s-device-plugin$ docker build -t nvidia/k8s-device-plugin:1.0.0-beta .
Sending build context to Docker daemon  14.27MB
Step 1/14 : FROM ubuntu:16.04 as build
 ---> b63658c0b8e9
Step 2/14 : RUN apt-get update && apt-get install -y --no-install-recommends g++ ca-certificates wget && rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> 0c6b75184c36
Step 3/14 : ENV GOLANG_VERSION 1.10.3
 ---> Using cache
 ---> 2960d5abc56c
Step 4/14 : RUN wget -nv -O - https://storage.googleapis.com/golang/go${GOLANG_VERSION}.linux-arm64.tar.gz | tar -C /usr/local -xz
 ---> Running in ad5690dc041a
2019-09-04 19:55:49 URL:https://storage.googleapis.com/golang/go1.10.3.linux-arm64.tar.gz [115054972/115054972] -> "-" [1]
Removing intermediate container ad5690dc041a
 ---> 28db46d7b229
Step 5/14 : ENV GOPATH /go
 ---> Running in 9930b131f814
Removing intermediate container 9930b131f814
 ---> 7e4217da1590
Step 6/14 : ENV PATH $GOPATH/bin:/usr/local/go/bin:$PATH
 ---> Running in 14103ad5544d
Removing intermediate container 14103ad5544d
 ---> 9570d3b2caa8
Step 7/14 : WORKDIR /go/src/nvidia-device-plugin
 ---> Running in 854d4b9a30c8
Removing intermediate container 854d4b9a30c8
 ---> a19ee8056aef
Step 8/14 : COPY . .
 ---> 924dadf0cd6c
Step 9/14 : RUN export CGO_LDFLAGS_ALLOW='-Wl,--unresolved-symbols=ignore-in-object-files' && go install -ldflags="-s -w" -v nvidia-device-plugin
 ---> Running in 47c71461a642
nvidia-device-plugin/vendor/google.golang.org/grpc/resolver
nvidia-device-plugin/vendor/google.golang.org/grpc/internal
nvidia-device-plugin/vendor/golang.org/x/text/transform
nvidia-device-plugin/vendor/github.com/gogo/protobuf/sortkeys
nvidia-device-plugin/vendor/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml
nvidia-device-plugin/vendor/golang.org/x/sys/unix
nvidia-device-plugin/vendor/golang.org/x/net/context
nvidia-device-plugin/vendor/github.com/golang/protobuf/proto
nvidia-device-plugin/vendor/golang.org/x/net/http2/hpack
nvidia-device-plugin/vendor/golang.org/x/text/unicode/bidi
nvidia-device-plugin/vendor/github.com/fsnotify/fsnotify
nvidia-device-plugin/vendor/golang.org/x/text/secure/bidirule
nvidia-device-plugin/vendor/golang.org/x/text/unicode/norm
nvidia-device-plugin/vendor/golang.org/x/net/internal/timeseries
nvidia-device-plugin/vendor/golang.org/x/net/trace
nvidia-device-plugin/vendor/google.golang.org/grpc/grpclog
nvidia-device-plugin/vendor/google.golang.org/grpc/connectivity
nvidia-device-plugin/vendor/google.golang.org/grpc/credentials
nvidia-device-plugin/vendor/google.golang.org/grpc/balancer
nvidia-device-plugin/vendor/google.golang.org/grpc/codes
nvidia-device-plugin/vendor/google.golang.org/grpc/grpclb/grpc_lb_v1/messages
nvidia-device-plugin/vendor/golang.org/x/net/idna
nvidia-device-plugin/vendor/google.golang.org/grpc/keepalive
nvidia-device-plugin/vendor/google.golang.org/grpc/metadata
nvidia-device-plugin/vendor/google.golang.org/grpc/naming
nvidia-device-plugin/vendor/google.golang.org/grpc/peer
nvidia-device-plugin/vendor/google.golang.org/grpc/stats
nvidia-device-plugin/vendor/github.com/golang/protobuf/ptypes/any
nvidia-device-plugin/vendor/github.com/golang/protobuf/ptypes/duration
nvidia-device-plugin/vendor/github.com/golang/protobuf/ptypes/timestamp
nvidia-device-plugin/vendor/google.golang.org/genproto/googleapis/rpc/status
nvidia-device-plugin/vendor/google.golang.org/grpc/tap
nvidia-device-plugin/vendor/github.com/golang/protobuf/ptypes
nvidia-device-plugin/vendor/github.com/gogo/protobuf/proto
nvidia-device-plugin/vendor/golang.org/x/net/lex/httplex
nvidia-device-plugin/vendor/google.golang.org/grpc/status
nvidia-device-plugin/vendor/golang.org/x/net/http2
nvidia-device-plugin/vendor/github.com/gogo/protobuf/protoc-gen-gogo/descriptor
nvidia-device-plugin/vendor/google.golang.org/grpc/transport
nvidia-device-plugin/vendor/github.com/gogo/protobuf/gogoproto
nvidia-device-plugin/vendor/google.golang.org/grpc
nvidia-device-plugin/vendor/k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1
nvidia-device-plugin
Removing intermediate container 47c71461a642
 ---> ffea00c1e115
Step 10/14 : FROM debian:stretch-slim
stretch-slim: Pulling from library/debian
466df22dd688: Pull complete
Digest: sha256:21bdee09aac385973b3568feaf91c12bac8a9852caa04067ba3707dcd68b70e6
Status: Downloaded newer image for debian:stretch-slim
 ---> bbb1aa3b5816
Step 11/14 : ENV NVIDIA_VISIBLE_DEVICES=all
 ---> Running in 4259d03a0f5c
Removing intermediate container 4259d03a0f5c
 ---> 9b217ff5c577
Step 12/14 : ENV NVIDIA_DRIVER_CAPABILITIES=utility
 ---> Running in 30fea7d6032c
Removing intermediate container 30fea7d6032c
 ---> b6463120ce9d
Step 13/14 : COPY --from=build /go/bin/nvidia-device-plugin /usr/bin/nvidia-device-plugin
 ---> f7adcc4ae781
Step 14/14 : CMD ["nvidia-device-plugin"]
 ---> Running in 625841bea609
Removing intermediate container 625841bea609
 ---> 601e2a73f23e
Successfully built 601e2a73f23e
Successfully tagged nvidia/k8s-device-plugin:1.0.0-beta
```
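The port above boils down to copying the amd64 Dockerfile and rewriting the architecture string everywhere, including the Go tarball name. A minimal reproduction of that step on scratch files (the `/tmp/dp` paths and the two-line Dockerfile are illustrative; in the real repo the files live under `docker/ubuntu16.04/`):

```shell
#!/bin/sh
set -e
# Stand-in for the repo's amd64 Dockerfile, reduced to the two lines that differ.
mkdir -p /tmp/dp/docker/ubuntu16.04/amd64
cat > /tmp/dp/docker/ubuntu16.04/amd64/Dockerfile <<'EOF'
FROM ubuntu:16.04 as build
RUN wget -nv -O - https://storage.googleapis.com/golang/go${GOLANG_VERSION}.linux-amd64.tar.gz | tar -C /usr/local -xz
EOF

cd /tmp/dp
cp -pr docker/ubuntu16.04/amd64 docker/ubuntu16.04/arm64
# Rewrite every architecture reference, the Go tarball name included:
sed -i 's;amd64;arm64;g' docker/ubuntu16.04/arm64/Dockerfile
# Point the top-level Dockerfile symlink at the arm64 variant:
ln -sf docker/ubuntu16.04/arm64/Dockerfile Dockerfile
grep linux-arm64 docker/ubuntu16.04/arm64/Dockerfile
```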

vdups avatar Sep 04 '19 20:09 vdups

Testing the image built in the previous comment: loading the NVML library fails when applying nvidia-device-plugin.yaml, the Jetson Nano GPU is not seen, and the node capacity does not include "nvidia.com/gpu: 1".

Any idea how to resolve this ?

```console
vdups@jetsonk3sgpu01:~$ sudo ${k3skubectl} logs -n kube-system pod/nvidia-device-plugin-daemonset-4gzgr
2019/09/04 20:04:18 Loading NVML
2019/09/04 20:04:18 Failed to initialize NVML: could not load NVML library.
2019/09/04 20:04:18 If this is a GPU node, did you set the docker default runtime to nvidia?
2019/09/04 20:04:18 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2019/09/04 20:04:18 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
```
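For completeness, the "default runtime" the log asks about is configured in `/etc/docker/daemon.json`; the shape documented in the plugin's prerequisites looks like this (note it was already set as the default here, so this alone does not fix the Jetson case):

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

After editing the file, the docker daemon must be restarted (e.g. `sudo systemctl restart docker`) for the change to take effect.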

vdups avatar Sep 04 '19 20:09 vdups

I met the exact same issue. And wish Nvidia would add this feature soon.

kun-qian avatar Sep 05 '19 09:09 kun-qian

Hey, thank you for your report. We will work on it, but it is not a priority for the team currently.

jjacobelli avatar Sep 05 '19 16:09 jjacobelli

Thanks for your feedback @Ethyling. Do you think it would be possible to find a workaround without NVIDIA's "internal knowledge"? (I'm not a Go developer and did not review your code.)

vdups avatar Sep 05 '19 17:09 vdups

The first step could be to check that nvml is working on the host. You can use this to test it: https://github.com/NVIDIA/gpu-monitoring-tools/tree/master/bindings/go/samples/nvml.

jjacobelli avatar Sep 05 '19 18:09 jjacobelli

Thanks!

deviceInfo fails (as do the other samples from this repo):

```console
vdups@jetsonk3sgpu01:~/go/src/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/samples/nvml/deviceInfo$ ./deviceInfo
./deviceInfo: symbol lookup error: ./deviceInfo: undefined symbol: nvmlDeviceGetCount_v2
```

A header issue? I have an NVML Deb package on my Jetson Nano:

```console
vdups@jetsonk3sgpu01:~$ dpkg -l | grep nvml
ii  cuda-nvml-dev-10-0  10.0.326-1  arm64  NVML native dev links, headers
```

Including some headers:

```console
vdups@jetsonk3sgpu01:~$ sudo dpkg -L cuda-nvml-dev-10-0
[sudo] password for vdups:
/.
/usr
/usr/local
/usr/local/cuda-10.0
/usr/local/cuda-10.0/nvml
/usr/local/cuda-10.0/nvml/example
/usr/local/cuda-10.0/nvml/example/example.c
/usr/local/cuda-10.0/nvml/example/supportedVgpus.c
/usr/local/cuda-10.0/nvml/example/Makefile
/usr/local/cuda-10.0/nvml/example/README.txt
/usr/local/cuda-10.0/targets
/usr/local/cuda-10.0/targets/aarch64-linux
/usr/local/cuda-10.0/targets/aarch64-linux/lib
/usr/local/cuda-10.0/targets/aarch64-linux/lib/stubs
/usr/local/cuda-10.0/targets/aarch64-linux/lib/stubs/libnvidia-ml.so
/usr/local/cuda-10.0/targets/aarch64-linux/include
/usr/local/cuda-10.0/targets/aarch64-linux/include/nvml.h
/usr/share
/usr/share/doc
/usr/share/doc/cuda-nvml-dev-10-0
/usr/share/doc/cuda-nvml-dev-10-0/changelog.Debian.gz
/usr/share/doc/cuda-nvml-dev-10-0/copyright
/usr/local/cuda-10.0/lib64
```

This header file has the nvmlDeviceGetCount_v2 symbol that the deviceInfo sample program needs, and the header from the Deb package is the same file as the header from gpu-monitoring-tools (diff exits 0, so the files are identical):

```console
vdups@jetsonk3sgpu01:~$ diff ./go/src/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml/nvml.h /usr/local/cuda-10.0/targets/aarch64-linux/include/nvml.h
vdups@jetsonk3sgpu01:~$ echo $?
0
```

I've tried copying the sample program from the Deb package to build it, and patched the include directive:

```console
vdups@jetsonk3sgpu01:~/cuda-10.0/nvml/example$ grep include example.c
 * include, in the user documentation and internal comments to the code,
#include <stdio.h>
#include "/usr/local/cuda-10.0/targets/aarch64-linux/include/nvml.h"
```

make raises some warnings (nvidia-smi is used in the Makefile but unavailable on the Nano) and then fails:

```console
vdups@jetsonk3sgpu01:~$ head -n 10 /usr/local/cuda-10.0/nvml/example/Makefile
ARCH := $(shell getconf LONG_BIT)
OS := $(shell cat /etc/issue)
RHEL_OS := $(shell cat /etc/redhat-release)

# Gets Driver Branch
DRIVER_BRANCH := $(shell nvidia-smi | grep Driver | cut -f 3 -d' ' | cut -f 1 -d '.')

# Location of the CUDA Toolkit
CUDA_PATH ?= "/usr/local/cuda-8.0"

vdups@jetsonk3sgpu01:~/cuda-10.0/nvml/example$ make
cat: /etc/redhat-release: No such file or directory
/bin/sh: 1: nvidia-smi: not found
cc example.o -I ../../include -I ../include -lnvidia-ml -L /usr/lib/nvidia- -L ../lib/ -o example
/usr/bin/ld: cannot find -lnvidia-ml
collect2: error: ld returned 1 exit status
Makefile:77: recipe for target 'example' failed
make: *** [example] Error 1
```
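The broken `-L /usr/lib/nvidia-` flag follows directly from the Makefile head quoted above: DRIVER_BRANCH is derived from nvidia-smi output, and nvidia-smi does not exist on the Nano, so the variable expands to an empty string. A small shell sketch of that expansion (a reimplementation of the Makefile logic for illustration, guarded so it also runs on machines without nvidia-smi):

```shell
#!/bin/sh
# Reproduce the Makefile's DRIVER_BRANCH computation. On a machine without
# nvidia-smi the pipeline never runs and the variable stays empty.
if command -v nvidia-smi >/dev/null 2>&1; then
  DRIVER_BRANCH=$(nvidia-smi | grep Driver | cut -f 3 -d' ' | cut -f 1 -d '.')
else
  DRIVER_BRANCH=""
fi

# The Makefile hands this to the linker. With an empty DRIVER_BRANCH the
# flag degenerates to "-L /usr/lib/nvidia-", a directory that does not
# exist, so ld cannot locate libnvidia-ml.so and reports
# "cannot find -lnvidia-ml".
echo "-L /usr/lib/nvidia-${DRIVER_BRANCH}"
```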

Is the nvidia-ml library missing because of these odd library paths? The /usr/lib/nvidia- path looks unfinished (and the closest alternative is empty on my Nano):

```console
vdups@jetsonk3sgpu01:~$ tree -f /usr/lib/nvidia
/usr/lib/nvidia
├── /usr/lib/nvidia/license
│   ├── /usr/lib/nvidia/license/nvlicense
│   └── /usr/lib/nvidia/license/nvlicense-templates.sh
└── /usr/lib/nvidia/pre-install

1 directory, 3 files
```

The lib dir does not exist in the Deb package. There is one libnvidia-ml here: /usr/local/cuda-10.0/lib64/stubs/libnvidia-ml.so. Creating a dummy ../lib/ directory and copying the file there, plus hardcoding the include in supportedVgpus.c, allowed me to pass the build phase. Now I'm stuck because it's looking for a libnvidia-ml.so.1 file at runtime:

```console
vdups@jetsonk3sgpu01:~/cuda-10.0/nvml/example$ ./supportedVgpus
./supportedVgpus: error while loading shared libraries: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
```

I think I'm doing something wrong when I try to register the .so file manually with ldconfig.
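For what it's worth, the usual shape of that fix: the runtime linker looks for the library's SONAME (libnvidia-ml.so.1), while the CUDA stub ships only as libnvidia-ml.so, so a .so.1 symlink is needed. The sketch below demonstrates this on a scratch directory with a placeholder file (on the Nano the source would be the stub, e.g. /usr/local/cuda-10.0/targets/aarch64-linux/lib/stubs/libnvidia-ml.so). Caveat: the stub is meant for link time only; even when it loads, it cannot drive the Jetson GPU.

```shell
#!/bin/sh
set -e
# Scratch directory with a placeholder standing in for the CUDA stub library.
libdir=/tmp/fake-nvml-lib
mkdir -p "$libdir"
: > "$libdir/libnvidia-ml.so"

# Give the runtime linker the SONAME it asks for:
ln -sf libnvidia-ml.so "$libdir/libnvidia-ml.so.1"

# Then either export LD_LIBRARY_PATH="$libdir", or register the directory:
#   echo "$libdir" | sudo tee /etc/ld.so.conf.d/nvml-stub.conf && sudo ldconfig
ls -l "$libdir"
```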

Thanks for your time & help @Ethyling

vdups avatar Sep 05 '19 20:09 vdups

Oh, as I remember now, NVML is not supported on Jetson...

jjacobelli avatar Sep 06 '19 23:09 jjacobelli

We (@adaptant-labs) developed a devicetree node labeller that is capable of exposing basic GPU information as node labels on the Jetson Nano. Nowhere near the level of detail that the GPU device plugin can provide, but it may be a suitable workaround for some of you: https://github.com/adaptant-labs/k8s-dt-node-labeller

pmundt avatar Apr 01 '20 12:04 pmundt

Hi Folks, I got this working for Jetson boards: https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/20

paroque28 avatar Apr 30 '20 01:04 paroque28

> Hi Folks, I got this working for Jetson boards: https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/20

How did you do it? Please share some details!

BokyLiu avatar May 30 '20 08:05 BokyLiu

Hi @BokyLiu, I will publish an article about it this Wednesday, stay tuned!

paroque28 avatar May 30 '20 21:05 paroque28

@BokyLiu , Take a look: https://blogs.windriver.com/wind_river_blog/2020/06/nvidia-k8s-device-plugin-for-wind-river-linux/

paroque28 avatar Jun 02 '20 15:06 paroque28

> @BokyLiu , Take a look: https://blogs.windriver.com/wind_river_blog/2020/06/nvidia-k8s-device-plugin-for-wind-river-linux/

Hi @paroque28, thanks a lot, I just worked it out! And I starred your repo.

BokyLiu avatar Jun 03 '20 09:06 BokyLiu

> Hi @BokyLiu, I will publish an article about it this Wednesday, stay tuned!

Hi paroque28: does your Wind River version support boards other than the Nano? How about the TX2 / Xavier / NX?

beyondli avatar Aug 02 '20 03:08 beyondli

> @BokyLiu , Take a look: https://blogs.windriver.com/wind_river_blog/2020/06/nvidia-k8s-device-plugin-for-wind-river-linux/

Hi paroque28: does your Wind River version support boards other than the Nano? How about the TX2 / Xavier / NX?

beyondli avatar Aug 03 '20 01:08 beyondli

> @BokyLiu , Take a look: https://blogs.windriver.com/wind_river_blog/2020/06/nvidia-k8s-device-plugin-for-wind-river-linux/

Thank you, it works for me. My machine is a Nano: kernel version 4.9.140-tegra, Docker version 19.03.6, Kubernetes version 1.18.6.

lxyustc08 avatar Aug 03 '20 02:08 lxyustc08

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Apr 25 '24 04:04 github-actions[bot]

This issue was automatically closed due to inactivity.

github-actions[bot] avatar May 25 '24 04:05 github-actions[bot]