balena-jetson
docker devicerequests/nvidia support
We want to support exposing GPU resources to user containers via the new DeviceRequests API introduced in Docker 19.03.x.
To enable this we need the NVIDIA driver, the userland driver-support libraries, and libnvidia-container in the host OS.
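The end goal, roughly, is that a user can request the GPU with the standard flag; a minimal sketch (the image here is just an example used later in this thread):
balena run --rm -it --gpus all nvcr.io/nvidia/l4t-base:r32.3.1 bash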
~~WIP branch: https://github.com/balena-os/balena-jetson/tree/rgz/cuda_libs_test~~ Depends-on: https://github.com/balena-os/meta-balena/pull/1824 [merged :confetti_ball:]
arch call notes (internal): https://docs.google.com/document/d/1tFaDKyTsdi1TUfxfAjAAGJCfUVCwmPxIstdrYaOJ-I0
arch call item (internal): https://app.frontapp.com/open/cnv_5bqfytf
@acostach I think we will need: https://github.com/madisongh/meta-tegra/blob/master/recipes-devtools/cuda/cuda-driver_10.0.326-1.bb
which would give us the driver libs, right?
And we will also need https://github.com/madisongh/meta-tegra/tree/master/recipes-containers/libnvidia-container-tools for the Docker integration.
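For a quick test build, something along these lines in local.conf might be enough to pull both in (untested sketch; the exact package names may differ from the recipe names):
echo 'IMAGE_INSTALL_append = " cuda-driver libnvidia-container-tools"' >> build/conf/local.conf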
Looks like they might, @robertgzr. I need to check with a Yocto build that includes these two packages. It will take a bit, because the CUDA packages first need to be downloaded locally with the NVIDIA SDK Manager; they and their dependencies can't be pulled by Yocto automatically. I'll get back to you.
I think we will still run out of space. I already have trouble getting the new balena-engine binary onto some devices because of the binary's size increase, and the CUDA stuff is going to add at least another 15 MB or so.
@robertgzr I built an image with those, and it's available on the dev device 3d612ed56aaa2ba22cf73ba7a2021cb7 if you want to test with the patched engine.
A couple of notes:
- libcuda appears to come from tegra-libraries, which is a package with ~130 MB worth of NVIDIA libraries (libnv*, libnvidia*). Not sure whether only some of them or all are tied together; for instance, cuda-driver adds a dependency on tegra-libraries. But if you get it to work, we can probably try removing them one by one and see if anything breaks.
- I increased the rootfs size to allow plenty of space for testing with these packages and the new engine.
@acostach do you have a branch here I can use? I would like to pull in the engine via balena-os/meta-balena#1824 rather than copying the binary around...
@acostach I'm trying to figure out why it won't work out of the box...
root@3d612ed:~# nvidia-container-cli info
NVRM version: (null)
CUDA version: 10.0
Device Index: 0
Device Minor: 0
Model: NVIDIA Tegra X2
Brand: (null)
GPU UUID: (null)
Bus Location: (null)
Architecture: 6.2
root@3d612ed:~# balena run --rm -it --gpus all nvcr.io/nvidia/l4t-base:r32.3.1 bash
balena: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
I feel like the info command should return a UUID (unfortunately I didn't get to test this on the playground box last week).
Also, nvidia-container-cli can be used to ask which components of the driver are required:
root@3d612ed:~# nvidia-container-cli list
/usr/lib/libcuda.so.1.1
/usr/lib/libnvidia-ptxjitcompiler.so.32.3.1
/usr/lib/libnvidia-fatbinaryloader.so.32.3.1
/usr/lib/libnvidia-eglcore.so.32.3.1
/usr/lib/libnvidia-glcore.so.32.3.1
/usr/lib/libnvidia-tls.so.32.3.1
/usr/lib/libnvidia-glsi.so.32.3.1
/usr/lib/libGLX_nvidia.so.0
/usr/lib/libEGL_nvidia.so.0
/usr/lib/libGLESv2_nvidia.so.2
/usr/lib/libGLESv1_CM_nvidia.so.1
Where can I see which version of the driver is installed on the TX2?
@robertgzr it's the 32.3.1 driver from l4t 32.3.1, if that's what you are referring to.
root@3d612ed:~# modinfo /lib/modules/4.9.140-l4t-r32.3.1/kernel/drivers/gpu/nvgpu/nvgpu.ko
filename: /lib/modules/4.9.140-l4t-r32.3.1/kernel/drivers/gpu/nvgpu/nvgpu.ko
alias: of:NTCnvidia,gv11bC*
alias: of:NTCnvidia,gv11b
alias: of:NTCnvidia,tegra186-gp10bC*
alias: of:NTCnvidia,tegra186-gp10b
alias: of:NTCnvidia,tegra210-gm20bC*
alias: of:NTCnvidia,tegra210-gm20b
depends:
intree: Y
vermagic: 4.9.140-l4t-r32.3.1 SMP preempt mod_unload modversions aarch64
Wondering if it returns nulls because the GPU isn't initialized: the firmware blobs that are usually extracted into the container weren't loaded by the driver, as they aren't in the hostOS. I'm referring to (BSP archive) Tegra186_Linux_R32.3.1.tbz2/Linux_for_tegra/nv_tegra/nvidia_drivers.tbz2/lib/firmware/tegra18x, gp10b. Not sure this is the issue, but can you try to initialize it first, maybe from a container (then shut the container down but leave the driver loaded), or unpack nvidia_drivers directly in the hostOS?
You can follow https://github.com/balena-io-playground/tx2-container-contracts-sample/blob/16d3ad09f0615956389f04105e3b533be9620388/tx2_32_2/Dockerfile.template#L7 but use the 32.3.1 BSP archive for the TX2 from here: https://developer.nvidia.com/embedded/linux-tegra
I haven't had time yet to look into or release a 32.3.1-based balenaOS for the TX2, but if you are having issues with unpacking the BSP archive in the container, here's how it works for the Nano on 32.3.1: https://github.com/acostach/jetson-nano-container-contracts/blob/51e9bfa97a91692c6b806ed32c9e96e656f5b088/nano_32_3_1/Dockerfile.template#L7
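If it helps, here's a rough, untested sketch of what "unpack nvidia_drivers directly in the hostOS" could look like on a dev device (archive layout assumed from the BSP paths above):
mount -o remount,rw /   # dev image only; the rootfs is normally read-only
tar xjf Tegra186_Linux_R32.3.1.tbz2 Linux_for_Tegra/nv_tegra/nvidia_drivers.tbz2
tar xjf Linux_for_Tegra/nv_tegra/nvidia_drivers.tbz2 -C / lib/firmware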
I think we're fine in the driver department. It looks like Docker only loads its internal compat layer for the nvidia stuff if nvidia-container-runtime-hook is present on the hostOS.
I'm going to see if I can find where this is supposed to come from, but I think it's usually installed as part of the libnvidia-container package.
@acostach ok so we need this: https://github.com/NVIDIA/nvidia-container-runtime/tree/v3.1.4/toolkit/nvidia-container-toolkit
which is provided by https://github.com/madisongh/meta-tegra/tree/master/recipes-containers/nvidia-container-toolkit
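Once that package is in the image, a quick sanity check would be something like:
command -v nvidia-container-runtime-hook   # the engine only enables its nvidia code path if the hook is in PATH
balena run --rm -it --gpus all nvcr.io/nvidia/l4t-base:r32.3.1 bash   # retry the earlier failing command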
@robertgzr thanks, I've updated https://github.com/balena-os/balena-jetson/commits/cuda_libs_test with this package, let me know if it works with it
@acostach any idea why the runtime-hook is complaining about missing libraries:
root@3d612ed:~# nvidia-container-runtime-hook
nvidia-container-runtime-hook: error while loading shared libraries: libstd.so: cannot open shared object file: No such file or directory
Isn't libstd a Rust thing?
Not sure, @robertgzr. I see this libstd is provided by Rust in the rootfs, but perhaps the hook binary comes pre-compiled and was built against a different version of the library?
root@3d612ed:~# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/rust/
root@3d612ed:~# nvidia-container-runtime-hook
nvidia-container-runtime-hook: symbol lookup error: nvidia-container-runtime-hook: undefined symbol: main
The thing is, there is no Rust dependency, and the hook binary should be built from source by https://github.com/madisongh/meta-tegra/blob/master/recipes-containers/nvidia-container-toolkit/nvidia-container-toolkit_1.0.5.bb from https://github.com/NVIDIA/container-toolkit/tree/60f165ad6901f85b0c3acbf7ce2c66cd759c4fb8/nvidia-container-toolkit no?
Something is wrong here... but I don't understand what.
@robertgzr It doesn't look like a Rust dependency, unless I'm mistaken somewhere. And that's right, the hook binary is built from sources, but they are Go sources.
So it appears there are two libstds: one from Rust, as you said, which isn't the one we want, and another one from Go. The Go version that we currently have in the image comes from meta-balena and is at version 1.10.
I think the hook was built against some newer Go 'headers', although I'm not familiar with the Go workflow or build process.
root@3d612ed:~# export LD_LIBRARY_PATH=/home/root/ # this is where I copied libstd.so provided by go on the shared board
root@3d612ed:~# nvidia-container-runtime-hook
nvidia-container-runtime-hook: symbol lookup error: nvidia-container-runtime-hook: undefined symbol: runtime.arm64HasATOMICS
Looking at this: https://github.com/golang/go/blob/a1550d3ca3a6a90b8bbb610950d1b30649411243/src/cmd/internal/goobj2/builtinlist.go#L185 I see the symbol 'runtime.arm64HasATOMICS' is present starting from Go ~1.14. So I manually updated to Go 1.14, updated the poky class to zeus, rebuilt go and nvidia-container-toolkit, then uploaded the nvidia-container-toolkit and libstd.so binaries to the shared board, and it appears to work:
root@3d612ed:~# export LD_LIBRARY_PATH=/home/root/
root@3d612ed:~# nvidia-container-runtime-hook
Usage of nvidia-container-runtime-hook:
-config string
configuration file
-debug
enable debug output
Please try to run it again now on the shared device and check if this unblocks you.
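For reference, a couple of commands that should show the mismatch directly (assuming binutils is available on the board; paths taken from above):
readelf -d "$(command -v nvidia-container-runtime-hook)" | grep NEEDED   # shows that the hook wants a shared libstd.so
nm -D /home/root/libstd.so | grep arm64HasATOMICS   # checks whether the shared Go runtime exports the missing symbol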
@acostach oh ok, that makes more sense now... Sounds to me like something is still up with our Go integration in meta-balena. If you check my PR here: https://github.com/balena-os/meta-balena/pull/1824
This provides the nvidia-enabled balena-engine; part of it is a bump to Go 1.12.12.
Sounds like the build of nvidia-container-toolkit uses a different Go toolchain than that one? How is that possible? I thought we could enforce it via the GOVERSION env var from meta-balena.
Looking at this: golang/go:src/cmd/internal/goobj2/builtinlist.go@a1550d3#L185 I see the symbol 'runtime.arm64HasATOMICS' is present starting from go version ~1.14
The toolkit recipe shouldn't have a dependency on any particular version of Go, btw. If it gets built with 1.12.12 it should just work.
I have actually never encountered something like this. I didn't even know the Go stdlib could be loaded as a shared library.
I looked at the documentation a little bit:
- Yocto compiles the Go runtime (which includes the stdlib) as a shared library: http://git.yoctoproject.org/cgit/cgit.cgi/poky/tree/meta/recipes-devtools/go/go-runtime.inc?h=zeus#n40
- the go.bbclass in poky has a switch, GO_DYNLINK, to link a recipe against that shared lib: http://git.yoctoproject.org/cgit/cgit.cgi/poky/tree/meta/classes/go.bbclass?h=zeus#n35
- that is enabled for supported platforms by default, I think: http://git.yoctoproject.org/cgit/cgit.cgi/poky/tree/meta/classes/goarch.bbclass?h=zeus#n26
- https://golang.org/cmd/go/#hdr-Compile_packages_and_dependencies (ctrl-f "linkshared")
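So one way out might be to turn off dynamic linking of the Go runtime for just this recipe via a bbappend (untested sketch; the recipe path is an assumption, and goarch.bbclass sets the variable per-arch, hence the extra override):
cat > recipes-containers/nvidia-container-toolkit/nvidia-container-toolkit_%.bbappend <<'EOF'
# build the hook with the Go runtime linked statically instead of against libstd.so
GO_DYNLINK = ""
GO_DYNLINK_aarch64 = ""
EOF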
I manually compiled it without those flags and now it works:
root@3d612ed:~# balena run --gpus all -it balenalib/jetson-tx2-ubuntu:bionic-run bash
root@71f92d3822b7:/#
Hi @robertgzr, @acostach. I'm happy to see progress on the subject. We have been asking for this feature for a long time. Will this be supported in BalenaOS anytime soon?
Hi @dremsol, it's currently something we're considering and investigating; we don't have a timeline, as final conclusions haven't been reached yet.
Hi @robertgzr & @acostach,
I've taken a deeper look at this issue and I would like to share our experiences. I also have a couple of questions which I hope you can answer. First of all, our custom OS indeed shows similar output:
root@photon-nano:~# nvidia-container-cli info
NVRM version: (null)
CUDA version: 10.0
Device Index: 0
Device Minor: 0
Model: NVIDIA Tegra X1
Brand: (null)
GPU UUID: (null)
Bus Location: (null)
Architecture: 5.3
root@photon-nano:~# nvidia-container-cli list
/usr/lib/libcuda.so.1.1
/usr/lib/libnvidia-ptxjitcompiler.so.32.3.1
/usr/lib/libnvidia-fatbinaryloader.so.32.3.1
/usr/lib/libnvidia-eglcore.so.32.3.1
/usr/lib/libnvidia-glcore.so.32.3.1
/usr/lib/libnvidia-tls.so.32.3.1
/usr/lib/libnvidia-glsi.so.32.3.1
/usr/lib/libGLX_nvidia.so.0
/usr/lib/libEGL_nvidia.so.0
/usr/lib/libGLESv2_nvidia.so.2
/usr/lib/libGLESv1_CM_nvidia.so.1
This allows running the CUDA samples by pulling nvcr.io/nvidia/l4t-base:r32.4.2 just fine, under the assumption that the CUDA libs are installed in the hostOS. So far so good, and probably the goal you want to achieve in this issue.
The first thing I would like to point out is that mounting a CSI camera into the Docker container requires a daemon running in the hostOS (tegra-argus-daemon). The additional argument to add to the run command for accessing a CSI camera from within the container is -v /tmp/argus_socket:/tmp/argus_socket. For a USB camera, the additional argument is --device /dev/video0:/dev/video0.
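Putting those together, a run command for a camera-using app would look something like this (image name and device path are just examples):
balena run --rm -it --gpus all \
  -v /tmp/argus_socket:/tmp/argus_socket \
  --device /dev/video0:/dev/video0 \
  my-app-image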
Now we are building our application using the deepstream-l4t container, and this is where it gets interesting, as the required hostOS packages become application-dependent. Besides CUDA, it requires cuDNN and TensorRT. While this is still feasible to include somehow (either statically or configurable through balenaCloud), it becomes a mess once you need to include the application-specific gstreamer plugins in the hostOS. To give a small snippet (not optimized):
# NVIDIA
IMAGE_INSTALL_append = " cuda-driver cuda-toolkit nvidia-container-runtime cuda-samples nvidia-docker cudnn tensorrt libvisionworks libvisionworks-sfm libvisionworks-tracking tegra-tools tegra-argus-daemon"
# gstreamer and plugings
## nvidia specific packages
IMAGE_INSTALL_append = " gstreamer1.0-omx-tegra gstreamer1.0-plugins-nveglgles gstreamer1.0-plugins-nvvideo4linux2 gstreamer1.0-plugins-nvvideosinks"
## most of these are pulled in as dependencies of the nvidia specific packages
## specify them explicitly as dependencies here to ensure they are included
## TODO: check depends and cleanup
IMAGE_INSTALL_append = " gstreamer1.0 gstreamer1.0-meta-base gstreamer1.0-plugins-base gstreamer1.0-plugins-bad"
IMAGE_INSTALL_append = " gstreamer1.0-plugins-good gstreamer1.0-python gstreamer1.0-rtsp-server gstreamer1.0-vaapi"
As our goal is clear, how do you see this fitting into the balena ecosystem? A 'one image to rule them all' approach would not work for all applications, I guess.
[robertgzr] This issue has attached support thread https://jel.ly.fish/#/de9ddbf3-0b65-4cba-a2e2-38e43855f1bd
@dremsol how difficult do you think it would be to run the tegra-argus-daemon itself in a container as well?
Then you could just share the socket with your app container, and it would give you full control over the dependencies too.
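Something along these lines, maybe (image names are placeholders, untested): run the daemon in its own privileged container and share /tmp through a named volume so the app container sees the socket:
balena run -d --name argus --privileged -v argus-socket:/tmp argus-daemon-image
balena run --rm -it --gpus all -v argus-socket:/tmp app-image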
Hi @robertgzr,
Good suggestion, we haven't tested that so far. We forked balena-jetson and got nvidia-container-runtime working with balena-engine. Besides CUDA, we included cuDNN, TensorRT, and VisionWorks (jetson-nano), as required by the NGC l4t containers, with some minor changes in nvidia-container-runtime (runc vs. balena-runc).
root@balena:~# balena run -it --rm --net=host --runtime nvidia nvcr.io/nvidia/deepstream-l4t:4.0.2-19.12-base
Unable to find image 'nvcr.io/nvidia/deepstream-l4t:4.0.2-19.12-base' locally
4.0.2-19.12-base: Pulling from nvidia/deepstream-l4t
8aaa03d29a6e: Pull complete
......
bcac47627c16: Pull complete
Total: [==================================================>] 559.6MB/559.6MB
Digest: sha256:58c0e19332824da544b72c5eae063d1f1a0ea876af76a8e519dd71aeb023d1de
Status: Downloaded newer image for nvcr.io/nvidia/deepstream-l4t:4.0.2-19.12-base
root@balena:~#
Depending on the application, the following packages may be installed in the hostOS, where the container-runtime-csv bbclass makes the appropriate nvidia runtime links (see the note after this list for what those links look like):
./external/openembedded-layer/recipes-multimedia/v4l2apps/v4l-utils_%.bbappend:inherit container-runtime-csv
./recipes-devtools/visionworks/libvisionworks-sfm_0.90.4.bb:inherit nvidia_devnet_downloads container-runtime-csv
./recipes-devtools/visionworks/libvisionworks_1.6.0.500n.bb:inherit nvidia_devnet_downloads container-runtime-csv
./recipes-devtools/visionworks/libvisionworks-tracking_0.88.2.bb:inherit nvidia_devnet_downloads container-runtime-csv
./recipes-devtools/gie/tensorrt_6.0.1-1.bb:inherit nvidia_devnet_downloads container-runtime-csv
./recipes-devtools/cudnn/cudnn_7.6.3.28-1.bb:inherit nvidia_devnet_downloads container-runtime-csv
./recipes-devtools/cuda/cuda-shared-binaries-10.0.326-1.inc:inherit container-runtime-csv
./recipes-devtools/cuda/cuda-cudart_10.0.326-1.bb:inherit container-runtime-csv siteinfo
./recipes-bsp/tegra-binaries/gstreamer1.0-plugins-tegra_32.3.1.bb:inherit container-runtime-csv
./recipes-bsp/tegra-binaries/tegra-libraries_32.3.1.bb:inherit container-runtime-csv
./recipes-bsp/tegra-binaries/tegra-firmware_32.3.1.bb:inherit container-runtime-csv
./recipes-bsp/tegra-binaries/libdrm-nvdc_32.3.1.bb:inherit container-runtime-csv
./recipes-bsp/tegra-binaries/tegra-nvphs-base_32.3.1.bb:inherit container-runtime-csv
./recipes-multimedia/libv4l2/libv4l2-minimal_1.18.0.bb:inherit autotools gettext pkgconfig container-runtime-csv
./recipes-multimedia/gstreamer/gstreamer1.0-plugins-nvjpeg_1.14.0-r32.3.1.bb:inherit autotools gtk-doc gettext pkgconfig container-runtime-csv
./recipes-multimedia/gstreamer/gstreamer1.0-omx-tegra_1.0.0-r32.3.1.bb:inherit autotools pkgconfig gettext container-runtime-csv
./recipes-multimedia/gstreamer/gstreamer1.0-plugins-nveglgles_1.2.3-r32.3.1.bb:inherit autotools gettext gobject-introspection pkgconfig container-runtime-csv
./recipes-multimedia/gstreamer/gstreamer1.0-plugins-nvvideo4linux2_1.14.0-r32.3.1.bb:inherit gettext pkgconfig container-runtime-csv
./recipes-multimedia/gstreamer/gstreamer1.0-plugins-nvvideosinks_1.14.0-r32.3.1.bb:inherit gettext pkgconfig container-runtime-csv
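For context, the .csv files that container-runtime-csv generates (and that libnvidia-container's jetson support consumes) are plain "<type>, <path>" lists of what to bind into the container, roughly like this (the file name and entries below are illustrative only):
# e.g. /etc/nvidia-container-runtime/host-files-for-container.d/<package>.csv might contain:
lib, /usr/lib/libcuda.so.1.1
sym, /usr/lib/libcuda.so.1
dev, /dev/nvhost-ctrl-gpu
dir, /usr/local/cuda-10.0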
As nvidia-container-runtime expects JetPack as the hostOS, it's not yet clear to me which packages are really necessary besides the ones already included. Anyway, we had to include the nvidia-specific gstreamer packages in our custom OS to get our application running within the deepstream container. ~~I've tried to include them with Balena but didn't succeed so far, as balena-jetson depends on warrior (vs. zeus in meta-tegra to support nvidia-container-runtime).~~
➜ resin-image git:(master) cat installed-package-sizes.txt | head -n 10
436914 KiB libcudnn7
218999 KiB tensorrt
106699 KiB tegra-libraries
91930 KiB cuda-cublas
52629 KiB balena
39126 KiB kernel-image-initramfs
35853 KiB go-runtime
27149 KiB libvisionworks
Hi @robertgzr & @acostach,
Had a good talk with Joe today and he asked me to keep you updated. It seems that the nvidia runtime is working nicely with balena-engine and the Host packages are being mapped accordingly by using the mount plugin.
Running the deviceQuery sample returns a PASS:
root@balena:/tmp/deviceQuery# balena run -it --runtime nvidia devicequery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA Tegra X1"
CUDA Driver Version / Runtime Version 10.0 / 10.0
CUDA Capability Major/Minor version number: 5.3
Total amount of global memory: 3962 MBytes (4154109952 bytes)
( 1) Multiprocessors, (128) CUDA Cores/MP: 128 CUDA Cores
GPU Max Clock rate: 922 MHz (0.92 GHz)
Memory Clock rate: 1600 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS
@dremsol this sounds amazing. With balena-engine 19.03 finally merged into its main repo, we're one step closer to making all of this happen in vanilla balenaOS. I have lifted the meta-balena PR out of draft status here: balena-os/meta-balena#1824 and it's under review right now. Once we merge the new engine there, work on this issue should pick up again.
You're using nvidia-container-runtime (the previous iteration of GPU support), while I mostly tried to make this work through https://github.com/NVIDIA/container-toolkit/ which is the approach that Docker "blessed".
I don't see why the mount plugin work shouldn't be possible there, as long as libnvidia-container has the changes on its jetson branch.
Hi @robertgzr, we are very happy to hear that GPU support is moving to production. We will keep an eye on the PR.
Thanks for the suggestion, and it seems you are right. It's a bit hard to follow NVIDIA's footsteps sometimes, but we managed to drop the dependency on the runtime. However, this also drops the inclusion of l4t.csv, which has been solved in nvidia-container-toolkit since libnvidia-container parses the .csv files.
- Why is NVIDIA referring to --runtime nvidia everywhere if this is obsolete?
Based on the work of @acostach in jetson-nano-sample-app, we would like to run all the CUDA samples in a stripped-down version of the Dockerfile to test the --gpus all flag and the plugin mounts. We managed to get ./clock and ./deviceQuery working. However, for the remaining samples involving OpenGL we stumble upon some OpenGL-related errors after building and firing up the container as follows:
balena build -t cudasamples -f Dockerfile.cudesamples .
balena run -it --rm --privileged --gpus all cudasamples bash
And setting DISPLAY and running X
$ export DISPLAY=:0
$ X &
$ ./clock <PASS>
$ ./deviceQuery <PASS>
$ ./postProcessGL <FAIL>
$ ./simpleGL <FAIL>
$ ./simpleTexture3D <FAIL>
$ ./smokeParticles <FAIL>
Failed sample outputs look like:
simpleTexture3D Starting...
GPU Device 0: "NVIDIA Tegra X1" with compute capability 5.3
CUDA error at simpleTexture3D.cpp:247 code=30(cudaErrorUnknown) "cudaGraphicsGLRegisterBuffer(&cuda_pbo_resource, pbo, cudaGraphicsMapFlagsWriteDiscard)"
simpleGL (VBO) starting...
GPU Device 0: "NVIDIA Tegra X1" with compute capability 5.3
CUDA error at simpleGL.cu:422 code=30(cudaErrorUnknown) "cudaGraphicsGLRegisterBuffer(vbo_res, *vbo, vbo_res_flags)"
CUDA error at simpleGL.cu:434 code=33(cudaErrorInvalidResourceHandle) "cudaGraphicsUnregisterResource(vbo_res)"
./postProcessGL Starting...
(Interactive OpenGL Demo)
GPU Device 0: "NVIDIA Tegra X1" with compute capability 5.3
CUDA error at main.cpp:243 code=30(cudaErrorUnknown) "cudaGraphicsGLRegisterBuffer(pbo_resource, *pbo, cudaGraphicsMapFlagsNone)"
CUDA Smoke Particles Starting...
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
The following required OpenGL extensions missing:
GL_ARB_multitexture
GL_ARB_vertex_buffer_object
GL_EXT_geometry_shader4.
@acostach, have you seen these errors before? It seems like it has something to do with X, but I can't figure out the cause. dmesg shows the HDMI being plugged and unplugged, and when running the samples the display blinks briefly before the crash. Do you have a clue?
@robertgzr I think I answered my own question: it seems compose doesn't support the --gpus all flag yet, as seen in the following issue.
Anyway, it shouldn't be a problem to install the runtime (--runtime nvidia) alongside the toolkit, as both flags will probably work.
@dremsol
Why is NVIDIA referring to --runtime nvidia everywhere as this is obsolete?
I know, that has been a major pain when researching this topic. I guess plenty of people out there are still using the old approaches... but there are just so many repos that claim to be the one, and container-toolkit, for example, doesn't even come with a README yet is essential for the whole thing to work.
You are right, upstream compose-file support isn't progressing much: https://github.com/docker/compose/pull/7124
But I hope that won't really be an issue, because you can communicate basically the same thing using env vars; check out their base images here.
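For example, with the nvidia runtime installed, something like this sketch should be expressible from a compose file today via environment variables:
balena run --rm -it --runtime nvidia \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  nvcr.io/nvidia/l4t-base:r32.4.2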
@acostach should we try to cut down the set of commits on the WIP branch? We should only need to unmask the cuda recipe, include the container-toolkit, and bump meta-balena, no? I guess the rootfs size needs to be investigated, but I would leave that until the very end...