Enable the proprietary NVIDIA driver
First, great project!
If I'm using the NVIDIA proprietary driver, OpenGL software (like Blender) doesn't work inside a toolbox container. I tried to install the proprietary driver inside the container; it installs, but the OpenGL software still doesn't work. Is it necessary to install more things, or to set some environment variable?
Thanks!
~~Toolbox is a container; you would have to map your graphics card inside, or do things the way nvidia-docker does.~~
The reply further down https://github.com/containers/toolbox/issues/116#issuecomment-494447743 works perfectly.
@Findarato Do you mean adding something like --volume /dev/nvidia0:/dev/nvidia0 and the other /dev files?
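For example, something like this (an illustrative sketch only; the device nodes vary per machine, and the image name is just a placeholder):

```sh
# Hypothetical manual equivalent: bind the NVIDIA device nodes into a container.
# Other nodes (/dev/nvidia-uvm, /dev/nvidia-modeset, ...) may be needed too.
podman create --name my-toolbox \
    --volume /dev/nvidiactl:/dev/nvidiactl \
    --volume /dev/nvidia0:/dev/nvidia0 \
    registry.fedoraproject.org/fedora-toolbox:35
```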
So to have the NVIDIA stuff working inside the Toolbox I had to do this (inspired by https://github.com/thewtex/docker-opengl-nvidia):
1. Patch the Toolbox to bind mount /dev/nvidia0 and /dev/nvidiactl into the Toolbox and set up the X11 bits - see https://github.com/tpopela/toolbox/commit/40231e8591d70065199c0df9b6811c2f9e9d7269
2. Download the NVIDIA proprietary driver on the host:

```sh
#!/bin/sh
# Get the current host NVIDIA driver version, e.g. 340.24
nvidia_version=$(cat /proc/driver/nvidia/version | head -n 1 | awk '{ print $8 }')
# The driver in the image must match the one on the host
if test ! -f ~/nvidia-driver.run; then
    nvidia_driver_uri=http://us.download.nvidia.com/XFree86/Linux-x86_64/${nvidia_version}/NVIDIA-Linux-x86_64-${nvidia_version}.run
    wget -O ~/nvidia-driver.run $nvidia_driver_uri
fi
```

3. Install the driver from inside the Toolbox:

```sh
#!/bin/sh
sudo dnf install -y glx-utils kmod libglvnd-devel || exit 1
sudo sh ~/nvidia-driver.run -a -N --ui=none --no-kernel-module || exit 1
glxinfo | grep "OpenGL version"
```
@tpopela it worked. Thanks!
I'm glad it worked! But there was a mistake that could lead to malfunctions after the host is restarted - you will need to apply https://github.com/tpopela/toolbox/commit/3db450a8e5762399fd81c848f311da950437dd04 on top of the previous patch.
@tpopela We might be able to get away without bind mounting /tmp/.X11-unix. These days the X.org server listens both on an abstract UNIX socket and on a UNIX socket on the file system. The former doesn't work if you have a network namespace, but the Toolbox container doesn't have one (because of podman create --net host), and that's why X applications work. The latter is located at /tmp/.X11-unix and is what Flatpak containers use, because those do have network namespaces. (See the sketch after the references below.)
References:
- https://github.com/flatpak/flatpak/issues/938
- http://man7.org/linux/man-pages/man7/unix.7.html
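A quick way to see both sockets on the host and why the abstract one is reachable from a Toolbox container (a minimal sketch; the exact ss output format varies):

```sh
# On the host: the X server usually listens on two UNIX sockets. The abstract
# one shows up with a leading '@', the filesystem one lives under /tmp/.X11-unix.
ss -xl | grep -i x11

# Toolbox containers share the host's network namespace (podman create --net host),
# so X clients inside them can reach the abstract socket even without
# /tmp/.X11-unix being bind mounted.
ls /tmp/.X11-unix
```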
Ah ok @debarshiray! Thank you for the clarification. I can confirm that not bind mounting /tmp/.X11-unix doesn't change anything and the integration works (I tried running Blender here).
There might be one small change now that we bind mount the whole /dev: Blender now looks for nvcc (the CUDA compiler) in PATH and can't find it.
With the merge of https://github.com/debarshiray/toolbox/pull/119 this issue may be closed, since the proprietary NVIDIA driver now works. You just need to install the NVIDIA driver once inside the toolbox container; @tpopela's scripts help with the driver installation. @tpopela you also have to install the CUDA Toolkit. To get it to install I passed the --override and --toolkit parameters. After installing the CUDA Toolkit, Blender shows me the option to render using CUDA. But unfortunately CUDA doesn't work with GCC 9 :(
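For reference, the CUDA Toolkit install was roughly like this (a sketch only; the runfile name is a placeholder and the exact flags can differ between CUDA releases):

```sh
# Placeholder runfile name; use the CUDA Toolkit .run installer from NVIDIA.
# --toolkit installs only the toolkit (no kernel driver), --override skips the
# host compiler version check.
sudo sh ~/cuda_<version>_linux.run --toolkit --override
```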
Actually I would leave this open (though I will leave that to Rishi), as we were thinking with @debarshiray about leaking the NVIDIA host drivers to the container, so there will be no need to manually install the drivers in the container. We have a working WIP solution for it.
That would be great!
> we were thinking with @debarshiray about leaking the NVIDIA host drivers to the container, so there will be no need to manually install the drivers in the container.
Yes, I agree that this will be the right thing to do. OpenGL drivers have a kernel module and some user-space components (e.g., shared libraries) that talk to each other. In NVIDIA's case the interface between these two components isn't stable, and hence the user-space bits inside the container must match the kernel module on the host. These two can go out of sync if your host is lagging behind the container or vice versa.
The problem with leaking the files into the container is maintaining a list of those files somewhere because they vary from version to version. This would be vastly simpler if there was a well known nvidia directory somewhere on the host that could be bind mounted because then we wouldn't have to worry about the names and locations of the individual files themselves. Unfortunately that's not the case.
Looking around, I found Flatpak's solution to be a reasonable compromise. In short, it invents and enforces this well known nvidia directory. It expects distributors of the host OS to put all the user-space files in /var/lib/flatpak/extension/org.freedesktop.Platform.GL.host/x86_64/1.4 and that's implemented by modifying the package shipping the NVIDIA driver.
With that done, we'd need to figure out where to place these files inside the container and how to point the container's runtime environment at them.
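For illustration, if such a well-known directory existed, the container-side plumbing could be as simple as a bind mount plus an ldconfig path entry (a sketch under that assumption; the host path is the Flatpak extension directory mentioned above, while the container-side location and names are made up here):

```sh
# Hypothetical: expose the host's NVIDIA user-space driver (packaged as an
# unmanaged Flatpak extension) to a container and make the loader find it.
NVIDIA_DIR=/var/lib/flatpak/extension/org.freedesktop.Platform.GL.host/x86_64/1.4

podman create --name my-toolbox \
    --volume "$NVIDIA_DIR":/usr/lib64/nvidia:ro \
    registry.fedoraproject.org/fedora-toolbox:35

# Inside the container, point the dynamic linker at the leaked libraries:
echo /usr/lib64/nvidia | sudo tee /etc/ld.so.conf.d/nvidia.conf
sudo ldconfig
```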
NVIDIA has its own solution for this, nvidia-container-runtime-hook, which works very well with podman when triggered by an OCI prestart hook. I just ran into an issue at the moment when using --uidmap, resulting in losing the permissions to run ldconfig:
could not start /sbin/ldconfig: mount operation failed: /proc: operation not permitted
It may be better for toolbox to try and integrate with this existing tool rather than maintaining another implementation.
Issue relating to the uidmap permission problem:
https://github.com/NVIDIA/libnvidia-container/issues/49
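For context, the hook-based flow from the user's side looks roughly like this (a sketch, assuming NVIDIA's own packages; the hook file location can differ by distribution):

```sh
# nvidia-container-toolkit installs an OCI prestart hook definition (typically
# under /usr/share/containers/oci/hooks.d/), which podman picks up automatically.
sudo dnf install -y nvidia-container-toolkit   # from NVIDIA's repositories

# The hook then injects the host driver when these variables are set:
podman run --rm -it \
    -e NVIDIA_VISIBLE_DEVICES=all \
    -e NVIDIA_DRIVER_CAPABILITIES=all \
    fedora:latest nvidia-smi
```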
I was trying to run Steam in the toolbox (bug #343). I didn't patch the toolbox; Steam runs and OpenGL works, but Vulkan doesn't seem to work. I tried vkmark and Rise of the Tomb Raider on Steam.
Any ideas how to get it to work?
I saw that the Singularity container fixes this problem without libnvidia-container. They use a list of the needed files.
So what is the status of using Nvidia GPU drivers in container in 2021?
I can see that /dev/nvidia0 and /dev/nvidiactl are mounted.
However, I cannot install the NVIDIA drivers successfully. The install proceeds normally, but checking with modinfo -F version nvidia gives an error:
modinfo: ERROR: Module alias nvidia not found.
And the NVIDIA Container Toolkit is not officially supported on Fedora, so it doesn't seem like a good idea to use it with Fedora Silverblue.
The latest version of toolbox (0.0.99.3) exposes the host file system at /run/host. I believe it should be possible to create a Containerfile something like this to expose the host's user-space driver to the container:

```Dockerfile
FROM registry.fedoraproject.org/fedora-toolbox:35

RUN ln -s /run/host/usr/share/vulkan/icd.d/nvidia_icd.json /usr/share/vulkan/icd.d/nvidia_icd.json && \
    ln -s /run/host/usr/lib64/libGLX_nvidia.so.0 /usr/lib64/libGLX_nvidia.so.0
```

I don't have an NVIDIA machine to test with at the moment, but I assume that would do it? The above example should hopefully work for Vulkan; I'm not exactly sure whether some extra file would need to be linked for OpenGL.
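If that works, using it with Toolbx would presumably look something like this (a sketch; the image name, tag and container name are arbitrary placeholders):

```sh
# Build a custom toolbox image from the Containerfile above and create a
# container from it.
podman build --tag fedora-toolbox-nvidia:35 .
toolbox create --image localhost/fedora-toolbox-nvidia:35 nvidia-test
toolbox enter nvidia-test
```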
Ok, so with the latest toolbox, I can install the NVIDIA drivers fine. On running nvidia-smi I get the correct output as well. However, the modinfo -F version nvidia command doesn't seem to work, so I am not sure whether the drivers are actually working.
So do you mean that reinstalling the NVIDIA driver inside the container is meant to fix the ldconfig setup? I remember there is a step to rerun ldconfig.
Reference: https://docs.01.org/clearlinux/latest/zh_CN/tutorials/nvidia.html#configure-alternative-software-paths
> So to have the NVIDIA stuff working inside the Toolbox I had to do this (inspired by https://github.com/thewtex/docker-opengl-nvidia): [...]
Just adding that this worked for me too. I hope that with the OSS version of their driver it will just work out of the box, like it does for all the AMD cards.
> Ok, so with the latest toolbox, I can install the NVIDIA drivers fine. On running nvidia-smi I get the correct output as well. However, the modinfo -F version nvidia command doesn't seem to work, so I am not sure whether the drivers are actually working.
@Ayush1325
Yes, the drivers are working, as I can compile with nvcc.
Yes, modinfo -F version nvidia does not work within the container.
I used the NVIDIA Fedora 35 repo (nvidia-driver and cuda) for both the host (F37) and the container (F35; matching gcc version). Beyond that, I added the NVIDIA bin folder to the PATH and set $LD_LIBRARY_PATH for each install.
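The PATH and LD_LIBRARY_PATH setup mentioned above is typically along these lines (a sketch; /usr/local/cuda is an assumption, adjust to wherever the NVIDIA/CUDA packages actually install):

```sh
# Hypothetical: make the CUDA compiler and libraries visible inside the
# container, e.g. from ~/.bashrc.
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
```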
What needs to be done for this?
- If you don't care about having users install nvidia-container-toolkit:

  ```sh
  podman run --rm -it --privileged --security-opt=label=disable \
      -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=all ubuntu
  ```
- If you want something entirely independent you can mount the relevant nvidia driver files into the container in a manner similar to:
- distrobox: https://github.com/89luca89/distrobox/pull/658/files
- singularity: https://github.com/apptainer/singularity/blob/master/etc/nvliblist.conf
- I don't think installing the nvidia driver inside the container is a sustainable solution because host/container should match.
I personally feel option 1 is more sustainable, and it's pretty simple (two appended environment variables and a host executable check for nvidia-container-toolkit). Would a PR for one of these options be accepted, @debarshiray, or should this just be documented?
> What needs to be done for this?
>
> [...]
>
> would a PR for one of these options be accepted @debarshiray or should this be documented?
Did you see my comment above? Unless there's a problem with it, I still prefer the unmanaged Flatpak extension option.
I finally got myself some NVIDIA hardware to play with this.
I see that the Container Device Interface requires installing the NVIDIA Container Toolkit.
As far as I can make out, the nvidia-container-toolkit or nvidia-container-toolkit-base packages are only available from NVIDIA's own repositories right now. For example, I am on Fedora 39, and even though they are supposed to be free software, I see them neither in Fedora proper nor RPMFusion, but RPMFusion does have NVIDIA's proprietary driver.
Is there anything else other than NVIDIA that uses the Container Device Interface?
I would like to understand the situation a bit better. Ultimately I want to make it as smooth as possible for the user to enable the NVIDIA proprietary driver. That becomes a problem if one needs to enable multiple different unofficial repositories, at least on Fedora.
I will start by reviving the pull request from @TingPing against negativo17's RPM for the proprietary NVIDIA driver, but against RPMFusion, because that's the implementation Fedora Workstation promotes these days. If nothing else, it will immediately help Flatpak because those containers will always have access to the driver. We can add the same plumbing to Toolbx and benefit similarly.
Thanks to @Jmennius and @owtaylor I changed my mind about how to enable the proprietary NVIDIA driver in Toolbx containers. Since Intel, NVIDIA and several container tools, including Podman, have embraced the Container Device Interface, it's a better path to take than the unmanaged Flatpak extension approach that I had mentioned before.
However, we need to be a bit careful when using the CDI. The way it's widely advertised requires root privileges, because podman run --device nvidia.com/gpu... expects the CDI file to be present in either /etc/cdi or /var/run/cdi. It's not possible to create the file with nvidia-ctk cdi generate and put it in those locations without root access. It would be good if we made it work entirely rootless.
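For comparison, the widely advertised rootful flow looks roughly like this (a sketch, assuming NVIDIA's nvidia-ctk tool is installed; Toolbx avoids these steps as described below):

```sh
# Generate a CDI specification for the installed NVIDIA driver and place it
# where container engines look for it by default (needs root).
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Any CDI-aware engine can then inject the GPU:
podman run --rm --device nvidia.com/gpu=all fedora:latest nvidia-smi
```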
So, I chose to use the Go packages from tags.cncf.io/container-device-interface and github.com/NVIDIA/nvidia-container-toolkit to create the Container Device Interface file ourselves during enter and run, make it available to init-container, and let it parse and apply it when the container starts. The CDI file is ultimately a bunch of environment variables, bind mounts and hooks to call ldconfig(8), so it's not that hard. Since Toolbx already makes the entire /dev from the host available to the container, we don't need to worry about the devices.
This avoids the need for root privileges, and has the extra benefit of enabling the driver in existing Toolbx containers.
I have just now merged an implementation using this approach through https://github.com/containers/toolbox/pull/1497 that seems to work with the NVIDIA Quadro P600 GPU on my ThinkPad P72 laptop. I have tested it with Arch Linux, Fedora, RHEL and Ubuntu containers on Fedora hosts. However, I haven't been able to test any non-Fedora host. Please feel free to open issues or send pull requests if you notice anything wrong.
Nice. I was looking for something like this. Is there documentation for this? Do I specify nvidia.com/gpu=all or something similar during toolbox create?
> Nice. I was looking for something like this. Is there documentation for this? Do I specify nvidia.com/gpu=all or something similar during toolbox create?
No, nothing. :)
Build Toolbx from Git main, stop all running containers, and start using the new Toolbx. If you have the proprietary NVIDIA driver installed on the host, then both existing and new containers will pick it up.
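To check that it was picked up, something like this should be enough (the driver's user-space bits, including nvidia-smi, are injected from the host; glxinfo comes from glx-utils and may need to be installed in the container first):

```sh
toolbox enter
nvidia-smi                          # should list the GPU and the host driver version
glxinfo | grep "OpenGL renderer"    # should report the NVIDIA renderer
```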