nvidia-docker
vulkan is broken
1. Issue or feature description
Vulkan appears to be broken on Enterprise Linux 8.3 x86-64 hosts. It has worked for us before, and we are not sure what changed or when. OpenGL appears to be working fine: the X window appears and renders such applications' GUIs as expected. Vulkan applications launch, the window briefly appears, and then the application segfaults. I doubt it is a bug in the nvidia vulkan container image, since I can run OpenGL/Vulkan applications after converting it into a Singularity container.
2. Steps to reproduce the issue
- This Bash function is used to facilitate viewing the X window on the host from the container:
# Based on: http://wiki.ros.org/docker/Tutorials/GUI
xForwardDockerRunArgs() {
    # Temporary Xauthority file that is bind-mounted into the container
    XAUTH=$(mktemp)
    XSOCK='/tmp/.X11-unix'
    # Rewrite the family field to ffff (wildcard) so the cookie matches any hostname
    xauth nlist "${DISPLAY}" | sed -e 's/^..../ffff/' | xauth -f "${XAUTH}" nmerge -
    echo "-v ${XSOCK}:${XSOCK}:rw -v ${XAUTH}:${XAUTH}:rw -e XAUTHORITY=${XAUTH} -e DISPLAY=${DISPLAY}"
}
- Launch the nvidia vulkan container with
docker run --net=host --rm -it $(xForwardDockerRunArgs) --gpus=all nvidia/vulkan:1.1.121-cuda-10.1-beta.1-ubuntu18.04
- Deploy `glxgears` and `vulkan-smoketest` inside the container with `apt-get update && apt-get install -y vulkan-utils mesa-utils`
- Verify `glxgears` launches an X window with spinning gears and uses the GPU; this implies 'x-forwarding' to the host is working and OpenGL is using the GPU. Run `glxgears` in the container and, after the window appears, run `nvidia-smi` on the host.
- Run `vulkan-smoketest` and watch a black window briefly appear/disappear with `Segmentation fault (core dumped)` in the container's terminal. `dmesg | tail` on the host reports something like `segfault at 0 ip ... sp ... error 4 in vulkan-smoketest...`
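The `error 4` field in that kernel line can be decoded as a rough aid. On x86 it is the page-fault error code, where the three low bits encode the cause, the access type, and the privilege mode. A minimal sketch follows, assuming an x86 host; the function name `decodeSegfaultError` is hypothetical and higher bits of the error code are ignored.

```shell
# Decode the numeric "error N" field of an x86 kernel segfault line.
# Bit 0: 0 = page not present, 1 = protection violation
# Bit 1: 0 = read access,      1 = write access
# Bit 2: 0 = kernel mode,      1 = user mode
# (decodeSegfaultError is a hypothetical helper, not part of any tool.)
decodeSegfaultError() {
    code=$1
    if [ $(( code & 4 )) -ne 0 ]; then mode="user-mode"; else mode="kernel-mode"; fi
    if [ $(( code & 2 )) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $(( code & 1 )) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    echo "${mode} ${access}, ${cause}"
}

decodeSegfaultError 4   # prints: user-mode read, page not present
```

Combined with `segfault at 0`, this reading suggests the application read through a NULL pointer, which would be consistent with a Vulkan call returning no usable handle rather than with an X forwarding failure.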
As a sanity check with Singularity 3.7, convert the same image to a Singularity image and run it; Vulkan works fine:
$ cat vulkan.def
Bootstrap: docker
From: nvidia/vulkan:1.1.121-cuda-10.1-beta.1-ubuntu18.04
%post
apt-get update && apt-get install -y mesa-utils vulkan-utils && apt-get clean
$ singularity build --fakeroot vulkan.sif vulkan.def
...
$ singularity exec --nv vulkan.sif vulkan-smoketest
3. Information to attach (optional if deemed irrelevant)
- [x] Some nvidia-container information: `nvidia-container-cli -k -d /dev/tty info`: nvidia-container-cli.txt
- [x] Kernel version from `uname -a`: Linux fedorarouge 4.18.0-240.15.1.el8_3.x86_64 #1 SMP Wed Feb 3 03:12:15 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
- [x] Any relevant kernel output lines from `dmesg`:
[ 6991.274805] vulkan-smoketes[60028]: segfault at 0 ip 00005588fe864b7e sp 00007ffd3c596460 error 4 in vulkan-smoketest[5588fe84e000+2c000]
[ 6991.274810] Code: 29 c8 48 c1 f8 03 48 39 c6 77 61 73 09 48 8d 04 f1 48 89 44 24 18 4c 89 ea 4c 89 e6 48 89 ef ff 15 77 57 21 00 48 8b 7c 24 10 <48> 8b 07 48 c7 83 f8 00 00 00 00 00 00 00 48 c7 83 00 01 00 00 ff
- [x] Driver information from `nvidia-smi -a`: nvidia-smi.txt
- [x] Docker version from `docker version`: docker.txt
- [x] NVIDIA packages version from `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'`: nvidia_rpm.txt
- [x] NVIDIA container library version from `nvidia-container-cli -V`: nvidia_container_cli.txt
- [x] NVIDIA container library logs (see troubleshooting): nvidia-container-toolkit.txt
- [x] Docker command, image and tag used:
docker run --net=host --rm -it $(xForwardDockerRunArgs) --gpus=all nvidia/vulkan:1.1.121-cuda-10.1-beta.1-ubuntu18.04
Also tested on a PopOS 18.04 (Ubuntu 18.04 based) system with a Pascal GeForce, nvidia driver 460.67, nvidia-container 1.3.3, and docker CE 20.10.5... same error
Could this possibly be related to: https://github.com/NVIDIA/libnvidia-container/issues/134
Thanks; that is unlikely to be the same issue, since `vulkan-smoketest` works fine outside of containers (i.e. bare-metal) and it also works from inside a Singularity container. The tested systems had only one GPU.
@qhaas Did you figure out a workaround?
While granting access to the XAUTHORITY file and the X socket as you describe works for OpenGL, it somehow does not work for Vulkan. You have to activate the `display` option for NVIDIA_DRIVER_CAPABILITIES, which is not set in the nvidia/vulkan base image (see driver-capabilities).
So instead of `--gpus=all` use `--gpus='all,"capabilities=compute,utility,graphics,display"' --env DISPLAY=$DISPLAY`.
This cost me days to find out :disappointed:
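For anyone scripting this, the nested quoting in the `--gpus` value is easy to get wrong. Below is a minimal sketch of a helper that assembles it from a list of capabilities; `gpusArg` is a hypothetical name, and the final line only prints a candidate command (reusing `xForwardDockerRunArgs` from the original report) rather than running it.

```shell
# Hypothetical helper: build the value for docker's --gpus flag from a
# list of driver capabilities. The inner double quotes keep the
# comma-separated capability list together as one field.
gpusArg() {
    local IFS=','
    printf 'all,"capabilities=%s"\n' "$*"
}

# Print the resulting command for inspection instead of running it.
echo "docker run --net=host --rm -it \$(xForwardDockerRunArgs) --gpus='$(gpusArg compute utility graphics display)' nvidia/vulkan:1.1.121-cuda-10.1-beta.1-ubuntu18.04"
```

Printing the command first makes it easy to confirm that `graphics` and `display` are both present before launching the container.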