nvidia-docker

vulkan is broken

Open qhaas opened this issue 3 years ago • 5 comments

1. Issue or feature description

Vulkan appears to be broken on Enterprise Linux 8.3 x86-64 hosts. It has worked for us before; we are not sure what changed or when. OpenGL appears to be working fine: the X window appears and renders OpenGL applications' GUIs as expected. Vulkan applications launch, the window briefly appears, and then the application segfaults. I doubt it is a bug in the nvidia vulkan image itself, since I can run OpenGL and Vulkan applications from it after converting it into a Singularity container.

2. Steps to reproduce the issue

  1. This BASH function is used to facilitate viewing the X-window on the host from the container:
# Based on: http://wiki.ros.org/docker/Tutorials/GUI
xForwardDockerRunArgs() {
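 # Temporary Xauthority file that gets bind-mounted into the container
 XAUTH=$(mktemp)
 XSOCK='/tmp/.X11-unix'
 # Set the family field to ffff (wildcard) so the cookie matches any hostname
 xauth nlist "${DISPLAY}" | sed -e 's/^..../ffff/' | xauth -f "${XAUTH}" nmerge -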
 echo "-v ${XSOCK}:${XSOCK}:rw -v ${XAUTH}:${XAUTH}:rw -e XAUTHORITY=${XAUTH} -e DISPLAY=${DISPLAY}"
}
  2. Launch the nvidia vulkan container with docker run --net=host --rm -it $(xForwardDockerRunArgs) --gpus=all nvidia/vulkan:1.1.121-cuda-10.1-beta.1-ubuntu18.04
  3. Deploy glxgears and vulkan-smoketest inside the container with apt-get update && apt-get install -y vulkan-utils mesa-utils
  4. Verify that glxgears launches an X window with spinning gears and uses the GPU: run glxgears in the container and, after the window appears, run nvidia-smi on the host. This implies that 'x-forwarding' to the host is working and that OpenGL is using the GPU.
  5. Run vulkan-smoketest and watch a black window briefly appear and disappear, with Segmentation fault (core dumped) in the container's terminal. dmesg | tail on the host reports something like segfault at 0 ip ... sp ... error 4 in vulkan-smoketest...

For a sanity check with Singularity 3.7, convert the same image to a Singularity image and run it; Vulkan works fine:

$ cat vulkan.def 
Bootstrap: docker
From: nvidia/vulkan:1.1.121-cuda-10.1-beta.1-ubuntu18.04

%post
	apt-get update && apt-get install -y mesa-utils vulkan-utils && apt-get clean
$ singularity build --fakeroot vulkan.sif vulkan.def
...
$ singularity exec --nv vulkan.sif vulkan-smoketest

3. Information to attach (optional if deemed irrelevant)

  • [x] Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info:
    nvidia-container-cli.txt

  • [x] Kernel version from uname -a: Linux fedorarouge 4.18.0-240.15.1.el8_3.x86_64 #1 SMP Wed Feb 3 03:12:15 EST 2021 x86_64 x86_64 x86_64 GNU/Linux

  • [x] Any relevant kernel output lines from dmesg

    [ 6991.274805] vulkan-smoketes[60028]: segfault at 0 ip 00005588fe864b7e sp 00007ffd3c596460 error 4 in vulkan-smoketest[5588fe84e000+2c000]
    [ 6991.274810] Code: 29 c8 48 c1 f8 03 48 39 c6 77 61 73 09 48 8d 04 f1 48 89 44 24 18 4c 89 ea 4c 89 e6 48 89 ef ff 15 77 57 21 00 48 8b 7c 24 10 <48> 8b 07 48 c7 83 f8 00 00 00 00 00 00 00 48 c7 83 00 01 00 00 ff
    
  • [x] Driver information from nvidia-smi -a: nvidia-smi.txt

  • [x] Docker version from docker version docker.txt

  • [x] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*' nvidia_rpm.txt

  • [x] NVIDIA container library version from nvidia-container-cli -V:
    nvidia_container_cli.txt

  • [x] NVIDIA container library logs (see troubleshooting): nvidia-container-toolkit.txt

  • [x] Docker command, image and tag used: docker run --net=host --rm -it $(xForwardDockerRunArgs) --gpus=all nvidia/vulkan:1.1.121-cuda-10.1-beta.1-ubuntu18.04
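
As a reading aid for the dmesg line above, here is a minimal sketch that decodes the low three bits of the "error N" field, assuming the standard x86 page-fault error-code layout (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode). The function name decodeSegfaultError is hypothetical and not part of any NVIDIA or kernel tooling.

```shell
# Hypothetical helper: decode the "error N" value from a kernel segfault line.
# Assumes the x86 page-fault error-code bit layout; higher bits are ignored.
decodeSegfaultError() {
  code="$1"
  # Bit 0: set = protection violation, clear = page not present
  if [ $(( ${code} & 1 )) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
  # Bit 1: set = write access, clear = read access
  if [ $(( ${code} & 2 )) -ne 0 ]; then access="write"; else access="read"; fi
  # Bit 2: set = fault raised in user mode, clear = kernel mode
  if [ $(( ${code} & 4 )) -ne 0 ]; then mode="user mode"; else mode="kernel mode"; fi
  echo "${access} access, ${cause}, ${mode}"
}

# Example: the value reported above
decodeSegfaultError 4
```

Under that assumption, "segfault at 0 ... error 4" reads as a user-mode read of an unmapped page at address 0, which is consistent with vulkan-smoketest dereferencing a NULL pointer.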

qhaas • Apr 01 '21 13:04

Also tested on a PopOS 18.04 (Ubuntu 18.04 based) system with a Pascal GeForce, nvidia driver 460.67, nvidia-container 1.3.3, and docker CE 20.10.5... same error

qhaas • Apr 06 '21 14:04

Could this possibly be related to: https://github.com/NVIDIA/libnvidia-container/issues/134

klueska • Apr 13 '21 08:04

> Could this possibly be related to: NVIDIA/libnvidia-container#134

Thanks; that is unlikely to be the same issue, since vulkan-smoketest works fine outside of containers (i.e. bare-metal) and also works from inside a Singularity container. The tested systems each had only one GPU.

qhaas • Apr 13 '21 12:04

@qhaas Did you figure out a workaround?

diadatp • Sep 07 '21 12:09

Granting access to the XAUTHORITY file and the X socket as you describe works for OpenGL, but it is not sufficient for Vulkan. You also have to enable the display capability in NVIDIA_DRIVER_CAPABILITIES, which is not set in the nvidia/vulkan base image (see driver-capabilities).

So instead of --gpus=all, use --gpus='all,"capabilities=compute,utility,graphics,display"' --env DISPLAY=$DISPLAY. This cost me days to find out :disappointed:
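
For anyone scripting this, here is a minimal sketch of the same idea as two small shell helpers. Both names (gpuRunArgs, hasCapability) are hypothetical and not part of docker or the NVIDIA tooling; the capability list is the one from this comment. The first prints the --gpus argument so it can sit next to $(xForwardDockerRunArgs) from the original report. The second checks a NVIDIA_DRIVER_CAPABILITIES-style comma-separated list for a given capability, treating "all" as a match.

```shell
# Hypothetical helper: print a --gpus argument that requests the listed
# driver capabilities. The embedded double quotes are passed through to
# docker, which needs them because the capability list contains commas.
gpuRunArgs() {
  caps="${1:-compute,utility,graphics,display}"
  printf '%s\n' "--gpus=all,\"capabilities=${caps}\""
}

# Hypothetical helper: succeed if the comma-separated list in $1 contains
# the capability in $2 (or the catch-all value "all").
hasCapability() {
  case ",$1," in
    *",all,"*|*",$2,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage sketch (assumes xForwardDockerRunArgs from the original report):
#   docker run --net=host --rm -it $(xForwardDockerRunArgs) $(gpuRunArgs) \
#     nvidia/vulkan:1.1.121-cuda-10.1-beta.1-ubuntu18.04
# Inside the container, a quick check could look like:
#   hasCapability "${NVIDIA_DRIVER_CAPABILITIES:-}" display || echo "display capability missing"
```

The command substitution is deliberately left unquoted in the usage sketch so the output is passed to docker as a single word with the inner double quotes intact.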

denwi248 • Nov 09 '21 15:11