Add troubleshooting or how-to guide related to running Rerun in a Docker container

Open jleibs opened this issue 1 year ago • 10 comments

This keeps coming up, and even if we don't officially support it, it would be good to set expectations and point people in the right direction.

At time of writing, with 0.22.1 the following works.

Install nvidia-container-runtime.

Make sure /etc/docker/daemon.json contains:

{
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
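
A minimal sketch for checking that the daemon picked up this configuration, assuming Docker was restarted after editing daemon.json. The helper name `has_nvidia_runtime` is hypothetical; `docker info --format '{{json .Runtimes}}'` prints the registered runtimes as JSON.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the JSON on stdin contains an "nvidia" key.
has_nvidia_runtime() {
    grep -q '"nvidia"[[:space:]]*:'
}

# Only query the daemon when the docker CLI is installed.
if command -v docker >/dev/null 2>&1; then
    if docker info --format '{{json .Runtimes}}' 2>/dev/null | has_nvidia_runtime; then
        echo "nvidia runtime is registered"
    else
        echo "nvidia runtime not found; check /etc/docker/daemon.json and restart Docker"
    fi
fi
```

If the runtime is not listed, `--runtime=nvidia` in the run script further down is expected to fail.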

Make sure your docker image includes at least: libgtk-3-dev libxkbcommon-x11-0 vulkan-tools

Example Dockerfile for an Ubuntu image:

FROM ubuntu:22.04

RUN apt-get update && apt-get install -y python3-pip libgtk-3-dev libxkbcommon-x11-0 vulkan-tools
RUN python3 -m pip install rerun-sdk==0.22.1

Jump through all the hoops necessary for GPU access, X authentication, etc.

In particular the following two seem to be crucial:

  • --runtime=nvidia
  • -e NVIDIA_DRIVER_CAPABILITIES=all

Example run.sh script:

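#!/usr/bin/env bash
XSOCK=/tmp/.X11-unix
XAUTH=/tmp/.docker.xauth
# Create the Xauthority file for the container and copy the host's X cookie into it with a wildcard family.
touch "$XAUTH"
xauth nlist "$DISPLAY" | sed -e 's/^..../ffff/' | xauth -f "$XAUTH" nmerge -
chmod 777 "$XAUTH"
# The image tag must match the one passed to docker build.
docker run --runtime=nvidia --rm --gpus all -it --privileged --network=host -e NVIDIA_DRIVER_CAPABILITIES=all -e DISPLAY="$DISPLAY" -v "$XSOCK:$XSOCK" -v "$XAUTH:$XAUTH" -e XAUTHORITY="$XAUTH" rerun:0.22.1 rerun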

The above Dockerfile and run script can be used by running:

docker build -t rerun:0.22.1 .
./run.sh
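
If the viewer does not start, it can help to first check whether the container sees the GPU at all. The following is a sketch, not part of the original instructions: it runs `vulkaninfo --summary` (shipped with the vulkan-tools package installed above) in a throwaway container with the two flags called out as crucial. The function name `vulkan_check` and the `DRY_RUN` variable are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (DRY_RUN=1) or run a one-off container that
# lists the Vulkan devices visible inside the given image.
vulkan_check() {
    # "$1" (the image name) is expanded before the arguments are replaced.
    set -- docker run --rm --runtime=nvidia --gpus all \
        -e NVIDIA_DRIVER_CAPABILITIES=all \
        "$1" vulkaninfo --summary
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Usage: `vulkan_check rerun:0.22.1`. If only a CPU device such as llvmpipe is listed, the NVIDIA driver is probably not being exposed to the container.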

jleibs avatar Jul 09 '24 17:07 jleibs

Hi! I am able to display the Rerun viewer when I run this code on my SSH server, but the viewer is extremely slow, to the point of being unusable. Is there any way to improve this?

I also get this warning:

[2024-07-15T12:43:54Z WARN egui_winit::clipboard] Failed to initialize arboard clipboard: Unknown error while interacting with the clipboard: X11 server connection timed out because it was unreachable

Hillowwold avatar Jul 15 '24 12:07 Hillowwold

Hi! I am able to display the Rerun viewer when I run this code on my SSH server, but the viewer is extremely slow, to the point of being unusable. Is there any way to improve this?

Can you clarify, are you trying to run the viewer on a remote machine, and then tunnel the viewer over SSH, such as via X-forwarding? I would not expect this to work. The viewer is fairly graphics intensive and needs access to the local GPU.

To run on a remote machine, I would still recommend running the viewer locally, and then connecting to it from the client using the .connect() API. See: https://rerun.io/docs/reference/sdk-operating-modes

jleibs avatar Jul 15 '24 13:07 jleibs
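
One way to wire up the "viewer runs locally, SDK runs remotely" setup is an SSH reverse tunnel. This is a sketch under assumptions, not something confirmed in this thread: it assumes the 0.22.x TCP port 9876 that appears in the logs here, and `user@remote-host`, `rerun_tunnel`, `RERUN_PORT` and `DRY_RUN` are placeholder names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: forward the remote machine's port 9876 back to the
# viewer listening on this (local) machine, so that an SDK on the remote
# connecting to 127.0.0.1:9876 reaches the local viewer.
rerun_tunnel() {
    port="${RERUN_PORT:-9876}"
    # -N: no remote command, -R: reverse (remote-to-local) port forward.
    set -- ssh -N -R "${port}:localhost:${port}" "$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Usage: start the viewer locally, run `rerun_tunnel user@remote-host` in a second local terminal, then start the logging program on the remote machine.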

I just found out that if I put

RUN apt update && apt install -q -y --no-install-recommends \
    libgtk-3-dev \
    libxkbcommon-x11-0 \
    vulkan-tools

inside the Dockerfile, I get an error when running the rerun command:

[2025-02-24T09:37:29Z INFO  winit::platform_impl::linux::x11::window] Guessed window scale factor: 1
[2025-02-24T09:37:29Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_surface
[2025-02-24T09:37:29Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_xlib_surface
[2025-02-24T09:37:29Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_xcb_surface
[2025-02-24T09:37:29Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_wayland_surface
[2025-02-24T09:37:29Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_EXT_swapchain_colorspace
[2025-02-24T09:37:29Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_get_physical_device_properties2
libEGL warning: DRI3: Screen seems not DRI3 capable
libEGL warning: DRI2: failed to authenticate
[2025-02-24T09:37:29Z ERROR eframe::native::run] Exiting because of error: app creation error: The GPU/graphics driver is lacking some abilities: Adapter does not support drawing to texture format R32Float. Check the troubleshooting guide at https://rerun.io/docs/getting-started/troubleshooting and consider updating your graphics driver.
Error: app creation error: The GPU/graphics driver is lacking some abilities: Adapter does not support drawing to texture format R32Float. Check the troubleshooting guide at https://rerun.io/docs/getting-started/troubleshooting and consider updating your graphics driver.

But if I install these 3 packages manually in the created container's shell, rerun works fine. Why could that be?

Divelix avatar Feb 24 '25 09:02 Divelix

I was using the latest version, 0.22.1; after downgrading to 0.17.0 it runs fine without the manual apt install.

Divelix avatar Feb 24 '25 13:02 Divelix

@Divelix try setting the software rasterizer (or better, an actual driver) explicitly via an ICD file as described here. 0.17.0 will likely run into some runtime error once you try to interact with anything in 2D or 3D views.

Wumpf avatar Mar 03 '25 14:03 Wumpf

@Wumpf interaction with 2D and 3D views in the container works fine for 0.17.0. Where should I export the ICD? On the host or inside the container? If inside the container, I don't even have the /usr/share/vulkan path there (but I do on the host).

Divelix avatar Mar 03 '25 15:03 Divelix
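
On the host-versus-container question: the Vulkan loader runs inside the process that uses it, so an ICD override such as the loader's `VK_ICD_FILENAMES` variable has to be set in the container's environment and point at a file that exists in the container. The sketch below, meant to be run inside the container, lists the ICD manifests it can find. The two directories are common loader search paths and are an assumption about the image; `list_icd_files` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every *.json ICD manifest in the given directories.
list_icd_files() {
    for dir in "$@"; do
        for f in "$dir"/*.json; do
            # An unmatched glob stays literal, so skip paths that do not exist.
            if [ -e "$f" ]; then
                echo "$f"
            fi
        done
    done
    return 0
}

# Assumed locations; adjust if the image installs drivers elsewhere.
list_icd_files /usr/share/vulkan/icd.d /etc/vulkan/icd.d
```

A printed path can then be passed to the container with `-e VK_ICD_FILENAMES=<path>` on the `docker run` line. Empty output suggests that no Vulkan driver manifests are installed in the image.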

@Divelix I have just confirmed that the instructions listed in this issue continue to work for me on 0.22.1 (and have updated accordingly).

Are you using that exact Dockerfile and run command?

In general, the best way to get support on this is to include as much information as possible:

  • The host OS you are using (i.e. output of uname -a)
  • The docker version you are using (i.e. output of docker --version)
  • The host OS nvidia driver version (i.e. output of nvidia-smi)
  • The nvidia-container-runtime version you are using
  • A copy of the contents of the Dockerfile
  • The exact docker run invocation you are using

jleibs avatar Mar 04 '25 17:03 jleibs
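
The command outputs requested above can be gathered in one go with a small script. This is a sketch: `report` and `collect_info` are hypothetical helper names, and the Dockerfile and the exact `docker run` invocation still need to be attached by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command if its binary exists, otherwise say so.
report() {
    echo "== $* =="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "(command failed)"
    else
        echo "($1 not found)"
    fi
}

# Collect the version information listed in the comment above.
collect_info() {
    report uname -a
    report docker --version
    report nvidia-smi
    report nvidia-container-runtime --version
}

collect_info
```

Redirecting the output to a file (for example `./collect.sh > report.txt`, with a file name of your choosing) gives something that can be pasted into the issue.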

@jleibs Interesting. I previously had to use prime-select nvidia; after switching to prime-select on-demand, your Dockerfile and run command work fine for both 0.17.0 and 0.22.1, and the GUI starts without errors. It seems that dockerized Rerun requires the integrated GPU for some reason, while Rerun on the host doesn't.

My system:

  • linux ... 6.8.0-52-generic #53~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jan 15 19:18:46 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
  • Docker version 28.0.1, build 068a01e
  • Driver Version: 535.183.01 CUDA Version: 12.2
  • nvidia-container-runtime:
    commit: 9b69590c7428470a72f2ae05f826412976af1395
    spec: 1.2.0

    runc version 1.2.4
    commit: v1.2.4-0-g6c52b3f
    spec: 1.2.0
    go: go1.22.10
    libseccomp: 2.5.3
  • Dockerfile is the same as in your example
  • Run invocation the same as yours

Divelix avatar Mar 05 '25 08:03 Divelix

@jleibs To be clear, your instructions (Dockerfile + run script) work only with prime-select on-demand; with prime-select nvidia they fail with the error below.

Error message for `prime-select nvidia`:
[2025-03-05T08:51:33Z INFO  re_sdk_comms::server] Hosting a SDK server over TCP at 0.0.0.0:9876. Connect with the Rerun logging SDK.
[2025-03-05T08:51:33Z INFO  winit::platform_impl::platform::x11::window] Guessed window scale factor: 1
error: XDG_RUNTIME_DIR not set in the environment.
libEGL warning: DRI3: Screen seems not DRI3 capable
libEGL warning: DRI2: failed to authenticate
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
[2025-03-05T08:51:33Z INFO  egui_wgpu] There were 3 available wgpu adapters: {backend: Vulkan, device_type: IntegratedGpu, name: "Intel(R) UHD Graphics (ADL-S GT1)", driver: "Intel open-source Mesa driver", driver_info: "Mesa 23.2.1-1ubuntu3.1~22.04.3", vendor: 0x8086, device: 0x4688}, {backend: Vulkan, device_type: Cpu, name: "llvmpipe (LLVM 15.0.7, 256 bits)", driver: "llvmpipe", driver_info: "Mesa 23.2.1-1ubuntu3.1~22.04.3 (LLVM 15.0.7)", vendor: 0x10005}, {backend: Gl, device_type: Cpu, name: "llvmpipe (LLVM 15.0.7, 256 bits)", driver: "OpenGL", driver_info: "4.5 (Compatibility Profile) Mesa 23.2.1-1ubuntu3.1~22.04.3", vendor: 0x10005}

    Welcome to Rerun!

    This open source library collects anonymous usage data to
    help the Rerun team improve the library.

    Summary:
    - We only collect high level events about the features used within the Rerun Viewer.
    - The actual data you log to Rerun, such as point clouds, images, or text logs,
      will never be collected.
    - We don't log IP addresses.
    - We don't log your user name, file paths, or any personal identifiable data.
    - Usage data we do collect will be sent to and stored on servers within the EU.

    For more details and instructions on how to opt out, run the command:

      rerun analytics details

    As this is this your first session, _no_ usage data has been sent yet,
    giving you an opportunity to opt-out first if you wish.

    Happy Rerunning!

[2025-03-05T08:51:34Z ERROR winit::platform_impl::platform] X11 error: XError {
        description: "BadMatch (invalid parameter attributes)",
        error_code: 8,
        request_code: 149,
        minor_code: 4,
    }
[2025-03-05T08:51:34Z ERROR winit::platform_impl::platform] X11 error: XError {
        description: "BadMatch (invalid parameter attributes)",
        error_code: 8,
        request_code: 149,
        minor_code: 4,
    }
[2025-03-05T08:51:34Z ERROR winit::platform_impl::platform] X11 error: XError {
        description: "BadMatch (invalid parameter attributes)",
        error_code: 8,
        request_code: 149,
        minor_code: 4,
    }

thread 'main' panicked at 'Failed to call XMapRaised: XError { description: "BadMatch (invalid parameter attributes)", error_code: 8, request_code: 149, minor_code: 4 }'
winit-0.29.15/src/platform_impl/linux/x11/window.rs:1208
stack backtrace:
   6: core::panicking::panic_fmt
             at core/src/panicking.rs:72:14
   7: core::result::unwrap_failed
             at core/src/result.rs:1649:5
   8: eframe::native::epi_integration::EpiIntegration::post_rendering
   9: <eframe::native::wgpu_integration::WgpuWinitApp as eframe::native::winit_integration::WinitApp>::run_ui_and_paint
  10: core::ops::function::impls::<impl core::ops::function::FnMut<A> for &mut F>::call_mut
  11: winit::platform_impl::platform::x11::EventLoop<T>::run_on_demand
  12: eframe::native::run::run_wgpu
  13: eframe::run_native

Troubleshooting Rerun: https://www.rerun.io/docs/getting-started/troubleshooting
Report bugs: https://github.com/rerun-io/rerun/issues

However, while struggling with rerun in a container, I found a way to launch the rerun container under prime-select nvidia by using this Dockerfile (the run script is the same):

FROM nvcr.io/nvidia/pytorch:23.12-py3

RUN apt update && apt install -q -y --no-install-recommends \
    libgtk-3-dev \
    libxkbcommon-x11-0 \
    vulkan-tools

RUN pip install -U pip && pip install \
    rerun-sdk==0.17.0

However, this Dockerfile works only with 0.17.0; if you change it to 0.22.1, it breaks as well.

Error for 0.22.1:
[2025-03-05T09:08:10Z INFO  winit::platform_impl::linux::x11::window] Guessed window scale factor: 1
[2025-03-05T09:08:10Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_surface
[2025-03-05T09:08:10Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_xlib_surface
[2025-03-05T09:08:10Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_xcb_surface
[2025-03-05T09:08:10Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_wayland_surface
[2025-03-05T09:08:10Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_EXT_swapchain_colorspace
[2025-03-05T09:08:10Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_get_physical_device_properties2
error: XDG_RUNTIME_DIR not set in the environment.
libEGL warning: DRI3: Screen seems not DRI3 capable
libEGL warning: DRI2: failed to authenticate
[2025-03-05T09:08:11Z ERROR eframe::native::run] Exiting because of error: app creation error: The GPU/graphics driver is lacking some abilities: Adapter does not support drawing to texture format R32Float. Check the troubleshooting guide at https://rerun.io/docs/getting-started/troubleshooting and consider updating your graphics driver.
Error: app creation error: The GPU/graphics driver is lacking some abilities: Adapter does not support drawing to texture format R32Float. Check the troubleshooting guide at https://rerun.io/docs/getting-started/troubleshooting and consider updating your graphics driver.

Divelix avatar Mar 05 '25 09:03 Divelix

It seems I'm also unable to run it in a Docker container. I'm not sure whether the following console messages are useful, but when I run my (C++) application it prints:

Error: winit EventLoopError: os error at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/winit-0.30.7/src/platform_impl/linux/mod.rs:787: Failed to load one of xlib's shared libraries
[2025-03-29T17:33:07Z WARN  re_sdk_comms::buffered_client] Failed to send message after 3 attempts: Failed to connect to Rerun server at 127.0.0.1:9876: Connection refused (os error 111)
[2025-03-29T17:33:10Z WARN  re_sdk_comms::buffered_client] Dropping messages because tcp client has timed out.
[2025-03-29T17:33:10Z WARN  re_sdk_comms::buffered_client] Dropping messages because tcp client has timed out.
[2025-03-29T17:33:10Z WARN  re_sdk_comms::tcp_client] Tried to flush while TCP stream was still Pending. Data was possibly dropped.

jlack1987 avatar Mar 29 '25 17:03 jlack1987
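
For the "Failed to load one of xlib's shared libraries" error, a first step is to check which X11 client libraries the dynamic linker can see inside the container. The sketch below is an assumption-laden starting point: the library names passed at the bottom are a guess at what the windowing layer tries to load, not a list confirmed in this thread, and `find_missing_libs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read an `ldconfig -p` listing on stdin and print
# each library name given as an argument that does not appear in it.
find_missing_libs() {
    listing="$(cat)"
    for lib in "$@"; do
        case "$listing" in
            *"$lib"*) ;;            # found in the linker cache
            *) echo "$lib" ;;       # not found
        esac
    done
}

# Assumed library names; adjust to whatever your application needs.
if command -v ldconfig >/dev/null 2>&1; then
    ldconfig -p | find_missing_libs libX11.so.6 libXcursor.so.1 libX11-xcb.so.1
fi
```

Any name printed is missing from the linker cache. Installing the packages listed at the top of this issue is the first thing to try.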