rerun 0.22.1 freezes after timeline manipulations

Open azerupi opened this issue 8 months ago • 9 comments

Describe the bug

When I open an example and enthusiastically click around in the timeline to move the cursor to random points, rerun freezes and I have to kill it. I usually use the 2D plot example, but I can reproduce this with other examples as well.

To Reproduce

Steps to reproduce the behavior:

  1. Open 2D plot example and pause it
  2. Click randomly in the timeline at a fast pace
  3. rerun viewer freezes (lockup of the UI)

Note: The steps above seem to reproduce the problem consistently, but they don't seem like the only way for rerun to freeze on me.

Expected behavior

No UI freeze

Screenshots

https://github.com/user-attachments/assets/77cc0a59-f160-4c90-900a-d265f8bb3a30

Desktop (please complete the following information):

  • OS: Ubuntu 24.04.2 LTS

Rerun version

rerun-cli 0.22.1 (default map_view native_viewer web_viewer) [rustc 1.85.0 (4d91de4e4 2025-02-17), LLVM 19.1.7] x86_64-unknown-linux-gnu
Video features: av1 default ffmpeg serde

But also tried on latest master compiled locally (8476a6fabe578e7c3fea81b2775489343e1ced47)

rerun-cli 0.23.0-alpha.1+dev 
x86_64-unknown-linux-gnu
rerun-cli features: map_view nasm native_viewer
rustc 1.84.0 (9fc6b4312 2025-01-07), LLVM 19.1.5

Additional context

Considering this is a pretty crippling bug and that I haven't seen it reported, I'm assuming something in my environment is causing or triggering this. Any tips on how I can debug this are welcome.

azerupi avatar Mar 29 '25 12:03 azerupi

This is very surprising! And indeed, I fail to reproduce.

My best suggestion for tracking this down is to attach a debugger (e.g. gdb) and try to get a stack trace out of it. You can also check out the repository and run cargo run -p rerun-cli to get a debug build, which should have better debug symbols.
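
For readers who want to follow the gdb suggestion, here is a minimal sketch of one way to capture backtraces from all threads of an already-hung process. The function name dump_all_backtraces and the default output file name are assumptions for illustration and are not part of rerun. Attaching to a running process may require elevated privileges, depending on the system's ptrace settings.

```shell
# Dump backtraces of every thread of a running process into a file.
# Usage: dump_all_backtraces <pid> [output-file]
# (hypothetical helper; the name is not part of rerun or gdb)
dump_all_backtraces() {
    # -p attaches to the given PID; -batch runs the -ex commands and then
    # exits, which also detaches from the process again.
    gdb -p "$1" -batch -ex "thread apply all bt" > "${2:-gdb.txt}" 2>&1
}

# Example, commented out so that sourcing this snippet has no side effects:
# dump_all_backtraces "$(pgrep -n rerun)" gdb.txt
```

Run it once the viewer has stopped responding, so that the captured stacks show where each thread is blocked.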

emilk avatar Mar 31 '25 07:03 emilk

Attached is the gdb list and backtrace of all threads after doing Ctrl-C in gdb when rerun starts hanging.

You can also check out the repository and run cargo run -p rerun-cli to get a debug build

I'm unable to build with that command, I get an error Web viewer not found, run pixi run rerun-build-web to build it! but when I run that command it also fails with a bunch of errors saying it can't find a bunch of crates. Is there something specific I should do before building?

pixi run rerun works successfully though.

gdb.txt

azerupi avatar Mar 31 '25 18:03 azerupi

Thanks for that callstack!

Thread 1 (Thread 0x7ffff7e6e400 (LWP 954595) "rerun"):
#0  0x00007ffff7d1b4cd in __GI___poll (fds=0x7fffffff6678, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007ffff7a648ca in ?? () from /lib/x86_64-linux-gnu/libxcb.so.1
#2  0x00007ffff7a666dc in xcb_wait_for_special_event () from /lib/x86_64-linux-gnu/libxcb.so.1
#3  0x00007ffff6776a3c in ?? () from /lib/x86_64-linux-gnu/libGLX_nvidia.so.0
#4  0x00007ffff6758ebb in ?? () from /lib/x86_64-linux-gnu/libGLX_nvidia.so.0
#5  0x00007ffff6759349 in ?? () from /lib/x86_64-linux-gnu/libGLX_nvidia.so.0
#6  0x00007fffc8eb9683 in ?? () from /lib/x86_64-linux-gnu/libnvidia-glcore.so.570.124.04
#7  0x00007fffc9123a5b in ?? () from /lib/x86_64-linux-gnu/libnvidia-glcore.so.570.124.04
#8  0x00007fffc92a67a6 in ?? () from /lib/x86_64-linux-gnu/libnvidia-glcore.so.570.124.04
#9  0x00007ffff6773aa0 in ?? () from /lib/x86_64-linux-gnu/libGLX_nvidia.so.0
#10 0x0000555556fa5e68 in <wgpu_hal::vulkan::Queue as wgpu_hal::Queue>::present ()
#11 0x0000555556f42f79 in <Q as wgpu_hal::dynamic::queue::DynQueue>::present ()
#12 0x0000555556ed6518 in wgpu_core::present::<impl wgpu_core::instance::Surface>::present ()
#13 0x0000555556ea17b7 in wgpu_core::present::<impl wgpu_core::global::Global>::surface_present ()
#14 0x0000555556e84226 in wgpu::api::surface_texture::SurfaceTexture::present ()
#15 0x0000555556c142bf in egui_wgpu::winit::Painter::paint_and_update_textures ()
#16 0x000055555693fd14 in <eframe::native::wgpu_integration::WgpuWinitApp as eframe::native::winit_integration::WinitApp>::run_ui_and_paint ()
#17 0x00005555569232cc in eframe::native::event_loop_context::with_event_loop_context ()
#18 0x00005555569a17e8 in <eframe::native::run::WinitAppWrapper<T> as winit::application::ApplicationHandler<eframe::native::winit_integration::UserEvent>>::window_event ()
#19 0x0000555556906ff2 in core::ops::function::impls::<impl core::ops::function::FnMut<A> for &mut F>::call_mut ()
#20 0x000055555698ce08 in winit::platform_impl::linux::x11::EventLoop<T>::run_on_demand ()
#21 0x00005555569a2162 in eframe::native::run::run_wgpu ()
#22 0x000055555695079b in eframe::run_native ()
#23 0x0000555555bcffb8 in re_viewer::native::run_native_app ()
#24 0x0000555555a5fb10 in rerun::commands::entrypoint::run_impl ()
#25 0x0000555555a3e38d in rerun::commands::entrypoint::run ()
#26 0x0000555555a3f2ee in rerun::main ()
#27 0x0000555555a3fdf3 in std::sys::backtrace::__rust_begin_short_backtrace ()
#28 0x0000555555a3edb9 in std::rt::lang_start::{{closure}} ()
#29 0x000055555891f297 in std::rt::lang_start_internal ()
#30 0x0000555555a3f625 in main ()

This is happening deep down in the nvidia drivers after a call to present(). I suspect a driver bug. What do you think @Wumpf?

emilk avatar Apr 01 '25 07:04 emilk

Agreed, this smells like a driver bug or, less likely, a remaining issue in the wgpu Vulkan sync behavior, which was reworked last year. But I haven't found any issues reported on wgpu about this.

@azerupi What nvidia driver version are you running (the callstack indicates Nvidia)? Is your machine multi-gpu? Other things to try out to see what happens:

  • disabling wayland with WAYLAND_DISPLAY=
  • run with --renderer gl to enforce the fallback opengl renderer

Background information I dug up:

vkQueuePresentKHR - which is likely where we're stuck here - is according to spec allowed to block inside the driver. However this is generally unexpected, as all the presentation related waiting should be happening inside get_current_texture (that's acquire_texture in wgpu-hal). The present call itself gets passed in semaphores signaled for the last queue submission using the relevant surface, but that is supposed to be all happening asynchronously: each submission flags a semaphore when it's done. When the last submission using the surface that we want to present in vkQueuePresentKHR is done, the present is allowed to happen, all in parallel to the main application.

Wumpf avatar Apr 01 '25 10:04 Wumpf

Thank you so much for looking into this!

What nvidia driver version are you running

cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  570.124.04  Tue Feb 25 04:12:12 UTC 2025
GCC version:  gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)

Is your machine multi-gpu?

If you mean multiple discrete GPU cards, then no. If you mean both integrated graphics and discrete gpu card, then yes. It is a HP ZBook with the intel integrated graphics and a nvidia rtx 4070 mobile.

lspci -k | grep -EA3 'VGA|3D|Display'
00:02.0 VGA compatible controller: Intel Corporation Raptor Lake-P [Iris Xe Graphics] (rev 04)
        DeviceName: Onboard IGD
        Subsystem: Hewlett-Packard Company Raptor Lake-P [Iris Xe Graphics]
        Kernel driver in use: i915
--
01:00.0 VGA compatible controller: NVIDIA Corporation AD106M [GeForce RTX 4070 Max-Q / Mobile] (rev a1)
        Subsystem: Hewlett-Packard Company AD106M [GeForce RTX 4070 Max-Q / Mobile]
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

disabling wayland with WAYLAND_DISPLAY=

I'm actually not running wayland. I believe I disabled it after some issues with other tools. echo $WAYLAND_DISPLAY returns nothing. echo "$XDG_SESSION_TYPE" returns x11.
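
The two checks above can be combined into one small helper. This is only a sketch: the function name session_kind is hypothetical, and falling back to DISPLAY when XDG_SESSION_TYPE is unset is an assumption that may not hold on every desktop setup.

```shell
# Print "wayland", "x11", or "unknown" for the current desktop session.
# (hypothetical helper for illustration)
session_kind() {
    if [ -n "${WAYLAND_DISPLAY:-}" ] || [ "${XDG_SESSION_TYPE:-}" = "wayland" ]; then
        echo "wayland"
    elif [ "${XDG_SESSION_TYPE:-}" = "x11" ] || [ -n "${DISPLAY:-}" ]; then
        echo "x11"
    else
        echo "unknown"
    fi
}

# Example, commented out so that sourcing this snippet prints nothing:
# session_kind
```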

run with --renderer gl to enforce the fallback opengl renderer

That actually doesn't work either. I'm getting the following error:

rerun --renderer gl
[2025-04-01T13:56:04Z INFO  winit::platform_impl::linux::x11::window] Guessed window scale factor: 1
[2025-04-01T13:56:04Z ERROR eframe::native::run] Exiting because of error: app creation error: The GPU/graphics driver is lacking some abilities: Adapter does not support drawing to texture format R32Float. Check the troubleshooting guide at https://rerun.io/docs/getting-started/troubleshooting and consider updating your graphics driver.
Error: app creation error: The GPU/graphics driver is lacking some abilities: Adapter does not support drawing to texture format R32Float. Check the troubleshooting guide at https://rerun.io/docs/getting-started/troubleshooting and consider updating your graphics driver.

I also tried exporting VK_ICD_FILENAMES as explained in https://rerun.io/docs/getting-started/troubleshooting but in both cases (integrated vs nvidia) rerun doesn't launch and I get the same error:

$ rerun
[2025-04-01T14:02:11Z INFO  winit::platform_impl::linux::x11::window] Guessed window scale factor: 1
[2025-04-01T14:02:11Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_surface
[2025-04-01T14:02:11Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_xlib_surface
[2025-04-01T14:02:11Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_xcb_surface
[2025-04-01T14:02:11Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_wayland_surface
[2025-04-01T14:02:11Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_EXT_swapchain_colorspace
[2025-04-01T14:02:11Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_get_physical_device_properties2
[2025-04-01T14:02:11Z ERROR eframe::native::run] Exiting because of error: app creation error: The GPU/graphics driver is lacking some abilities: Adapter does not support drawing to texture format R32Float. Check the troubleshooting guide at https://rerun.io/docs/getting-started/troubleshooting and consider updating your graphics driver.
Error: app creation error: The GPU/graphics driver is lacking some abilities: Adapter does not support drawing to texture format R32Float. Check the troubleshooting guide at https://rerun.io/docs/getting-started/troubleshooting and consider updating your graphics driver.

azerupi avatar Apr 01 '25 14:04 azerupi

thanks for following up with all that detailed information! The mixed GPU setup is likely relevant, there tend to be finicky things around presentation APIs in these setups 🤔

GL not working is interesting, might need a mesa update? Not too familiar with the ecosystem there, confuses me every time :/. The error printed out is from our side and indicates that the GL driver exposes a very limited/outdated feature set - drawing to float textures is supported virtually everywhere for over a decade. I'm surprised this happens on Ubuntu 24, thought this ships with everything new enough. But a little bit beside the point as from our pov GL is more of a fallback...

Recently a new nvidia driver released, 570.133.07 and I found mentions that it resolves hangs, but I haven't found a full changelog for it -.-. But sounds like that's worth a shot.

When you tried with the icd files, did you get the error both for your intel and nvidia gpus? The error sounds like it ended up picking GL because it didn't find needed Vulkan extensions (from the looks of that output snippet that Vulkan driver didn't provide anything for outputting things to the screen/compositor).

Wumpf avatar Apr 01 '25 15:04 Wumpf

thanks for following up with all that detailed information!

No need to thank me! You are giving help for an issue that is most likely on my end, it is only right for me to provide you with the requested information.

When you tried with the icd files, did you get the error both for your intel and nvidia gpus? The error sounds like it ended up picking GL because it didn't find needed Vulkan extensions (from the looks of that output snippet that Vulkan driver didn't provide anything for outputting things to the screen/compositor).

Yes, same output

Intel Graphics

$ export VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/intel.json

$ rerun --version
rerun-cli 0.22.1 (default map_view native_viewer web_viewer) [rustc 1.85.0 (4d91de4e4 2025-02-17), LLVM 19.1.7] x86_64-unknown-linux-gnu
Video features: av1 default ffmpeg serde

$ rerun
[2025-04-01T16:36:36Z INFO  winit::platform_impl::linux::x11::window] Guessed window scale factor: 1
[2025-04-01T16:36:36Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_surface
[2025-04-01T16:36:36Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_xlib_surface
[2025-04-01T16:36:36Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_xcb_surface
[2025-04-01T16:36:36Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_wayland_surface
[2025-04-01T16:36:36Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_EXT_swapchain_colorspace
[2025-04-01T16:36:36Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_get_physical_device_properties2
[2025-04-01T16:36:36Z ERROR eframe::native::run] Exiting because of error: app creation error: The GPU/graphics driver is lacking some abilities: Adapter does not support drawing to texture format R32Float. Check the troubleshooting guide at https://rerun.io/docs/getting-started/troubleshooting and consider updating your graphics driver.
Error: app creation error: The GPU/graphics driver is lacking some abilities: Adapter does not support drawing to texture format R32Float. Check the troubleshooting guide at https://rerun.io/docs/getting-started/troubleshooting and consider updating your graphics driver.

Nvidia

$ export VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia.json

$ rerun --version
rerun-cli 0.22.1 (default map_view native_viewer web_viewer) [rustc 1.85.0 (4d91de4e4 2025-02-17), LLVM 19.1.7] x86_64-unknown-linux-gnu
Video features: av1 default ffmpeg serde

$ rerun
[2025-04-01T16:38:52Z INFO  winit::platform_impl::linux::x11::window] Guessed window scale factor: 1
[2025-04-01T16:38:52Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_surface
[2025-04-01T16:38:52Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_xlib_surface
[2025-04-01T16:38:52Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_xcb_surface
[2025-04-01T16:38:52Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_wayland_surface
[2025-04-01T16:38:52Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_EXT_swapchain_colorspace
[2025-04-01T16:38:52Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_KHR_get_physical_device_properties2
[2025-04-01T16:38:52Z ERROR eframe::native::run] Exiting because of error: app creation error: The GPU/graphics driver is lacking some abilities: Adapter does not support drawing to texture format R32Float. Check the troubleshooting guide at https://rerun.io/docs/getting-started/troubleshooting and consider updating your graphics driver.
Error: app creation error: The GPU/graphics driver is lacking some abilities: Adapter does not support drawing to texture format R32Float. Check the troubleshooting guide at https://rerun.io/docs/getting-started/troubleshooting and consider updating your graphics driver.

Recently a new nvidia driver released, 570.133.07 and I found mentions that it resolves hangs, but I haven't found a full changelog for it -.-. But sounds like that's worth a shot.

Agreed, but that version isn't available yet as a package as far as I can see and considering this is my work laptop I would prefer not messing around with installing the drivers manually 🙂

If there is nothing else I can try in the meantime, we can put this on hold and I can report back when the driver becomes available to me.

azerupi avatar Apr 01 '25 16:04 azerupi

huh the behavior with the icd files is quite strange! At least one of the VK_ICD_FILENAMES overrides should behave like not specifying VK_ICD_FILENAMES. In a nutshell this is just telling the vulkan loader where to look for drivers. Maybe worth checking what files you have in /usr/share/vulkan/icd.d/ - I figure one of them should at least show the reported hanging behavior rather than not working at all.

Wumpf avatar Apr 01 '25 18:04 Wumpf

Yes, my bad. The files are actually called nvidia_icd.json and intel_icd.x86_64.json instead of just intel.json and nvidia.json. Might be worth mentioning in the docs that the files could be named slightly differently for people like me that blindly copy-paste the commands 😉
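
Since the manifest names differ between systems, one option is to look them up instead of typing them by hand. The following is a sketch only: the function name find_icd is hypothetical, and it assumes the manifests live in /usr/share/vulkan/icd.d as they do in this thread (other distributions may use a different directory).

```shell
# List Vulkan ICD manifests whose file name contains a vendor string.
# Usage: find_icd <vendor> [dir]
# (hypothetical helper; the default directory is the one mentioned in this thread)
find_icd() {
    icd_dir="${2:-/usr/share/vulkan/icd.d}"
    for f in "$icd_dir"/*"$1"*.json; do
        # An unmatched glob stays literal, so only print paths that exist.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Example, commented out so that sourcing this snippet changes nothing:
# export VK_ICD_FILENAMES="$(find_icd nvidia | head -n 1)"
```

If the lookup prints nothing, no manifest for that vendor is present in the directory, and setting VK_ICD_FILENAMES to such a name would leave the Vulkan loader without a driver.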

As expected, I am able to reproduce the slowdown with the nvidia file and not with the intel one. Let's see if a nvidia driver update in the next weeks fixes the issue. I don't want to waste more of your time if the issue is indeed on the nvidia driver side.

azerupi avatar Apr 01 '25 20:04 azerupi

After updating the NVIDIA driver to version 570.144, it looks like the issue is fixed. Thank you very much for the support!
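
For anyone who lands here with the same symptom, a rough way to compare the installed driver against the version reported above is sketched below. Both function names are hypothetical. The parsing assumes the NVRM line format shown earlier in this thread, and the comparison assumes a sort that supports -V (as GNU coreutils does). That 570.144 is a sufficient version is only what this one report suggests, not a confirmed threshold.

```shell
# Extract the driver version from the NVRM line of /proc/driver/nvidia/version.
# Usage: nvidia_driver_version [file]
nvidia_driver_version() {
    sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p' \
        "${1:-/proc/driver/nvidia/version}"
}

# Succeed when version $1 is greater than or equal to version $2.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example, commented out so that sourcing this snippet has no side effects:
# version_at_least "$(nvidia_driver_version)" 570.144 && echo "driver is 570.144 or newer"
```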

azerupi avatar Apr 27 '25 17:04 azerupi

Thanks for reporting back on this!

Wumpf avatar Apr 27 '25 22:04 Wumpf