wgpu Vulkan surface creation fails on wayland

Description

wgpu does not run when Wayland is detected (that is, when WAYLAND_DISPLAY is not empty). Some applications work if they use the gl backend; for example, ruffle is fine with this, but the wgpu examples are not.

Repro steps

Using mipmap here instead of hello_triangle, because hello_triangle seems to force Vulkan.

succeeds:

WAYLAND_DISPLAY= cargo run --bin wgpu-examples mipmap

fails:

cargo run --bin wgpu-examples mipmap

Expected vs observed behavior

expected: opened window with triangle on screen

observed: crash before window opens, with various errors

mipmap vulkan:

thread 'main' panicked at examples/src/framework.rs:188:14:
Surface isn't supported by the adapter.

mipmap opengl:

[2024-09-25T01:13:14Z ERROR wgpu_hal::gles::egl] EGL 'eglMakeCurrent' code 0x3008: eglMakeCurrent
thread 'main' panicked at wgpu-hal/src/gles/egl.rs:298:14:
called `Result::unwrap()` on an `Err` value: BadDisplay
stack backtrace:
...

hello_triangle

[2024-09-25T01:03:53Z ERROR wgpu_hal::vulkan::adapter] get_physical_device_surface_present_modes: ERROR_SURFACE_LOST_KHR
[2024-09-25T01:03:53Z ERROR wgpu_hal::vulkan::adapter] get_physical_device_surface_formats: ERROR_SURFACE_LOST_KHR
thread 'main' panicked at wgpu/src/backend/wgpu_core.rs:722:18:
Error in Surface::configure: Validation Error

Caused by:
  Requested present mode Mailbox is not in the list of supported present modes: []

Extra materials

A bunch of logs of various configurations (vulkan, opengl, x11, wayland, hello_triangle, mipmap), made with:

[WAYLAND_DISPLAY=] [WGPU_BACKEND=gl] RUST_LOG=trace cargo run --profile dev --bin wgpu-examples (hello_triangle|mipmap) 2>&1 | cat >log-name.log

triangle-vulkan-x11.log triangle-vulkan-wayland.log mipmap-vulkan-x11.log mipmap-vulkan-wayland.log mipmap-opengl-x11.log mipmap-opengl-wayland.log

Related issue: https://github.com/ruffle-rs/ruffle/issues/17948

Platform

Dell G16 7620 laptop with i7 12700H and NVIDIA RTX 3060
Arch Linux (kernel 6.10.10-arch1-1)
wgpu 859dd8817e7484b51823d443d7cac93c6e9a7ef2
"Intel(R) Graphics (ADL GT2)" and "NVIDIA GeForce RTX 3060 Laptop GPU" in Optimus configuration (using Prime render offload)
mesa version 24.2.3
KDE Plasma 6.1.5
egl-wayland 1.16
NVIDIA-SMI version : 560.35.03
NVML version : 560.35
DRIVER version : 560.35.03
CUDA Version : 12.6

Sep 25 '24 01:09 JL2210

Thanks for all the logs on those variations! This may be related to the hybrid setup. I know that for some folks Vulkan on Wayland works just fine and for others it doesn't.

I read somewhere that the newer Nvidia drivers (v555+) may fix some of those issues, so that's worth a shot.

Sep 25 '24 07:09 Wumpf

~This sounds like a duplicate of~ The EGL error is covered in https://github.com/gfx-rs/wgpu/issues/5505, where we have already outlined some of the wrongdoings inside the gles module.

EDIT: Can we fix the surfaec -> surface typo in the title? And assign some useful labels to #5505?

Sep 25 '24 08:09 MarijnS95

> Thanks for all the logs on those variations! May be related to the hybrid setup. I know that for some folks Vulkan on Wayland works just fine and for others it doesn't.
>
> I read somewhere that the newer Nvidia drivers (v555+) may fix some of those issues, worth a shot

Somehow forgot to mention this, I'm on 560:

NVIDIA-SMI version  : 560.35.03
NVML version        : 560.35
DRIVER version      : 560.35.03
CUDA Version        : 12.6

Sep 25 '24 18:09 JL2210

Unsure if related, but I have tracked down one instance of this error:

wp_linux_drm_syncobj_manager_v1#62: error 0: surface already exists

followed by:

internal error: entered unreachable code: Fallback system failed to choose present mode. This is a bug. Mode: AutoVsync, Options: []

It turns out that neither wgpu nor winit was detecting that instance.create_surface(&window) was being called twice without dropping the first surface in between, which led to broken internals. (At least that was the case in the past: I found this out while updating winit, and I haven't updated wgpu yet.)

This might have less to do with Wayland and more to do with how winit's Event::Resumed works (in our case, we had an instance.create_surface(&window) call in its handler, without dropping any existing surfaces first).
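The double-creation pattern described above can be modeled without wgpu or winit. The following is only a sketch under stated assumptions: MockWindow, MockSurface, App, and both handler names are hypothetical stand-ins (not wgpu or winit types), and the model reports the conflict immediately at creation time, whereas the real failure reportedly only shows up later, when the surface is configured.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for a window backed by a single wl_surface.
/// It only tracks whether a presentation surface is currently attached.
#[derive(Default)]
struct MockWindow {
    attached: Rc<Cell<bool>>,
}

/// Hypothetical stand-in for wgpu::Surface; it detaches itself on drop.
struct MockSurface {
    attached: Rc<Cell<bool>>,
}

impl Drop for MockSurface {
    fn drop(&mut self) {
        self.attached.set(false);
    }
}

impl MockWindow {
    /// Models the "surface already exists" error: a second surface cannot
    /// be attached while the first one is still alive.
    fn create_surface(&self) -> Result<MockSurface, &'static str> {
        if self.attached.replace(true) {
            return Err("surface already exists");
        }
        Ok(MockSurface { attached: Rc::clone(&self.attached) })
    }
}

struct App {
    window: MockWindow,
    surface: Option<MockSurface>,
}

impl App {
    /// Problematic handler: the new surface is created while the old one
    /// is still stored in `self.surface`.
    fn on_resumed_without_drop(&mut self) -> Result<(), &'static str> {
        self.surface = Some(self.window.create_surface()?);
        Ok(())
    }

    /// Workaround: drop any existing surface first, then create a new one.
    fn on_resumed(&mut self) -> Result<(), &'static str> {
        drop(self.surface.take());
        self.surface = Some(self.window.create_surface()?);
        Ok(())
    }
}

fn main() {
    let mut app = App { window: MockWindow::default(), surface: None };
    assert!(app.on_resumed_without_drop().is_ok());
    // A second resume without dropping the first surface fails in this model.
    assert!(app.on_resumed_without_drop().is_err());
    // Dropping first makes repeated resumes safe.
    assert!(app.on_resumed().is_ok());
    assert!(app.on_resumed().is_ok());
}
```

In a real application the equivalent change would be to release the stored surface before calling instance.create_surface(&window) again in the resume handler.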

I tested with:

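// Deliberately create a second surface for the same window while the first one is still alive.
let evil_surface = instance.create_surface(&window).unwrap();
// Configuring it appears to be required, since the failure happens at swapchain creation.
evil_surface.configure(&device, &evil_surface.get_default_config(&adapter, 1, 1).unwrap());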

(just after the device is created)

And it does break Wayland, but not X. Sometimes I get this:

[destroyed object]: error 0: surface already exists

(presumably when I do drop the second surface while evil_surface remains live - but it has the same shape as the wp_linux_drm_syncobj_manager_v1 error so I suspect it's also from wayland-client?)

By enabling RUST_LOG=wgpu_hal I also get:

[2024-12-16T03:25:00Z ERROR wgpu_hal::vulkan::adapter] get_physical_device_surface_present_modes: ERROR_SURFACE_LOST_KHR
[2024-12-16T03:25:00Z ERROR wgpu_hal::vulkan::adapter] get_physical_device_surface_formats: ERROR_SURFACE_LOST_KHR
[2024-12-16T03:25:00Z ERROR wgpu_hal::vulkan::adapter] get_physical_device_surface_present_modes: ERROR_SURFACE_LOST_KHR
[2024-12-16T03:25:00Z ERROR wgpu_hal::vulkan::adapter] get_physical_device_surface_formats: ERROR_SURFACE_LOST_KHR

In the triangle-vulkan-wayland.log provided by @JL2210, I can see the same pattern: ERROR_SURFACE_LOST_KHR (after a Wayland error) which doesn't seem to lead to automatic invalidation of the wgpu Surface.
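One way an application could guard against this, sketched here with hypothetical types: treat an empty list of supported present modes (the `[]` in the panics above) as a sign that the surface is lost, instead of passing a configuration on to be rejected. PresentMode, SurfaceCheck, and check_surface are illustrative names for this sketch only and are not wgpu's API; in real code the list would presumably come from the surface's reported capabilities.

```rust
/// Hypothetical, reduced set of present modes for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PresentMode {
    Fifo,
    Mailbox,
}

/// Outcome of inspecting what the surface claims to support.
#[derive(Debug, PartialEq, Eq)]
enum SurfaceCheck {
    /// The surface looks usable with this present mode.
    Usable(PresentMode),
    /// Nothing is supported, so the surface should be dropped and recreated.
    Lost,
}

/// Picks the preferred mode when available, falls back to the first
/// supported mode otherwise, and reports `Lost` for an empty list.
fn check_surface(preferred: PresentMode, supported: &[PresentMode]) -> SurfaceCheck {
    match supported {
        [] => SurfaceCheck::Lost,
        modes if modes.contains(&preferred) => SurfaceCheck::Usable(preferred),
        [first, ..] => SurfaceCheck::Usable(*first),
    }
}

fn main() {
    // An empty list, as in the logs above, is reported as a lost surface.
    assert_eq!(check_surface(PresentMode::Mailbox, &[]), SurfaceCheck::Lost);
    // Preferred mode not offered: fall back to what is available.
    assert_eq!(
        check_surface(PresentMode::Mailbox, &[PresentMode::Fifo]),
        SurfaceCheck::Usable(PresentMode::Fifo)
    );
    // Preferred mode offered: use it.
    assert_eq!(
        check_surface(PresentMode::Mailbox, &[PresentMode::Fifo, PresentMode::Mailbox]),
        SurfaceCheck::Usable(PresentMode::Mailbox)
    );
}
```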

I can't tell for sure from the Vulkan WSI docs for Wayland, but it's possible each wl_surface may only have at most one VkSurfaceKHR attached to it?

Looking at the wp_linux_drm_syncobj_manager_v1::get_surface docs, it certainly looks like this is only supported once per surface:

If the given wl_surface already has an explicit synchronization object associated, the surface_exists protocol error is raised.

However, I think I've also found a bug in Mesa? A relevant bit of code does goto fail (if that get_surface request fails), without setting result, which will remain VK_SUCCESS from an earlier assignment.

(Also, all of this only happens at "create swapchain" time, not initial "create surface" time, which probably explains why configuring a surface is required)
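To make the suspected bug pattern concrete, here is a Rust analogue of a C-style "goto fail" path that forgets to set its result. This is not Mesa's code: the function names, the VkResult enum, and the choice of ErrorInitializationFailed as the error value are assumptions made for illustration.

```rust
/// Hypothetical, reduced result type for this sketch.
#[derive(Debug, PartialEq)]
enum VkResult {
    Success,
    ErrorInitializationFailed,
}

/// Buggy shape: the failure path leaves `result` at the value assigned
/// by an earlier, successful step.
fn create_swapchain_buggy(get_surface_ok: bool) -> VkResult {
    // An earlier step succeeded and stored its status here.
    let result = VkResult::Success;
    'fail: {
        if !get_surface_ok {
            // Bug: jumps to cleanup without recording the failure.
            break 'fail;
        }
        // The remaining setup would run here.
    }
    // Cleanup would run here; the caller then sees `result`.
    result
}

/// Fixed shape: the failure path records an error before bailing out.
fn create_swapchain_fixed(get_surface_ok: bool) -> VkResult {
    let mut result = VkResult::Success;
    'fail: {
        if !get_surface_ok {
            result = VkResult::ErrorInitializationFailed;
            break 'fail;
        }
        // The remaining setup would run here.
    }
    result
}

fn main() {
    // The buggy version reports success even though the request failed.
    assert_eq!(create_swapchain_buggy(false), VkResult::Success);
    // The fixed version propagates the failure.
    assert_eq!(create_swapchain_fixed(false), VkResult::ErrorInitializationFailed);
    // Both agree when nothing goes wrong.
    assert_eq!(create_swapchain_buggy(true), VkResult::Success);
    assert_eq!(create_swapchain_fixed(true), VkResult::Success);
}
```

A caller of the buggy version would believe the swapchain was created, which would be consistent with the errors only surfacing later as ERROR_SURFACE_LOST_KHR.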

EDIT: opened Mesa issue:

https://gitlab.freedesktop.org/mesa/mesa/-/issues/12316

At this point I still don't know whose responsibility it is to prevent more than one wgpu::Surface per wl_surface from existing (or deduplicating behind the scenes, but that would probably behave very strangely).

I'll go at least report the Mesa bug.

Dec 16 '24 03:12 eddyb

~~If wgpu's own examples fail on Wayland, then this is to blame:~~ https://github.com/gfx-rs/wgpu/blob/3cc63af1a5b13bbf4c133d648d501c1016b9c591/examples/src/framework.rs#L178

~~Doing drop(self.surface.take()); first might fix it, though I'm not 100% sure of all the interactions here or why that would work.~~

~~Looking more at Mesa's surface destruction logic, I suppose wp_linux_drm_syncobj_surface_v1::destroy being invoked "releases" the wl_surface for reuse (as in, the next wp_linux_drm_syncobj_manager_v1::get_surface call will probably succeed, creating a new object, and attaching it until it itself is destroyed etc.).~~

EDIT: never mind, the wgpu examples work fine for me. So there might still be an Nvidia driver bug for @JL2210?

Although, I can reproduce one potential issue:

$ RUST_LOG=wgpu_core::device::global=debug cargo -q run --bin wgpu-examples cube
[2024-12-16T05:19:25Z INFO  wgpu_examples::framework] Initializing wgpu...
[2024-12-16T05:19:25Z INFO  wgpu_core::instance] Adapter AdapterInfo { name: "AMD Radeon RX Vega (RADV VEGA10)", vendor: 4098, device: 26751, device_type: DiscreteGpu, driver: "radv", driver_info: "Mesa 24.2.6", backend: Vulkan }
[2024-12-16T05:19:25Z INFO  wgpu_examples::framework] Using AMD Radeon RX Vega (RADV VEGA10) (Vulkan)
[2024-12-16T05:19:25Z INFO  wgpu_examples::framework] Entering event loop...
[2024-12-16T05:19:25Z INFO  wgpu_examples::framework] Surface resume PhysicalSize { width: 1200, height: 900 }
[2024-12-16T05:19:25Z DEBUG wgpu_core::device::global] configuring surface with SurfaceConfiguration { usage: TextureUsages(RENDER_ATTACHMENT), format: Rgba8UnormSrgb, width: 1200, height: 900, present_mode: Mailbox, desired_maximum_frame_latency: 2, alpha_mode: Auto, view_formats: [Rgba8UnormSrgb] }
[2024-12-16T05:19:25Z INFO  wgpu_examples::framework] Surface resize PhysicalSize { width: 1200, height: 900 }
[2024-12-16T05:19:25Z DEBUG wgpu_core::device::global] configuring surface with SurfaceConfiguration { usage: TextureUsages(RENDER_ATTACHMENT), format: Rgba8UnormSrgb, width: 1200, height: 900, present_mode: Mailbox, desired_maximum_frame_latency: 2, alpha_mode: Auto, view_formats: [Rgba8UnormSrgb] }

~~Are those different surfaces or the same one?~~ (EDIT: looks like same surface, resume+resize)

~~wgpu::Surface::configure's usage of Vulkan APIs might be more broken for Nvidia than Mesa.~~ (EDIT: that was pointless speculation, too: looks like @JL2210's Nvidia driver just doesn't support Wayland, and some of the logs involve the Intel driver instead)

Dec 16 '24 04:12 eddyb

@JL2210 I don't see anyone requesting it, but could you share vulkaninfo output? That would clearly show what's supported and what isn't.

Based on vulkan.gpuinfo.org's list of reports lacking VK_KHR_wayland_surface, the most recent Nvidia driver which lacks it (for your GPU model) seems to be 566.3.0.0, though it does appear to be present in older versions (what is going on?!).

Probably best to check locally what's being reported by vulkaninfo, anyway.
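A small sketch of that local check, assuming vulkaninfo is on PATH and that its text output lists each extension on its own line with the extension name as the first token (an assumption about the output format; lists_extension is a hypothetical helper, not part of any library):

```rust
use std::process::Command;

/// Returns true if some line of the given vulkaninfo text output starts
/// (after leading whitespace) with exactly the given extension name.
fn lists_extension(vulkaninfo_output: &str, extension: &str) -> bool {
    vulkaninfo_output
        .lines()
        .any(|line| line.split_whitespace().next() == Some(extension))
}

fn main() {
    // Run the real tool if it is installed; otherwise just report the error.
    match Command::new("vulkaninfo").output() {
        Ok(out) => {
            let text = String::from_utf8_lossy(&out.stdout);
            println!(
                "VK_KHR_wayland_surface listed: {}",
                lists_extension(&text, "VK_KHR_wayland_surface")
            );
        }
        Err(err) => eprintln!("could not run vulkaninfo: {err}"),
    }
}
```

Comparing the whole first token, rather than searching for a substring, avoids matching a longer extension name that merely starts with the same text.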

Dec 16 '24 06:12 eddyb

While likely different from any issues with wgpu's own examples (like what @JL2210 was reporting originally), here is the PR fixing the wgpu::Surface misuse on Wayland that I ran into (in Rust-GPU's wgpu-based example):

https://github.com/Rust-GPU/rust-gpu/pull/181

Dec 17 '24 20:12 eddyb