wgpu [hal/vk] Rework Submission and Surface Synchronization

Connections

Closes #5559
Closes #5693
Maybe #4919
Maybe #5374
Maybe #5491
Probably not #3452.

Description

As described in #5559, our submission and surface synchronization was totally messed up. I've revamped it, made smarter classes to help keep all the bookkeeping straight, and generally make the code cleaner and easier to understand.

Testing

works on my machine lel

May 08 '24 22:05 cwfitzgerald

Works fine for me running the cube example with VK validation layers enabled

[2024-05-09T06:03:33Z INFO  wgpu_core::instance] Adapter Vulkan AdapterInfo { name: "Intel(R) Arc(TM) Graphics", vendor: 32902, device: 32085, device_type: IntegratedGpu, driver: "Intel Corporation", driver_info: "101.5382", backend: Vulkan }

May 09 '24 06:05 JMS55

This fixes an annoying access violation termination (that I thought was a driver bug) on program exit after a present.

May 09 '24 07:05 Vecvec

Unfortunately, it does not fix #5644 for me. Same constant validation error on presentation:

[2024-05-09T08:16:24Z ERROR wgpu_hal::vulkan::instance] VALIDATION [VUID-vkAcquireNextImageKHR-semaphore-01779 (0x5717e75b)]
        Validation Error: [ VUID-vkAcquireNextImageKHR-semaphore-01779 ] Object 0: handle = 0xba7514000000002a, type = VK_OBJECT_TYPE_SEMAPHORE; | MessageID = 0x5717e75b | vkAcquireNextImageKHR():  Semaphore must not have any pending operations. The Vulkan spec states: If semaphore is not VK_NULL_HANDLE it must not have any uncompleted signal or wait operations pending (https://www.khronos.org/registry/vulkan/specs/1.3-extensions/html/vkspec.html#VUID-vkAcquireNextImageKHR-semaphore-01779)
[2024-05-09T08:16:24Z ERROR wgpu_hal::vulkan::instance]         objects: (type: SEMAPHORE, hndl: 0xba7514000000002a, name: ?)

The actual validation error only happens with a debug build, although release feels much slower than 0.19 as well. I believe something is very slow and the errors actually only show up in debug mode.

Flamegraph of cube seems to indicate this—with present taking a huge amount of time:

flamegraph

For comparison, same drivers with 0.19:

flamegraph

Even if I enable Immediate present mode, frames seem to take twice as much time to be rendered with 0.20.

May 09 '24 08:05 hecrj

@hecrj could you try again with these latest commits, someone else who can reproduce the validation error says that this fixed it.

May 10 '24 23:05 cwfitzgerald

@cwfitzgerald Yep! The latest changes fix the validation errors. Thanks!

I'm still noticing a bit of worse performance in debug mode, but it doesn't seem to be there in release mode anymore. Flamegraphs look the same as before, but I imagine it's expected since now synchronization works properly and there may be more waiting. No errors in any case!

May 11 '24 10:05 hecrj

I'm actually rewriting these comments as I review so if you're not attached to them then don't worry too much about fixing anything.

May 14 '24 01:05 jimblandy

Yeah that's fine - I was just trying to get something there.

May 14 '24 02:05 cwfitzgerald

@cwfitzgerald What do you think of this? https://github.com/gfx-rs/wgpu/commit/da543fb51

My hope was that this would make the states of RelaySemaphores a bit clearer to a new reader. If people should not use wait unless should_wait is true, then let's actually force people to check that condition before they can even get at the value: classic Option. But it does make advance fallible and require us to pass the device.

[edit: made this a little nicer still: RelaySemaphoreState is no longer necessary]

May 17 '24 20:05 jimblandy

@cwfitzgerald Here's an even more radical suggestion: https://github.com/gfx-rs/wgpu/commit/b5d52fed45e4f89213e6fe5870e99126865711aa

Just use a single semaphore for all queue ordering.

May 17 '24 21:05 jimblandy

@jimblandy i think we can go for it, the only question is if we want to still avoid this deadlock in anv mentioned in the old code, it if we consider that old enough to ignore.

/// It would be correct to use a single semaphore there, but 
/// [Intel hangs in `anv_queue_finish`](https://gitlab.freedesktop.org/mesa/mesa/-/issues/5508).

I should have perseved that comment somehow.

May 17 '24 21:05 cwfitzgerald

@cwfitzgerald:

I should have perseved that comment somehow.

Ahh, yes, I was reviewing by checking out the branch and just wandering around with rust-analyzer, so I didn't notice that the patch removed the comment. If that's the sole rationale for the entire semaphore-swapping doohickey then we had definitely better have the comment.

Kvark filed that bug in Oct 2021. That's not that long ago, so I think we'd better keep it. It sounds like a bear to debug, and I'd rather not have to debug it again.

May 18 '24 04:05 jimblandy

So I guess the remaining question is what you think of this one.

May 18 '24 04:05 jimblandy

So here's the latest version of my suggestion: https://github.com/jimblandy/wgpu/compare/hal-vk-wait-semaphore-is-option%7E2...jimblandy:wgpu:hal-vk-wait-semaphore-is-option

May 18 '24 05:05 jimblandy

Looks fine to me, could you either PR to my fork or just push to it?

May 18 '24 21:05 cwfitzgerald

I still want to get a handle on the swapchain semaphores. I've got some work-in-progress docs for those too. I'll take care of this tomorrow morning.

May 20 '24 03:05 jimblandy

@cwfitzgerald I just pushed two doc commits. Could you look them over for accuracy?

May 21 '24 01:05 jimblandy

rebased on current trunk

May 23 '24 22:05 jimblandy

wgpu wgpu copied to clipboard

[hal/vk] Rework Submission and Surface Synchronization

wgpu
wgpu copied to clipboard