wgpu icon indicating copy to clipboard operation
wgpu copied to clipboard

[hal/vk] Rework Submission and Surface Synchronization

Open cwfitzgerald opened this issue 3 months ago • 10 comments

Connections

  • Closes #5559
  • Closes #5693
  • Maybe #4919
  • Maybe #5374
  • Maybe #5491
  • Probably not #3452.

Description

As described in #5559, our submission and surface synchronization was totally messed up. I've revamped it, made smarter classes to help keep all the bookkeeping straight, and generally make the code cleaner and easier to understand.

Testing

works on my machine lel

cwfitzgerald avatar May 08 '24 22:05 cwfitzgerald

Works fine for me running the cube example with VK validation layers enabled

[2024-05-09T06:03:33Z INFO  wgpu_core::instance] Adapter Vulkan AdapterInfo { name: "Intel(R) Arc(TM) Graphics", vendor: 32902, device: 32085, device_type: IntegratedGpu, driver: "Intel Corporation", driver_info: "101.5382", backend: Vulkan }

JMS55 avatar May 09 '24 06:05 JMS55

This fixes an annoying access violation termination (that I thought was a driver bug) on program exit after a present.

Vecvec avatar May 09 '24 07:05 Vecvec

Unfortunately, it does not fix #5644 for me. Same constant validation error on presentation:

[2024-05-09T08:16:24Z ERROR wgpu_hal::vulkan::instance] VALIDATION [VUID-vkAcquireNextImageKHR-semaphore-01779 (0x5717e75b)]
        Validation Error: [ VUID-vkAcquireNextImageKHR-semaphore-01779 ] Object 0: handle = 0xba7514000000002a, type = VK_OBJECT_TYPE_SEMAPHORE; | MessageID = 0x5717e75b | vkAcquireNextImageKHR():  Semaphore must not have any pending operations. The Vulkan spec states: If semaphore is not VK_NULL_HANDLE it must not have any uncompleted signal or wait operations pending (https://www.khronos.org/registry/vulkan/specs/1.3-extensions/html/vkspec.html#VUID-vkAcquireNextImageKHR-semaphore-01779)
[2024-05-09T08:16:24Z ERROR wgpu_hal::vulkan::instance]         objects: (type: SEMAPHORE, hndl: 0xba7514000000002a, name: ?)

The actual validation error only happens with a debug build, although release feels much slower than 0.19 as well. I believe something is very slow and the errors actually only show up in debug mode.

Flamegraph of cube seems to indicate this—with present taking a huge amount of time:

flamegraph

For comparison, same drivers with 0.19:

flamegraph

Even if I enable Immediate present mode, frames seem to take twice as much time to be rendered with 0.20.

hecrj avatar May 09 '24 08:05 hecrj

@hecrj could you try again with these latest commits, someone else who can reproduce the validation error says that this fixed it.

cwfitzgerald avatar May 10 '24 23:05 cwfitzgerald

@cwfitzgerald Yep! The latest changes fix the validation errors. Thanks!

I'm still noticing a bit of worse performance in debug mode, but it doesn't seem to be there in release mode anymore. Flamegraphs look the same as before, but I imagine it's expected since now synchronization works properly and there may be more waiting. No errors in any case!

hecrj avatar May 11 '24 10:05 hecrj

I'm actually rewriting these comments as I review so if you're not attached to them then don't worry too much about fixing anything.

jimblandy avatar May 14 '24 01:05 jimblandy

Yeah that's fine - I was just trying to get something there.

cwfitzgerald avatar May 14 '24 02:05 cwfitzgerald

@cwfitzgerald What do you think of this? https://github.com/gfx-rs/wgpu/commit/da543fb51

My hope was that this would make the states of RelaySemaphores a bit clearer to a new reader. If people should not use wait unless should_wait is true, then let's actually force people to check that condition before they can even get at the value: classic Option. But it does make advance fallible and require us to pass the device.

[edit: made this a little nicer still: RelaySemaphoreState is no longer necessary]

jimblandy avatar May 17 '24 20:05 jimblandy

@cwfitzgerald Here's an even more radical suggestion: https://github.com/gfx-rs/wgpu/commit/b5d52fed45e4f89213e6fe5870e99126865711aa

Just use a single semaphore for all queue ordering.

jimblandy avatar May 17 '24 21:05 jimblandy

@jimblandy i think we can go for it, the only question is if we want to still avoid this deadlock in anv mentioned in the old code, it if we consider that old enough to ignore.

/// It would be correct to use a single semaphore there, but 
/// [Intel hangs in `anv_queue_finish`](https://gitlab.freedesktop.org/mesa/mesa/-/issues/5508).

I should have perseved that comment somehow.

cwfitzgerald avatar May 17 '24 21:05 cwfitzgerald

@cwfitzgerald:

I should have perseved that comment somehow.

Ahh, yes, I was reviewing by checking out the branch and just wandering around with rust-analyzer, so I didn't notice that the patch removed the comment. If that's the sole rationale for the entire semaphore-swapping doohickey then we had definitely better have the comment.

Kvark filed that bug in Oct 2021. That's not that long ago, so I think we'd better keep it. It sounds like a bear to debug, and I'd rather not have to debug it again.

jimblandy avatar May 18 '24 04:05 jimblandy

So I guess the remaining question is what you think of this one.

jimblandy avatar May 18 '24 04:05 jimblandy

So here's the latest version of my suggestion: https://github.com/jimblandy/wgpu/compare/hal-vk-wait-semaphore-is-option%7E2...jimblandy:wgpu:hal-vk-wait-semaphore-is-option

jimblandy avatar May 18 '24 05:05 jimblandy

Looks fine to me, could you either PR to my fork or just push to it?

cwfitzgerald avatar May 18 '24 21:05 cwfitzgerald

I still want to get a handle on the swapchain semaphores. I've got some work-in-progress docs for those too. I'll take care of this tomorrow morning.

jimblandy avatar May 20 '24 03:05 jimblandy

@cwfitzgerald I just pushed two doc commits. Could you look them over for accuracy?

jimblandy avatar May 21 '24 01:05 jimblandy

rebased on current trunk

jimblandy avatar May 23 '24 22:05 jimblandy