Timeline semaphore sample not available on Windows

Open SaschaWillems opened this issue 1 year ago • 16 comments

Due to a kernel bug, the timeline semaphore sample was disabled on Windows:

Not enabled on Windows at this time due to bugs. Out-of-order submission in presentation causes kernel level issues, and need to be figured out before this sample can be enabled on Windows.

We should reevaluate this and if it works on Windows now, the sample should be enabled for that platform again.

SaschaWillems avatar Dec 27 '22 06:12 SaschaWillems

Right now, the sample is disabled using a define in the CMake file. So in order to build this on Windows, you need to remove the if statement at https://github.com/KhronosGroup/Vulkan-Samples/blob/master/samples/extensions/timeline_semaphore/CMakeLists.txt#L18

And also make sure to save all your work before running it, as the initial kernel bug caused hardlocks back then.

SaschaWillems avatar Jan 09 '23 16:01 SaschaWillems

Just tried again on an up-to-date Windows 10, and the sample still hangs for me in such a manner that I can't even quit it anymore. Looks like the kernel hang is still there :(

This means we don't have a usable timeline semaphore example for Windows. Maybe we should do another, simpler sample?

SaschaWillems avatar Jan 18 '23 19:01 SaschaWillems

I tried that on Win10, using an NVIDIA RTX A3000 Laptop GPU, and it runs fine! The window looks a bit strange (one frame out of the animation): [image]. It also seems to wait indefinitely on resize/maximize, but besides that, it seems to work on my end.

But the validation layer tells me right on startup:

Validation Error: [ VUID-VkSubmitInfo-pSignalSemaphores-03242 ] Object 0: handle = 0x612f93000000004e, type = VK_OBJECT_TYPE_SEMAPHORE; Object 1: handle = 0x1f0b732f1a0, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0xdb30ee87 | vkQueueSubmit(): pSubmits[0].pSignalSemaphores[0] signal value (0x1) in VkQueue 0x1f0b732f1a0[] must be greater than current timeline semaphore VkSemaphore 0x612f93000000004e[] value (0x1)

The Vulkan spec states: For each element of pSignalSemaphores created with a VkSemaphoreType of VK_SEMAPHORE_TYPE_TIMELINE the corresponding element of VkTimelineSemaphoreSubmitInfo::pSignalSemaphoreValues must have a value greater than the current value of the semaphore when the semaphore signal operation is executed (https://vulkan.lunarg.com/doc/view/1.3.231.1/windows/1.3-extensions/vkspec.html#VUID-VkSubmitInfo-pSignalSemaphores-03242)
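The rule quoted in that message (each value in VkTimelineSemaphoreSubmitInfo::pSignalSemaphoreValues must be strictly greater than the semaphore's current value) can be illustrated with a small host-side model. This is only a sketch: `TimelineCounter` is a hypothetical helper, not part of Vulkan or of the sample, and it assumes the semaphore was created with an initial value of 0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side bookkeeping for one timeline semaphore.
// Assumption: the semaphore was created with initialValue = 0.
struct TimelineCounter
{
    std::uint64_t last_signaled = 0;

    // Returns a value that is strictly greater than every value handed out before,
    // so it is always a legal entry for pSignalSemaphoreValues.
    std::uint64_t next()
    {
        return ++last_signaled;
    }

    // Mirrors the check in VUID-VkSubmitInfo-pSignalSemaphores-03242:
    // signalling a value equal to (or below) the current one is invalid.
    bool is_valid_signal(std::uint64_t value) const
    {
        return value > last_signaled;
    }
};
```

In the error above, the signal value (0x1) equals the current value (0x1), which this model would reject. Deriving every signal value from a single monotonically increasing counter is one way to avoid reusing a value.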

By the way, I think that the validation layer should be active by default

asuessenbach avatar Jan 23 '23 11:01 asuessenbach

Using the validation layer from the latest public SDK (1.3.236.0), I don't even get any validation layer errors. This sample just runs on Win10 with an RTX A3000. "Only" on resizing/maximizing, the program dies on some timeout, but the validation layer then says it's most likely a validation bug!?!:

Validation Error: [ UNASSIGNED-VkSemaphore-state-timeout ] Object 0: handle = 0x612f93000000004e, type = VK_OBJECT_TYPE_SEMAPHORE; | MessageID = 0x57e65a33 | Timeout waiting for timeline semaphore state to update. This is most likely a validation bug. completed_.payload=2844 wait_payload=2845

@SaschaWillems I have no idea what might be wrong, then. By visual code inspection, everything looks ok.

Would someone else dare to run this on Win10, to get more data points? Is it just Sascha's machine where it doesn't work, or is it just mine where it's working?

asuessenbach avatar Jan 30 '23 15:01 asuessenbach

Just checked again with everything updated, using the latest Vulkan developer driver, and I still get a kernel hang. Maybe it's not related to the GPU but something on the CPU side instead? I'm on an AMD Ryzen 5 3600.

SaschaWillems avatar Feb 18 '23 12:02 SaschaWillems

I've got the same issue with VK_LAYER_KHRONOS_validation enabled on Ubuntu 22.04.

vulkaninfo output:

Instance Layers: count = 11
---------------------------
VK_LAYER_INTEL_nullhw             INTEL NULL HW                                       1.1.73   version 1
VK_LAYER_KHRONOS_profiles         Khronos Profiles layer                              1.3.239  version 1
VK_LAYER_KHRONOS_synchronization2 Khronos Synchronization2 layer                      1.3.239  version 1
VK_LAYER_KHRONOS_validation       Khronos Validation Layer                            1.3.239  version 1
VK_LAYER_LUNARG_api_dump          LunarG API dump layer                               1.3.239  version 2
VK_LAYER_LUNARG_gfxreconstruct    GFXReconstruct Capture Layer Version 0.9.16-unknown 1.3.239  version 36880
VK_LAYER_LUNARG_monitor           Execution Monitoring Layer                          1.3.239  version 1
VK_LAYER_LUNARG_screenshot        LunarG image capture layer                          1.3.239  version 1
VK_LAYER_MESA_device_select       Linux device selection layer                        1.3.211  version 1
VK_LAYER_MESA_overlay             Mesa Overlay layer                                  1.3.211  version 1
VK_LAYER_NV_optimus               NVIDIA Optimus layer                                1.3.224  version 1

Devices:
========
GPU0:
        apiVersion         = 1.3.224
        driverVersion      = 525.89.2.128
        vendorID           = 0x10de
        deviceID           = 0x1e84
        deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
        deviceName         = NVIDIA GeForce RTX 2070 SUPER
        driverID           = DRIVER_ID_NVIDIA_PROPRIETARY
        driverName         = NVIDIA
        driverInfo         = 525.89.02
        conformanceVersion = 1.3.3.1
        deviceUUID         = 206486a7-29c4-1513-f19e-f861dcdaf4dd
        driverUUID         = 5f84013f-2322-5221-9dc0-81d43e19382b

avpdiver avatar Mar 29 '23 09:03 avpdiver

@avpdiver: So you get a kernel hang on Linux? If so, this may hint at a general problem with the sample, rather than an issue with isolated setups.

SaschaWillems avatar Apr 21 '23 15:04 SaschaWillems

After updating to Windows 11, I no longer get a kernel hang, but the sample still does not work. It just displays a blank window. If I toggle between other windows and back to the sample, I get something displayed, but it's never updated :(

SaschaWillems avatar Apr 21 '23 15:04 SaschaWillems

I tried to debug this, and it looks like the sample only does the first submit and then gets stuck waiting for something forever. If I remove the lock guard.

If we're not able to fix this sample maybe we can add a more basic timeline semaphore sample that's guaranteed to work.

SaschaWillems avatar Apr 21 '23 16:04 SaschaWillems

@HansKristian-Work: Can you help us with this one?

SaschaWillems avatar Apr 24 '23 16:04 SaschaWillems

If it still hangs the kernel, that means Windows still has not been fixed. Not sure what I can do.

HansKristian-Work avatar Apr 24 '23 16:04 HansKristian-Work

To make the sample more robust, I'd suggest removing the wait-before-signal. That's the real problem on Windows. If there is a stalled queue (does not have to be the present queue itself) that cannot make forward progress when vkQueuePresentKHR is called, the entire system locks up.

HansKristian-Work avatar Apr 24 '23 17:04 HansKristian-Work

I am hitting seemingly the same issue in a hobby project. (wip commit)

If there is a stalled queue (does not have to be the present queue itself) that cannot make forward progress when vkQueuePresentKHR is called, the entire system locks up.

So this might occur because the semaphore wait values are invalid (i.e. too high) and the next frame therefore cannot be rendered?

... Or are you talking about any vkQueuePresentKHR waiting for a stalled vkQueueSubmit that is in turn waiting on a timeline semaphore? I would guess this is a normal use case for timeline semaphores?

Avokadoen avatar May 01 '23 18:05 Avokadoen

vkQueuePresentKHR waiting for a stalled vkQueueSubmit that is again waiting on a timeline semaphore

Any queue, yes. The behavior seems to be as if the present on Windows does a full device-wide wait-for-idle instead of just waiting for the present queue. That will indeed deadlock, but it also violates the Vulkan specification. As long as the sample ensures that all queues have forward progress at the time of Present, that seems to work around it.
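As a rough illustration of "all queues have forward progress at the time of Present", an application could track on the host which timeline waits already have a matching signal submitted, and only call vkQueuePresentKHR when none are outstanding. The sketch below is a simplified model under stated assumptions: `ForwardProgressTracker`, `PendingWait`, and the integer semaphore ids are hypothetical names, not Vulkan or Vulkan-Samples API; all semaphores are assumed to start at 0, and completed waits are never retired.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical record of a timeline wait that has been submitted to some queue.
struct PendingWait
{
    int           semaphore;  // application-chosen id standing in for a VkSemaphore
    std::uint64_t value;      // value the queue waits for
};

// Hypothetical host-side tracker. It makes no Vulkan calls and only
// records what the application has submitted so far.
class ForwardProgressTracker
{
  public:
    // Call when a submission (or a host signal) that signals `value` has been issued.
    void on_signal_submitted(int semaphore, std::uint64_t value)
    {
        std::uint64_t &highest = highest_signal_[semaphore];
        if (value > highest)
        {
            highest = value;
        }
    }

    // Call when a submission that waits for `value` has been issued.
    void on_wait_submitted(int semaphore, std::uint64_t value)
    {
        waits_.push_back({semaphore, value});
    }

    // True if every submitted wait already has a signal submitted that can satisfy it,
    // i.e. no queue is stalled in a wait-before-signal state.
    bool safe_to_present() const
    {
        for (const PendingWait &w : waits_)
        {
            auto                it      = highest_signal_.find(w.semaphore);
            const std::uint64_t highest = (it == highest_signal_.end()) ? 0 : it->second;
            if (highest < w.value)
            {
                return false;
            }
        }
        return true;
    }

  private:
    std::map<int, std::uint64_t> highest_signal_;
    std::vector<PendingWait>     waits_;
};
```

With such a tracker, a wait-before-signal submission would make `safe_to_present()` return false until the corresponding signal has been submitted, which is the condition the workaround relies on.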

HansKristian-Work avatar May 01 '23 22:05 HansKristian-Work

Thanks for the clarification. Do we know if this will be fixed any time soon?

Avokadoen avatar May 02 '23 11:05 Avokadoen

Do we know if this will be fixed any time soon?

No idea. It's been known for years at this point.

HansKristian-Work avatar May 02 '23 11:05 HansKristian-Work