computecullandlod - memory barrier ignored??
In buildComputeCommandBuffer() the vkCmdDispatch() call is sandwiched between two vkCmdPipelineBarrier() calls with alternating source and destination AccessMask, QueueFamilyIndex and StageMask values for the indirectCommandsBuffer (a rough sketch of the recording is at the end of this post). While this is meant to be an acquire/release operation, validation reports:
- in the first call to vkQueueSubmit() on compute.queue in draw():
ERROR: [-1725967473][UNASSIGNED-VkBufferMemoryBarrier-buffer-00004] : Validation Error: [ UNASSIGNED-VkBufferMemoryBarrier-buffer-00004 ] Object 0: handle = 0x18d371af060, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x991fd38f | vkQueueSubmit(): in submitted command buffer VkBufferMemoryBarrier acquiring ownership of VkBuffer (VkBuffer 0x612f93000000004e[]), from srcQueueFamilyIndex 0 to dstQueueFamilyIndex 2 has no matching release barrier queued for execution.
- in the subsequent calls to vkQueueSubmit() on compute.queue in draw():
WARNING: [-882403456][UNASSIGNED-VkBufferMemoryBarrier-buffer-00003] : Validation Warning: [ UNASSIGNED-VkBufferMemoryBarrier-buffer-00003 ] Object 0: handle = 0x18d371af060, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0xcb679780 | vkQueueSubmit(): VkBufferMemoryBarrier releasing queue ownership of VkBuffer (VkBuffer 0x612f93000000004e[]), from srcQueueFamilyIndex 2 to dstQueueFamilyIndex 0 duplicates existing barrier queued for execution, without intervening acquire operation.

ERROR: [-1725967473][UNASSIGNED-VkBufferMemoryBarrier-buffer-00004] : Validation Error: [ UNASSIGNED-VkBufferMemoryBarrier-buffer-00004 ] Object 0: handle = 0x18d371af060, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x991fd38f | vkQueueSubmit(): in submitted command buffer VkBufferMemoryBarrier acquiring ownership of VkBuffer (VkBuffer 0x612f93000000004e[]), from srcQueueFamilyIndex 0 to dstQueueFamilyIndex 2 has no matching release barrier queued for execution.
It's as if the second vkCmdPipelineBarrier() has no effect and is totally ignored.
To complicate matters, commenting out both vkCmdPipelineBarrier() calls OR using VK_QUEUE_FAMILY_IGNORED for the source and destination QueueFamilyIndex silences the validation layer.
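For context, this is roughly the shape of what buildComputeCommandBuffer() records around the dispatch. This is a paraphrased sketch, not the sample's literal code; the variable names and exact masks are placeholders:

```cpp
// Sketch of the compute command buffer recording (placeholder names/masks)

// First barrier: acquire ownership of the indirect commands buffer from the graphics queue family
VkBufferMemoryBarrier acquire{};
acquire.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
acquire.srcAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
acquire.dstAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
acquire.srcQueueFamilyIndex = queueFamilyIndices.graphics;
acquire.dstQueueFamilyIndex = queueFamilyIndices.compute;
acquire.buffer = indirectCommandsBuffer;
acquire.size = VK_WHOLE_SIZE;
vkCmdPipelineBarrier(computeCmdBuffer,
    VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
    0, 0, nullptr, 1, &acquire, 0, nullptr);

// Culling / LOD selection dispatch that rewrites the indirect draw commands
vkCmdDispatch(computeCmdBuffer, groupCount, 1, 1);

// Second barrier: release ownership back to the graphics queue family (masks/indices mirrored)
VkBufferMemoryBarrier release = acquire;
release.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
release.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
release.srcQueueFamilyIndex = queueFamilyIndices.compute;
release.dstQueueFamilyIndex = queueFamilyIndices.graphics;
vkCmdPipelineBarrier(computeCmdBuffer,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
    0, 0, nullptr, 1, &release, 0, nullptr);
```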
Validation in that sample is pretty broken, and it's something I want to rework. And while removing the barriers may silence the validation layers, it will probably break the sample on some of the stricter Vulkan implementations.
For anyone who comes across this in a Google search while struggling to get compute shader frustum culling to work... well, that's how I got here, and to the best of my knowledge the problem is that this is only half of what's called a queue family ownership transfer. Apparently, VkBuffers and VkImages and whatnot can only be used by one queue family (graphics, compute, transfer, etc.) at a time, unless they're created using VK_SHARING_MODE_CONCURRENT, which is less performant. If you need to use one of these resources on multiple queues without discarding all of its data as it moves from one queue to another, then you need to perform a queue family ownership transfer.
The purpose of the pipeline barrier commands (using VkBufferMemoryBarriers) is to release and acquire ownership. The problem with the example is that when we record compute commands, we'll acquire and then release ownership... but we don't do the same when recording graphics commands. The compute queue lobs the buffer over to the graphics queue across the room, but the graphics queue isn't ready to catch it and just gets donked on the side of the head, right? So we need to have the graphics command buffers also acquire and release ownership. Of course, fixing this isn't as simple as just putting pipeline barriers immediately before and after the draw-indirect call, because...
- In order to use vkCmdPipelineBarrier within a render pass, the subpass during which that API is called must have a self-dependency -- that is, a VkSubpassDependency where the source and destination subpass are the same...
- However, compute shaders in general can't run during a render pass, and it's a compute stage that we're trying to transfer to and from...
- Therefore, the graphics-side vkCmdPipelineBarrier calls must be made from outside of a render pass -- so, before starting and after ending the relevant render pass (see the rough sketch after this list).
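Something like the following is what I have in mind. This is only a sketch under my own assumptions; graphicsCmdBuffer, queueFamilyIndices and indirectCommandsBuffer are placeholder names, not the sample's actual variables:

```cpp
// Graphics command buffer: acquire the indirect commands buffer from the compute
// queue family before the render pass, release it back afterwards.

// Acquire: srcAccessMask is ignored on the acquiring side
VkBufferMemoryBarrier acquire{};
acquire.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
acquire.srcAccessMask = 0;
acquire.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;  // how the graphics queue reads it
acquire.srcQueueFamilyIndex = queueFamilyIndices.compute;
acquire.dstQueueFamilyIndex = queueFamilyIndices.graphics;
acquire.buffer = indirectCommandsBuffer;
acquire.size = VK_WHOLE_SIZE;
vkCmdPipelineBarrier(graphicsCmdBuffer,
    VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
    0, 0, nullptr, 1, &acquire, 0, nullptr);

vkCmdBeginRenderPass(graphicsCmdBuffer, &renderPassBeginInfo, VK_SUBPASS_CONTENTS_INLINE);
// ... bind pipeline/descriptors, vkCmdDrawIndexedIndirect(...) ...
vkCmdEndRenderPass(graphicsCmdBuffer);

// Release: dstAccessMask is ignored on the releasing side
VkBufferMemoryBarrier release = acquire;
release.srcAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;  // last graphics-side access
release.dstAccessMask = 0;
release.srcQueueFamilyIndex = queueFamilyIndices.graphics;
release.dstQueueFamilyIndex = queueFamilyIndices.compute;
vkCmdPipelineBarrier(graphicsCmdBuffer,
    VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
    0, 0, nullptr, 1, &release, 0, nullptr);
```

The compute command buffer keeps its existing acquire/release pair, so each release on one queue has a matching acquire on the other. The two queue submissions still have to be ordered with a semaphore for the ownership transfer to be valid.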
There may be more fixes needed -- like I said, I can't seem to get compute shader frustum culling to work in my own renderer, and this hasn't changed that -- but this is what I've been able to come up with re: memory barriers after... what, an hour? two? of searching?
Sorry to hear. As noted above, the sample is in dire need of proper synchronization. I'm completely reworking sync in a new branch, and hopefully will have a fix for that sample in there too. No ETA yet, though, as it's a lot of work.
Can you take a look at the updated code? A recent PR made some changes to how barriers are set up in this sample: https://github.com/SaschaWillems/Vulkan/commit/b2f501dc98c967ec5f49d2e47d4f4975753b2a48#diff-0e98c00a2325f74a48a971d0674f7d28c56d895b51a833a2e53eafddc336af48
Sync validation is clean now.
When I was studying these samples, the solution I found did not involve changing the underlying queue infrastructure.
For example in computeCloth, in addComputeToGraphicsBarriers() I used:

```cpp
// VkBufferMemoryBarrier fields
srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT;
dstAccessMask = VK_ACCESS_MEMORY_READ_BIT;
srcQueueFamilyIndex = vulkanDevice->queueFamilyIndices.compute;
dstQueueFamilyIndex = vulkanDevice->queueFamilyIndices.graphics;

vkCmdPipelineBarrier(...,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
    VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT, ...);
```
Quick follow-up to this discussion...
Before submitting the PR with barrier changes, I experimented with using VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT as above, since it works in both graphics and compute queues without validation errors. However, I was a bit concerned with using it in the graphics queue for a buffer release, i.e. within addGraphicsToComputeBarriers() when finalizing the graphics command buffer. Since the Draw Indirect stage occurs before the Vertex Input stage in the Vulkan pipeline, I was concerned the release would happen too quickly, i.e. before the Vertex Input processing stage was complete. In addition, by using VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT in the graphics queue/command buffer, we would be assuming that the graphics queue also supports compute. This is probably the case 99% of the time, but perhaps there could be cases where this is not true.
The solution I eventually chose comes from the Vulkan Synchronization Examples (Legacy synchronization APIs). See https://github.com/KhronosGroup/Vulkan-Docs/wiki/Synchronization-Examples-(Legacy-synchronization-APIs). The example shows sync between a Graphics Queue and Transfer Queue, but the pattern is the same for Graphics and Compute. See the example called "Command Buffer Recording and Submission for a unified transfer/graphics queue." The name is a bit misleading since it covers both the unified and separate queues cases.
From what I can see, the advantages of using this pattern are that:
- It completely separates graphics queue-specific and compute queue-specific pipeline stages, so they do not have to be mixed into a single pipeline barrier call. The graphics pipeline barrier call uses only graphics pipeline stages, and the compute barrier call uses only compute pipeline stages. Because of this you can specify the exact pipeline stages you are actually using without any compromises or validation errors.
- It makes no assumption regarding compute capabilities in the graphics queue.
- You can use the proper srcAccessMask and dstAccessMask for the graphics and compute queues without needing to fall back to generic options like VK_ACCESS_MEMORY_WRITE_BIT and VK_ACCESS_MEMORY_READ_BIT. (A rough sketch of the split release/acquire pattern follows this list.)
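To make that concrete, here is roughly the shape of the pattern for the compute-to-graphics direction, e.g. a storage buffer the compute shader writes and the vertex input stage later reads. This is my paraphrase of the linked Khronos example with placeholder names, not the exact code from the PR:

```cpp
// Compute queue: release ownership after the compute shader has written the buffer.
VkBufferMemoryBarrier release{};
release.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
release.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;           // exactly what compute wrote
release.dstAccessMask = 0;                                    // ignored for a release
release.srcQueueFamilyIndex = queueFamilyIndices.compute;
release.dstQueueFamilyIndex = queueFamilyIndices.graphics;
release.buffer = storageBuffer;
release.size = VK_WHOLE_SIZE;
vkCmdPipelineBarrier(computeCmdBuffer,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,                     // compute-only stage, compute queue
    VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
    0, 0, nullptr, 1, &release, 0, nullptr);

// Graphics queue: acquire ownership before the vertex input stage reads the buffer.
VkBufferMemoryBarrier acquire = release;
acquire.srcAccessMask = 0;                                    // ignored for an acquire
acquire.dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT;  // exactly how graphics reads
vkCmdPipelineBarrier(graphicsCmdBuffer,
    VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
    VK_PIPELINE_STAGE_VERTEX_INPUT_BIT,                       // graphics-only stage, graphics queue
    0, 0, nullptr, 1, &acquire, 0, nullptr);
```

The release and the acquire use the same queue family indices and the same buffer range, and the two submissions are ordered with a semaphore, as in the linked example.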
The only issue with implementing this approach was that for the computecloth example it required parameterizing the addComputeToGraphicsBarriers() and addGraphicsToComputeBarriers() functions. The parameters required for release and acquire are now dependent on which queue/command buffer you are referring to. For the other compute* examples I used the same pattern but without the above helper functions, i.e. the barriers were implemented inline.
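For reference, the parameterization I'm describing looks roughly like this; a hypothetical signature to illustrate the idea, not a copy of the PR code:

```cpp
// Hypothetical helper shape: the caller passes the masks/stages that are valid for
// the queue whose command buffer is being recorded.
void addComputeToGraphicsBarriers(
    VkCommandBuffer commandBuffer,
    VkAccessFlags srcAccessMask, VkAccessFlags dstAccessMask,
    VkPipelineStageFlags srcStageMask, VkPipelineStageFlags dstStageMask);

// Compute command buffer (release): SHADER_WRITE -> 0, compute shader stage -> bottom of pipe
// Graphics command buffer (acquire): 0 -> VERTEX_ATTRIBUTE_READ, top of pipe -> vertex input stage
```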
I hope this explains why I made the changes I did. Please let me know if I have made any incorrect assumptions or if you have any suggestions for changes / improvements.