
Command Queue investigations

Open kdashg opened this issue 8 years ago • 9 comments

Glossary

  • Instance (VkInstance, IDXGIFactory, -)
  • Adapter (VkPhysicalDevice, IDXGIAdapter, -)
  • Device (VkDevice, ID3D12Device, MTLDevice)
  • Command Queue (VkQueue, ID3D12CommandQueue, MTLCommandQueue)
  • Command Buffer (VkCommandBuffer, ID3D12CommandList, MTLCommandBuffer)
  • Render Pass (VkRenderPass, -, MTLRenderCommandEncoder)

Queues

Command Queue objects exist in all three APIs: VkQueue, ID3D12CommandQueue, and MTLCommandQueue. Command Queues are specific to their Device. Command Buffers are submitted via Command Queues. In Metal, Command Buffers are created from Command Queues, whereas in Vulkan and D3D12 they are created from Devices and submitted later to a Command Queue belonging to the same Device.

While queues in D3D12 and Metal are created as-needed, queues in Vulkan are created during device creation. vkCreateDevice takes an array of VkDeviceQueueCreateInfos, each of which specifies the number of queues to create for a given "queue family". Queue families are enumerated via vkGetPhysicalDeviceQueueFamilyProperties(VkPhysicalDevice), which reports the VkQueueFlags supported by each queue family. A device's queues are then retrieved with vkGetDeviceQueue(device, queueFamilyIndex, queueIndex).
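A minimal sketch of the Vulkan flow (error handling omitted; assumes a physicalDevice already obtained via vkEnumeratePhysicalDevices):

```cpp
#include <vulkan/vulkan.h>
#include <vector>

uint32_t familyCount = 0;
vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &familyCount, nullptr);
std::vector<VkQueueFamilyProperties> families(familyCount);
vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &familyCount, families.data());

// Pick the first family that supports graphics.
uint32_t graphicsFamily = 0;
for (uint32_t i = 0; i < familyCount; ++i) {
    if (families[i].queueFlags & VK_QUEUE_GRAPHICS_BIT) { graphicsFamily = i; break; }
}

float priority = 1.0f;
VkDeviceQueueCreateInfo queueInfo = {VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO};
queueInfo.queueFamilyIndex = graphicsFamily;
queueInfo.queueCount = 1;                  // one queue from this family
queueInfo.pQueuePriorities = &priority;

VkDeviceCreateInfo deviceInfo = {VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO};
deviceInfo.queueCreateInfoCount = 1;
deviceInfo.pQueueCreateInfos = &queueInfo;

VkDevice device = VK_NULL_HANDLE;
vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);

VkQueue graphicsQueue = VK_NULL_HANDLE;
vkGetDeviceQueue(device, graphicsFamily, 0, &graphicsQueue);  // retrieve the pre-created queue
```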

Command Queue types

There are three types of commands in each API:

  • Graphics (Graphics, Direct, Render)
  • Compute
  • Transfer (Transfer, Copy, Blit)

Compute and Transfer don't seem to be fleshed out in D3D12 yet? (ID3D12GraphicsCommandList is the only documented child of ID3D12CommandList)

D3D12

Each ID3D12CommandQueue has a single D3D12_COMMAND_LIST_TYPE:

  • D3D12_COMMAND_LIST_TYPE_DIRECT (Graphics+Compute+Transfer)
  • D3D12_COMMAND_LIST_TYPE_COMPUTE (Compute+Transfer)
  • D3D12_COMMAND_LIST_TYPE_COPY (Transfer)
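For example, creating a compute queue is just a matter of passing the corresponding list type in a D3D12_COMMAND_QUEUE_DESC. A minimal sketch, assuming an existing ID3D12Device* device:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

D3D12_COMMAND_QUEUE_DESC desc = {};
desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;   // or _DIRECT / _COPY
desc.Flags = D3D12_COMMAND_QUEUE_FLAG_NONE;

ComPtr<ID3D12CommandQueue> computeQueue;
device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));
```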

Vulkan

Each Vulkan queue family may support any OR'd combination of flags (VkQueueFlags), reported in VkQueueFamilyProperties::queueFlags:

  • VK_QUEUE_GRAPHICS_BIT
  • VK_QUEUE_COMPUTE_BIT
  • VK_QUEUE_TRANSFER_BIT

Very relevant: "If an implementation exposes any queue family that supports graphics operations, at least one queue family of at least one physical device exposed by the implementation must support both graphics and compute operations."

"All commands that are allowed on a queue that supports transfer operations are also allowed on a queue that supports either graphics or compute operations. Thus, if the capabilities of a queue family include VK_QUEUE_GRAPHICS_BIT or VK_QUEUE_COMPUTE_BIT, then reporting the VK_QUEUE_TRANSFER_BIT capability separately for that queue family is optional."

Metal

The command types are surfaced as three distinct MTL*CommandEncoder interfaces, created from the following MTLCommandBuffer methods:

  • makeRenderCommandEncoder() -> MTLRenderCommandEncoder (Graphics)
  • makeComputeCommandEncoder() -> MTLComputeCommandEncoder (Compute)
  • makeBlitCommandEncoder() -> MTLBlitCommandEncoder (Transfer)

Command Buffers

Due to the differences in command buffer submission for Metal, I'll delve into Command Buffers a bit.

Command Buffer creation

  • vkAllocateCommandBuffers(VkCommandPool)
  • ID3D12Device::CreateCommandList(D3D12_COMMAND_LIST_TYPE, ID3D12CommandAllocator)
  • MTLCommandQueue::makeCommandBuffer()
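In D3D12, command list creation also requires a command allocator of the same type. A minimal sketch, assuming an existing ID3D12Device* device:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

ComPtr<ID3D12CommandAllocator> allocator;
device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT, IID_PPV_ARGS(&allocator));

ComPtr<ID3D12GraphicsCommandList> commandList;
device->CreateCommandList(0 /* nodeMask */, D3D12_COMMAND_LIST_TYPE_DIRECT,
                          allocator.Get(), nullptr /* initial pipeline state */,
                          IID_PPV_ARGS(&commandList));
// The command list is created in the recording ("open") state,
// which is why Begin is implicit for D3D12, as noted below.
```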

Command Buffer recording

Begin:

  • vkBeginCommandBuffer(VkCommandBuffer)
  • (implicit with CreateCommandList)
  • (implicit with makeCommandBuffer)

End:

  • vkEndCommandBuffer(VkCommandBuffer)
  • ID3D12GraphicsCommandList::Close()
  • (implicit with commit())

Reset:

  • vkResetCommandBuffer(VkCommandBuffer)
  • ID3D12GraphicsCommandList::Reset()
  • (unsupported)

Command Buffer submission

  • vkQueueSubmit(VkQueue, VkSubmitInfo{ VkCommandBuffer[] }[])
  • ID3D12CommandQueue::ExecuteCommandLists(ID3D12CommandList[])
  • MTLCommandBuffer::commit() (enqueue() may be called earlier to reserve the buffer's position in the queue)
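To illustrate the Vulkan entry above: a submission wraps the command buffers in one or more VkSubmitInfo batches. A minimal sketch, assuming a recorded cmdBuf and a queue:

```cpp
VkSubmitInfo submit = {VK_STRUCTURE_TYPE_SUBMIT_INFO};
submit.commandBufferCount = 1;
submit.pCommandBuffers = &cmdBuf;
vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);  // no fence attached to this submission
```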

Rough skeletons:

Vulkan:

  device = vkCreateDevice(VkInstance, VkDeviceQueueCreateInfo[])
  commandQueue = vkGetDeviceQueue(device)
  commandBuffer = vkAllocateCommandBuffers(device)
  // ...
  vkResetCommandBuffer(commandBuffer)
  vkBeginCommandBuffer(commandBuffer)
  vkCmdBeginRenderPass(commandBuffer, VkRenderPass, VkFramebuffer)
  vkCmdDraw(commandBuffer)
  vkCmdEndRenderPass(commandBuffer)
  vkEndCommandBuffer(commandBuffer)
  vkQueueSubmit(commandQueue, commandBuffer)

D3D12:

  commandQueue = device.ID3D12Device::CreateCommandQueue()
  commandBuffer = device.ID3D12Device::CreateCommandList()
  // ...
  commandBuffer.ID3D12GraphicsCommandList::Reset()
  // implicit Render Pass
  commandBuffer.ID3D12GraphicsCommandList::DrawInstanced()
  commandQueue.ID3D12CommandQueue::ExecuteCommandLists(commandBuffer)

Metal:

  commandQueue = device.MTLDevice::makeCommandQueue()
  // ...
  commandBuffer = commandQueue.MTLCommandQueue::makeCommandBuffer()
  // potentially commandBuffer.MTLCommandBuffer::enqueue() already
  renderPass = commandBuffer.MTLCommandBuffer::makeRenderCommandEncoder()
  renderPass.MTLRenderCommandEncoder::drawPrimitives()
  renderPass.MTLCommandEncoder::endEncoding()
  commandBuffer.MTLCommandBuffer::commit()

Fences

Vulkan

VkFence

Signals host from GPU.

"Fences are a synchronization primitive that can be used to insert a dependency from a queue to the host. Fences have two states - signaled and unsignaled. A fence can be signaled as part of the execution of a queue submission command. Fences can be unsignaled on the host with vkResetFences. Fences can be waited on by the host with the vkWaitForFences command, and the current state can be queried with vkGetFenceStatus."

VkSemaphore

Signals command buffer from command buffer.

"Semaphores are a synchronization primitive that can be used to insert a dependency between batches submitted to queues. Semaphores have two states - signaled and unsignaled. The state of a semaphore can be signaled after execution of a batch of commands is completed. A batch can wait for a semaphore to become signaled before it begins execution, and the semaphore is also unsignaled before the batch begins execution."

VkEvent

Signals queue from queue or host.

"Events are a synchronization primitive that can be used to insert a fine-grained dependency between commands submitted to the same queue, or between the host and a queue. Events have two states - signaled and unsignaled. An application can signal an event, or unsignal it, on either the host or the device. A device can wait for an event to become signaled before executing further operations. No command exists to wait for an event to become signaled on the host, but the current state of an event can be queried."

D3D12

ID3D12Fence

Signals host or GPU from GPU.

  • Set to a value with Signal(u64)
  • Polled with GetCompletedValue() -> u64
  • ID3D12Fence::SetEventOnCompletion(UINT64 Value, HANDLE hEvent)
  • ID3D12Device1::SetEventOnMultipleFenceCompletion(ID3D12Fence[] fences, UINT64[] vals, HANDLE hEvent)
  • ID3D12CommandQueue::Signal(ID3D12Fence, u64)
  • ID3D12CommandQueue::Wait(ID3D12Fence, u64)

Windows Event

D3D12 uses Windows Events for the host side of GPU->host synchronization.
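Put together, a host-side wait might look like this (a minimal sketch; commandQueue, fence, and the lastSubmittedValue bookkeeping are assumptions for illustration):

```cpp
#include <d3d12.h>
#include <windows.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

UINT64 fenceValue = ++lastSubmittedValue;             // hypothetical monotonically increasing serial
commandQueue->Signal(fence.Get(), fenceValue);        // GPU sets the value after prior submits complete

if (fence->GetCompletedValue() < fenceValue) {
    HANDLE event = CreateEvent(nullptr, FALSE, FALSE, nullptr);
    fence->SetEventOnCompletion(fenceValue, event);   // Windows Event fires when the value is reached
    WaitForSingleObject(event, INFINITE);
    CloseHandle(event);
}
```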

Metal

MTLFence

Signals command buffer from command buffer.

  • Created from MTLDevice
  • Signaled from encoder
  • Waited on by encoder

"Drivers may delay fence updates until the end of the command encoder; drivers may also wait for fences at the beginning of a command encoder. Therefore, you are not allowed to wait on a fence after it has been updated in the same command encoder."

MTLCommandBuffer

  • MTLCommandBuffer::add{Scheduled,Completed}Handler(MTLCommandBufferHandler)
  • MTLCommandBuffer::waitUntil{Scheduled,Completed}()

MTLCommandBufferHandler is "A block of code to be invoked".

Equivalency

  • Host can Poll/Wait on GPU with VkFence/ID3D12Fence::SetEventOnCompletion/MTLCommandBuffer::{addHandler,waitUntil}().
  • Command Buffer can Wait on Command Buffer with VkSemaphore/ID3D12CommandQueue::Wait/MTLFence.
  • Host can delay Command Buffer execution by:
    -- VkEvent
    -- Not submitting Command Buffers

Proposal

Vulkan/D3D12-style queues, which are readily implementable on Metal.

D3D12 generally uses u64 signal values, whereas Vulkan uses boolean signals and Metal uses simple signal/wait. Metal's semantics are readily implementable on the other two, but are less sophisticated in comparison.

VkFence is implementable on ID3D12Fence, and more coarsely on MTLCommandBuffer::add*/wait*. VkSemaphore is implementable on ID3D12CommandQueue::Signal/Wait and MTLFence. VkEvent is emulatable on D3D12 and Metal, but does not have a direct equivalent.

kdashg avatar Jun 14 '17 21:06 kdashg

Thanks @jdashg for the extensive investigation!

Fences (GPU->CPU synchronization) work differently in the native APIs but they can all be emulated on top of each other. Personally I find the D3D12 style of updating a u64 very appealing for keeping track of resources with some sort of serial number, and would favor this approach.

My understanding of VkEvent is that it allows doing "split memory barriers" which can be optimized out by the driver or scheduled slightly more efficiently. I believe this is only useful for the last couple % of performance and wouldn't try to fit this use case into WebGPU. Also I don't see a use-case for signaling the CPU in the middle of a command-buffer so I suggest not having an equivalent of VkEvent.

If synchronization between queues is left for the application to do, then some synchronization primitive will be needed. The way D3D12 Signal and Wait interacts with Fences is nice, but I'm not sure how efficiently it can be implemented on other APIs. However the point of view we have for NXT is that the implementation should insert the correct barriers and synchronization automatically and have the application "transition" resources from one queue to another between submits.

Having command buffers created from the device instead of queues sounds good, especially since in Metal there would be only one queue backing all operations.

Kangz avatar Jun 15 '17 19:06 Kangz

Observation: the discussion seems to concentrate around synchronization rather than queues.

@Kangz

Fences (GPU->CPU synchronization) work differently in the native APIs but they can all be emulated on top of each other.

Could you provide more details on that?

I find the D3D12 style of updating a u64 very appealing for keeping track of resources with some sort of serial number, and would favor this approach

How would this map to Vulkan, where fences are just boolean?

My understanding of VkEvent is that it allows doing "split memory barriers" which can be optimized out by the driver or scheduled slightly more efficiently. I believe this is only useful for the last couple % of performance and wouldn't try to fit this use case into WebGPU.

I'd like this to be investigated further, but it would be great to defer shipping events until after the MVP.

kvark avatar Jun 16 '17 00:06 kvark

For the MVP, is there a reason not to implement a heavily restricted version of Vulkan's enumeratePhysicalDevices, getQueueFamilyProperties (using a physical device) and getDeviceQueue (using a queue family) to avoid future API changes? I mentioned this briefly in https://github.com/google/nxt-standalone/issues/35#issuecomment-307821413

My thought is that some restrictions could be added (to keep implementations simple) so the API still remains flexible enough to satisfy post-MVP capabilities. For example, if the MVP restricts this API such that:

  • exactly one physical device must be returned
  • exactly one* queue must be requested during logical device creation (at device creation due to Vulkan)
  • exactly one** fully-capable queue family must be returned
  • exactly one* queue may be created per queue family

Then at least the API doesn't require large changes post-MVP if these multi-device/queue capabilities are added. Obviously developers could get in the habit of doing something like enumeratePhysicalDevices()[0] or getDeviceQueue(0 /* first family */, 0) which is clearly bad. However at least post-MVP the implementation could choose the order when multiple devices/queue families are supported, in order to minimize the potential of breaking existing MVP-based applications.

* The restriction could be changed from one queue to multiple if we wanted to have separate queues for graphics/compute/blit or some other combination.
** Alternatively separate queue families could be returned for each type, and exactly one queue of each type could be requested.

grovesNL avatar Jun 22 '17 00:06 grovesNL

Fences (GPU->CPU synchronization) work differently in the native APIs but they can all be emulated on top of each other.

Could you provide more details on that?

Metal -> Vulkan / D3D12 looks easy: you just need to update the fence status / value in the callback. Vulkan / D3D12 -> Metal just needs the backend to periodically check the fence status / value and call the callbacks when needed. D3D12 -> Vulkan is trivial, so the only thing left is Vulkan -> D3D12.

I find the D3D12 style of updating a u64 very appealing for keeping track of resources with some sort of serial number, and would favor this approach

How would this map to Vulkan, where fences are just boolean?

Every time a WebGPU fence is scheduled to be updated on a queue, create or recycle a VkFence, enqueue it on the WebGPU queue, and add it to the WebGPU fence's state-tracking queue<pair<Value, VkFence>>. When the equivalent of ID3D12Fence::GetCompletedValue is called, check the VkFences in order and return the value associated with the latest passed VkFence.
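A rough sketch of that bookkeeping, purely illustrative (EmulatedFence and its methods are hypothetical names, not proposed WebGPU API):

```cpp
#include <vulkan/vulkan.h>
#include <deque>
#include <utility>

// Hypothetical helper: emulates a u64 fence value on top of binary VkFences.
class EmulatedFence {
public:
    // Called whenever a submit on the tracked queue is expected to bump the fence to `value`.
    // (A real backend would attach the VkFence to the vkQueueSubmit carrying the actual work;
    // here an empty submit is used, which signals once all previously submitted work completes.)
    void OnSubmit(VkDevice device, VkQueue queue, uint64_t value) {
        VkFenceCreateInfo info = {VK_STRUCTURE_TYPE_FENCE_CREATE_INFO};
        VkFence fence = VK_NULL_HANDLE;
        vkCreateFence(device, &info, nullptr, &fence);
        vkQueueSubmit(queue, 0, nullptr, fence);
        pending.push_back({value, fence});
    }

    // Equivalent of ID3D12Fence::GetCompletedValue().
    uint64_t GetCompletedValue(VkDevice device) {
        while (!pending.empty() && vkGetFenceStatus(device, pending.front().second) == VK_SUCCESS) {
            completedValue = pending.front().first;   // latest passed VkFence wins
            vkDestroyFence(device, pending.front().second, nullptr);
            pending.pop_front();
        }
        return completedValue;
    }

private:
    std::deque<std::pair<uint64_t, VkFence>> pending; // oldest submission first
    uint64_t completedValue = 0;
};
```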

Kangz avatar Jun 27 '17 20:06 Kangz

To followup from the call today, is there a strong reason to prefer requesting flags (with device creation failure for unsupported flags) versus querying (like the Vulkan API)? Passing flags at device creation without being able to query the device properties appears to be worse for portability -- on device creation failure the application would have to guess which flag caused the failure and retry (possibly repeatedly).

I understand the hesitancy to support querying because of portability concerns, but in this case it seems worthwhile in order to avoid virtualizing command queues (without the application knowing) in some cases. The "baseline" (as mentioned today) for implementations could just be a single, fully-capable queue family (and queue) as I described above. Many applications could always prefer the fully-capable queue family, and there shouldn't be any portability issues because of this.

grovesNL avatar Jun 28 '17 23:06 grovesNL

Just to record the consensus, during the 2017-06-28 meeting we agreed that:

  • Exposing at least async compute queues is important
  • The application should request queues at device creation time with a list of flags specifying what should be the capabilities of the queues.
  • The number of supported async compute queues could default to 0. (implicitly there would be at least one graphics | compute queue)

@grovesNL: If queues are created with flags, the application should be able to discover what the constraints on these flags are, before device creation. However I don't think mirroring 100% of the Vulkan API would help:

  • The concept of queue family is only present in Vulkan, Metal essentially has a single queue, and in D3D12 you just specify "GRAPHICS" or "COMPUTE".
  • In Vulkan queue family 0 isn't guaranteed to be the universal queue, so applications would have to search through the returned list.
  • I believe that it would be fine and cheap to emulate async-compute queues on top of a universal queue, if you force the app to submit to queues in an order that can be serialized on one queue, in which case we could expose it by default on all configurations.

Kangz avatar Jul 10 '17 20:07 Kangz

Also I don't see a use-case for signaling the CPU in the middle of a command-buffer so I suggest not having an equivalent of VkEvent.

That's incredibly useful and we use it all the time.

Would you mind providing some examples of how you use it? It would help us motivate its inclusion.


kainino0x avatar Nov 15 '18 04:11 kainino0x

Oh, I see you did so on #38.


kainino0x avatar Nov 15 '18 04:11 kainino0x