gpuweb
Resource copying/clearing/updating investigations
Native APIs provide different constraints and features when it comes to resource copies and clears, where resources can be buffers or images. In this issue, we'll try to find common ground (a least-common-denominator API) that is usable and efficient on all backends.
In Metal, all of the copy/clear operations are done via the MTLBlitCommandEncoder.
In Vulkan, these are transfer operations, supported on any queue type. They require the TRANSFER_SRC flag on the source and the TRANSFER_DST flag on the destination.
Operation table
operation/backend | Vulkan | D3D12 | Metal |
---|---|---|---|
clear buffer | vkCmdFillBuffer | views only with ClearUnorderedAccessView* | nothing |
clear image | vkCmdClearColorImage, vkCmdClearDepthStencilImage | views only with ClearRenderTargetView, ClearDepthStencilView | nothing |
update buffer | vkCmdUpdateBuffer, limited to 64k | nothing | nothing |
update image | nothing | nothing | nothing |
buffer -> buffer | vkCmdCopyBuffer | CopyBufferRegion | copy |
buffer -> image | vkCmdCopyBufferToImage | CopyTextureRegion | copy |
image -> buffer | vkCmdCopyImageToBuffer | CopyTextureRegion | copy |
image -> image | vkCmdCopyImage | CopyTextureRegion | copy |
image blit | vkCmdBlitImage | nothing | generateMipmaps |
Buffer Updates
In D3D12, the only way to update a buffer with new data coming from the CPU is to use a staging buffer (which is mapped, filled, then copied to the destination).
In Metal, a similar effect can be achieved by creating a buffer with makeBuffer that reuses existing storage.
In Vulkan, the implementation may have a fast path for small buffer updates by inlining the data right into the command buffer space. The implementation can fall back to a staging-like scheme for larger updates.
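A minimal sketch of how an implementation might choose between the two Vulkan paths, assuming only the documented vkCmdUpdateBuffer rules (dstOffset and dataSize must be multiples of 4, and dataSize is at most 65536 bytes). The function name and the "inline"/"staging" labels are hypothetical.

```python
MAX_INLINE_UPDATE = 65536  # vkCmdUpdateBuffer: dataSize must be <= 65536 bytes


def choose_update_path(dst_offset: int, size: int) -> str:
    """Return "inline" if vkCmdUpdateBuffer can be used, else "staging"."""
    if (
        0 < size <= MAX_INLINE_UPDATE
        and size % 4 == 0        # dataSize must be a multiple of 4
        and dst_offset % 4 == 0  # dstOffset must be a multiple of 4
    ):
        return "inline"
    # Anything else goes through a staging buffer plus vkCmdCopyBuffer.
    return "staging"
```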
Image Blitting
Image blits differ from image copies in that they allow format conversion and arbitrary scaling with filtering. A typical use case for blitting is mipmap generation. It is not clear to me why/how Vulkan provides this on a transfer-only queue, but the other APIs are far more (and reasonably) limited with regard to where and how they can blit surfaces.
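Mipmap generation by blitting repeatedly downsamples level i into level i + 1, halving each dimension and clamping at 1. The hypothetical helper below lists the per-level source and destination sizes that such a chain of blits (or of render passes, on a backend without blits) would cover; it only handles 2D textures.

```python
def mip_blit_regions(width: int, height: int):
    """List (source_level, source_size, destination_size) for each blit in a 2D mip chain."""
    regions = []
    level = 0
    while width > 1 or height > 1:
        # Each level is half the previous one, never smaller than 1 texel.
        next_size = (max(width // 2, 1), max(height // 2, 1))
        regions.append((level, (width, height), next_size))
        width, height = next_size
        level += 1
    return regions
```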
Alignment rules
Vulkan
VkPhysicalDeviceLimits has optimal alignments for buffer data when transferring to/from image:
- optimalBufferCopyOffsetAlignment is the optimal buffer offset alignment in bytes for vkCmdCopyBufferToImage and vkCmdCopyImageToBuffer
- optimalBufferCopyRowPitchAlignment is the optimal buffer row pitch alignment in bytes for vkCmdCopyBufferToImage and vkCmdCopyImageToBuffer
These are not enforced by the validation layers but are recommended for optimal performance.
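Note that VkBufferImageCopy expresses the row pitch as bufferRowLength in texels, so a byte alignment has to be converted. A sketch of choosing where to place image data in a staging buffer so that both optimal limits are honoured, assuming the limits are passed in directly (the values in any example are made up, not taken from a real device) and that bufferOffset must also be a multiple of the texel size, as Vulkan requires for color formats. vk_copy_layout is a hypothetical helper.

```python
import math


def align_up(value: int, alignment: int) -> int:
    return (value + alignment - 1) // alignment * alignment


def vk_copy_layout(offset, width, texel_size, offset_align, pitch_align):
    """Return (bufferOffset, bufferRowLength) meeting the optimal copy alignments.

    bufferRowLength is in texels, so each byte alignment is combined with the
    texel size via lcm before converting the padded row back to texels.
    """
    buffer_offset = align_up(offset, math.lcm(offset_align, texel_size))
    row_bytes = align_up(width * texel_size, math.lcm(pitch_align, texel_size))
    return buffer_offset, row_bytes // texel_size
```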
D3D12
The MSDN section lists the following restrictions:
- linear subresource copies must be aligned to D3D12_TEXTURE_DATA_PLACEMENT_ALIGNMENT (512) bytes
- the row pitch must be aligned to D3D12_TEXTURE_DATA_PITCH_ALIGNMENT (256) bytes
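With both D3D12 constants, packing several subresources (for example a mip chain) into one upload buffer means aligning every subresource's start to 512 bytes and every row to 256 bytes. A minimal sketch using a hypothetical staging_layout helper; it conservatively reserves a full row pitch for the last row, and a real implementation would query ID3D12Device::GetCopyableFootprints instead.

```python
PLACEMENT_ALIGNMENT = 512  # D3D12_TEXTURE_DATA_PLACEMENT_ALIGNMENT
PITCH_ALIGNMENT = 256      # D3D12_TEXTURE_DATA_PITCH_ALIGNMENT


def align_up(value: int, alignment: int) -> int:
    return (value + alignment - 1) // alignment * alignment


def staging_layout(subresource_sizes, texel_size):
    """Return ([(offset, row_pitch), ...], total_size) for packed subresources."""
    layout, cursor = [], 0
    for width, height in subresource_sizes:
        offset = align_up(cursor, PLACEMENT_ALIGNMENT)  # each footprint starts 512-aligned
        row_pitch = align_up(width * texel_size, PITCH_ALIGNMENT)  # each row is 256-aligned
        layout.append((offset, row_pitch))
        cursor = offset + row_pitch * height  # reserves a full pitch for the last row too
    return layout, cursor
```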
Proposed API
Clears
The D3D12 model appears to be the least common denominator. If we have the concept of views, we can have API calls to clear them. In Vulkan, these calls would trivially translate into direct clears. In Metal, we'd need to run a compute shader to clear the resources. Supporting multiple clear rectangles seems to complicate this scheme quite a bit, so I suggest only supporting full-slice clears.
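A Metal backend emulating full-slice clears with a compute shader would dispatch enough threadgroups to cover the whole slice. A minimal sketch of that arithmetic, assuming a hypothetical 8x8x1 threadgroup size and a clear kernel that returns early for threads outside the texture bounds; clear_dispatch is not an existing API.

```python
def clear_dispatch(width, height, depth_or_layers=1, group=(8, 8, 1)):
    """Threadgroup counts that cover a full slice (or all slices) of a texture."""
    gx, gy, gz = group
    # Round up so partial groups at the edges are still dispatched; the kernel is
    # assumed to skip threads whose coordinates fall outside the texture.
    return (
        (width + gx - 1) // gx,
        (height + gy - 1) // gy,
        (depth_or_layers + gz - 1) // gz,
    )
```

With full-slice clears this is a single dispatch per view, which is why dropping clear rectangles keeps the emulation simple.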
Updates
Given the limited native support for resource updates, I suggest not providing this API at all and instead requiring the user to manage staging resources manually.
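If updates are left to the user, a common pattern is a linear sub-allocator over a mapped staging buffer: write each payload at the next aligned offset and record a buffer-to-buffer copy for it. A minimal sketch with a hypothetical StagingArena class; mapping, submission, and fencing are left out, and the 4-byte alignment default is an assumption.

```python
class StagingArena:
    """Hypothetical linear sub-allocator over one CPU-visible staging buffer."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = bytearray(capacity)  # stands in for the mapped staging memory
        self.cursor = 0
        self.copies = []  # (staging_offset, dst_offset, size) copies to record later

    def write(self, payload: bytes, dst_offset: int, alignment: int = 4) -> bool:
        """Stage payload for dst_offset; return False if the arena is full."""
        start = (self.cursor + alignment - 1) // alignment * alignment
        end = start + len(payload)
        if end > self.capacity:
            return False  # caller submits pending copies and waits before reset()
        self.data[start:end] = payload
        self.copies.append((start, dst_offset, len(payload)))
        self.cursor = end
        return True

    def reset(self):
        """Reuse the arena once the GPU has finished the recorded copies."""
        self.cursor = 0
        self.copies.clear()
```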
Copies
All 3 APIs appear to provide copies between buffers and textures. The difference is mostly in the alignment requirements. I suggest exposing device limits for the minimum required offset/pitch alignment:
- D3D12: equal to the D3D12 constants
- Vulkan: equal to the optimal alignment limits
- Metal: some reasonable default selected by Apple
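One way such limits could be consumed is a validation step before recording a buffer/texture copy. A sketch with a hypothetical CopyLimits type and validate_buffer_texture_copy function; a D3D12 device would report 512 and 256, while the Vulkan and Metal values would come from the device as listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CopyLimits:
    """Hypothetical device limits for buffer <-> texture copies, in bytes."""
    offset_alignment: int
    row_pitch_alignment: int


def validate_buffer_texture_copy(limits, offset, row_pitch, row_bytes):
    """Return a list of validation errors (empty if the copy is allowed)."""
    errors = []
    if offset % limits.offset_alignment:
        errors.append("buffer offset is not aligned")
    if row_pitch % limits.row_pitch_alignment:
        errors.append("row pitch is not aligned")
    if row_pitch < row_bytes:
        errors.append("row pitch is smaller than one row of texels")
    return errors
```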
Blits
D3D12 doesn't support any sort of blitting, so I'm inclined to propose no workarounds here. A user-written render pass that blits textures shouldn't be any slower than an emulation inside the API anyway.
Afterword
This analysis may be incomplete; corrections are welcome directly as edits to the issue.
> In Vulkan, these are transfer operations, supported on any queue type
To clarify, clear commands are not supported on transfer queues even though they count as transfer operations: vkCmdClearColorImage requires graphics or compute queues, and vkCmdClearDepthStencilImage requires graphics support.
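The distinction can be captured as a table of required queue capabilities per command. A sketch covering only the commands mentioned in this thread (vkCmdCopyBuffer is accepted on transfer, graphics, and compute queues); Queue, REQUIRED, and can_record are hypothetical names, not Vulkan's VkQueueFlagBits.

```python
from enum import Flag, auto


class Queue(Flag):
    """Hypothetical stand-in for VkQueueFlagBits."""
    GRAPHICS = auto()
    COMPUTE = auto()
    TRANSFER = auto()


# A command may be recorded on a queue that has any one of these capabilities.
REQUIRED = {
    "vkCmdCopyBuffer": Queue.GRAPHICS | Queue.COMPUTE | Queue.TRANSFER,
    "vkCmdClearColorImage": Queue.GRAPHICS | Queue.COMPUTE,
    "vkCmdClearDepthStencilImage": Queue.GRAPHICS,
}


def can_record(command: str, queue_caps: Queue) -> bool:
    return bool(REQUIRED[command] & queue_caps)
```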
Note: Metal's texture-to-texture copies don't specify the destination size, and nowhere is it stated that it must match the source size. I assume Metal scales the result to fit the whole destination slice, but it would be great to get clarification from Apple.
I don't believe there is any scaling. It's just a copy of that rectangle into the destination, at a specified origin.
@grorg thanks for clarification! I removed the note from the body now.
It seems a little strange to me that Metal doesn't provide scaling for blits yet has a generateMipmaps routine.
Thanks for the nice analysis! I'd like to point out that in NXT we have found a way to abstract the D3D12_TEXTURE_DATA_PLACEMENT_ALIGNMENT requirement by splitting copies in two parts if needed. The code handling this can be found here. It is covered by extensive tests, so we are confident it works (though it is only implemented for 2D textures).
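The splitting idea can be sketched as follows. This is a simplified reconstruction of the technique for a single 2D slice, not NXT's actual code, and split_buffer_to_texture_copy and CopyPart are hypothetical names. The copy's buffer offset is rounded down to 512 bytes; the skipped bytes become a row/column offset inside the placed footprint. If that column offset makes the rows wrap past the row pitch, the right-hand part of each row is read from the start of the next footprint row, which needs a second copy.

```python
from dataclasses import dataclass

PLACEMENT_ALIGNMENT = 512  # D3D12_TEXTURE_DATA_PLACEMENT_ALIGNMENT
PITCH_ALIGNMENT = 256      # D3D12_TEXTURE_DATA_PITCH_ALIGNMENT


@dataclass(frozen=True)
class CopyPart:
    buffer_offset: int  # 512-aligned offset of the placed footprint
    footprint_w: int    # footprint width in texels
    footprint_h: int    # footprint height in rows
    src_x: int          # first texel column read inside the footprint
    src_y: int          # first row read inside the footprint
    dst_x: int          # column offset of this part relative to the copy origin
    width: int          # texels copied per row by this part


def split_buffer_to_texture_copy(offset, row_pitch, width, height, texel_size):
    """Express a copy at an unaligned buffer offset as one or two 512-aligned footprints."""
    assert row_pitch % PITCH_ALIGNMENT == 0
    assert PITCH_ALIGNMENT % texel_size == 0 and offset % texel_size == 0
    texels_per_row = row_pitch // texel_size
    assert 0 < width <= texels_per_row
    aligned = offset - offset % PLACEMENT_ALIGNMENT
    # The data starts (offset - aligned) bytes into the footprint: whole rows plus a column offset.
    y_off, x_bytes = divmod(offset - aligned, row_pitch)
    x_off = x_bytes // texel_size
    if x_off + width <= texels_per_row:
        # Every row of the copy fits inside one footprint row: a single copy suffices.
        return [CopyPart(aligned, x_off + width, y_off + height, x_off, y_off, 0, width)]
    # Rows wrap past the row pitch: copy the left part from the current footprint row
    # and the right part from the start of the following footprint row.
    left = texels_per_row - x_off
    return [
        CopyPart(aligned, texels_per_row, y_off + height, x_off, y_off, 0, left),
        CopyPart(aligned, width - left, y_off + height + 1, 0, y_off + 1, left, width - left),
    ]
```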
With respect to the proposed API:
- Having no way to do "updates" would work, but we have found it extremely useful to have an immediate nxtBufferSetSubData for tests (not even a queued operation). If we go with no updates, someone will need to make a "blessed" helper library to do good-enough buffer updates (outside of compute/render passes, of course).
- No blits sounds OK.
- Copies: this ties into another topic; ideally there would be default constraints that are validated and work on all platforms. Additionally, we could provide a way for an application to discover smaller constraints and explicitly require them at device creation time. From our experiments, we think only the rowPitch constraint will need to be present (and maybe image height for 2D arrays / 3D textures).
- Clears: if clears are done outside of compute/render passes, they could be emulated with an empty MTLRenderCommandEncoder. I'd be interested in knowing why Metal didn't find it necessary to allow clears in MTLBlitCommandEncoder.
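The Copies point above (portable defaults, plus opt-in smaller constraints requested at device creation) could look roughly like this. The default of 256 for the row pitch alignment is an assumption that mirrors D3D12_TEXTURE_DATA_PITCH_ALIGNMENT, and create_device_limits is a hypothetical function.

```python
# Hypothetical portable default that every backend is assumed to satisfy
# (mirrors D3D12_TEXTURE_DATA_PITCH_ALIGNMENT).
DEFAULT_LIMITS = {"row_pitch_alignment": 256}


def create_device_limits(adapter_limits, requested=None):
    """Start from the portable defaults; accept a requested alignment only if the adapter supports it."""
    limits = dict(DEFAULT_LIMITS)
    for name, value in (requested or {}).items():
        # A requested alignment is usable only if it is a multiple of what the adapter needs.
        if value <= 0 or value % adapter_limits[name]:
            raise ValueError(f"{name}={value} is not supported by this adapter")
        limits[name] = value
    return limits
```

An application that never requests anything gets the validated portable default, so code written against the default runs everywhere.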
> It seems a little strange to me that Metal doesn't provide scaling for blits yet has a generateMipmaps routine.
It sounds like it could be a built-in compute shader.
For any kind of resource (texture or buffer), easy clearing would be highly desirable for compute shader use cases, which don't have any way of clearing via new render passes. I hit this quite a bit in my wgpu(-rs) based fluid sim, where I have accumulating or temporary targets (either buffers or volume textures) that need periodic clearing. Another, more common use case would be a histogram that is computed per frame using a compute shader: every frame, the histogram buffer needs clearing.
Today, all of these cases require specialized clear passes and, in many cases (much to the annoyance of users), bind groups, layouts, etc. Ideally, I think WebGPU would land a clear function on the encoder for both textures and buffers, comparable to the copy texture/buffer methods it already provides.
> Today, all of these cases required specialized clear passes and in many cases (even more to the annoyance of any users) bindgroups, layouts etc.
Can you explain how it requires bind groups, layouts, etc.? beginRenderPass for clearing doesn't need them. Or do you want to clear a texture inline in a compute pass? If that's the case, you can do it with your own dispatches and encapsulate them in a function. I agree that the implementation could do it for you, but that seems like something we can add later (so we can ship the first version of WebGPU faster).
Yes, I was referring to the use case of having a dedicated compute pass. For everything where beginRenderPass works, a separate clear function isn't truly needed anyway.
It's of course true that a user can encapsulate something like this in a function, but this way of clearing requires a lot of bookkeeping that the clear functions of D3D12/Vulkan don't need: for every type (image format, image dimension, etc.) a special compute pipeline is needed, and every single resource that needs clearing also needs a bind group containing just this one resource.
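The bookkeeping described above amounts to two caches: one pipeline per (format, dimension) pair and one bind group per cleared resource. A minimal sketch of such a user-side helper; ClearHelper and the two factory callbacks are hypothetical placeholders for the real pipeline and bind group creation calls.

```python
class ClearHelper:
    """Hypothetical user-side cache for compute-based clears."""

    def __init__(self, create_pipeline, create_bind_group):
        self._create_pipeline = create_pipeline      # (format, dimension) -> pipeline
        self._create_bind_group = create_bind_group  # (pipeline, resource) -> bind group
        self._pipelines = {}
        self._bind_groups = {}

    def prepare(self, resource_id, fmt, dimension):
        """Return (pipeline, bind_group) for clearing resource_id, creating them once."""
        key = (fmt, dimension)
        if key not in self._pipelines:
            # One specialized clear pipeline per format/dimension combination.
            self._pipelines[key] = self._create_pipeline(fmt, dimension)
        pipeline = self._pipelines[key]
        if resource_id not in self._bind_groups:
            # One bind group per resource, holding only that resource.
            self._bind_groups[resource_id] = self._create_bind_group(pipeline, resource_id)
        return pipeline, self._bind_groups[resource_id]
```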
@Wumpf would you be willing to write down a more concrete proposal (in a separate issue), with the following:
- description of the use cases, i.e. clearing buffer and texture data before doing compute passes with them, where RENDER_ATTACHMENT is not even needed otherwise
- the implementation paths and costs on each platform, and a comparison to what the user could do on their own
- suggested API
Metal has a BlitCommandEncoder::fillBuffer function.
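Given native buffer fills on Vulkan and Metal, a buffer clear on the encoder could lower roughly as sketched below. Everything here is hypothetical: the clearBuffer shape, the command tuples, and the 4-byte rule (borrowed from vkCmdFillBuffer's dstOffset/size requirements) are assumptions. D3D12 has no direct buffer fill, so the sketch assumes a UAV has already been created over the range before ClearUnorderedAccessViewUint is used.

```python
def lower_clear_buffer(backend: str, offset: int, size: int):
    """Describe the native command a hypothetical clearBuffer(offset, size) could lower to."""
    # Assumed API rule, borrowed from vkCmdFillBuffer: offset and size are multiples of 4.
    if offset % 4 or size % 4:
        raise ValueError("offset and size must be multiples of 4")
    if backend == "vulkan":
        # vkCmdFillBuffer(commandBuffer, dstBuffer, dstOffset, size, data)
        return ("vkCmdFillBuffer", offset, size, 0)
    if backend == "metal":
        # fillBuffer(_:range:value:) takes an NSRange (location, length) and a byte value.
        return ("fillBuffer", (offset, size), 0)
    if backend == "d3d12":
        # Requires a UAV over [offset, offset + size) in a descriptor heap first.
        return ("ClearUnorderedAccessViewUint", offset, size, (0, 0, 0, 0))
    raise ValueError(f"unknown backend {backend!r}")
```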