Cannot upload/download to UMA storage buffer without an unnecessary copy and unnecessary memory use
Background
WebGPU currently has 2 buffer upload facilities: GPUBuffer.mapAsync() and GPUQueue.writeBuffer().
For GPUBuffer.mapAsync(), there is currently the restriction that a mappable buffer cannot be used as anything else, other than COPY. This means that, in order to be useful, an application has to allocate 2 buffers - one for mapping and one for using. And, if an application wants to round-trip data through a shader, they have to allocate 3 buffers - one for the upload, one for the download, and one for the shader. Therefore, in order to use mapAsync(), an application needs to double (or triple) their memory use and add one or two extra copy operations. On a UMA system, neither the extra allocation nor the copy is necessary, which means there's both a perf and a memory cost to using mapAsync() on those systems. What's more, because the application is explicitly writing this code, there's not really anything we can do to optimize out the extra buffer allocation / copy operation.
On the other hand, GPUQueue.writeBuffer() is associated with a particular point in the queue's timeline, and can therefore be called even when the destination buffer is in use by the GPU. This means that the implementation of writeBuffer() is required to copy the data to an intermediate invisible buffer under the hood, even on UMA systems, and then schedule a copy operation on the queue to move the data from the intermediate buffer to the final destination. This extra allocation and extra copy operation don't necessarily need to exist on UMA systems. (GPUQueue.writeBuffer() is a good API in general because of its simple semantics and ease of use, but it does have this drawback.)
It would be valuable if we could combine the best parts of GPUBuffer.mapAsync() and GPUQueue.writeBuffer() into something which doesn't require an extra allocation or copy on UMA systems. This kind of combination would have to be something that isn't UMA-specific, but would work on both UMA and non-UMA, and UMA systems would be able to avoid extra allocations/copies under-the-hood.
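To make the cost concrete, here is a minimal sketch of both paths as an application writes them today (TypeScript; the helper names are illustrative):

```ts
// Today's mapAsync() upload path: a dedicated staging buffer plus an
// explicit copy into the buffer the shader actually uses.
async function uploadViaMapAsync(
  device: GPUDevice,
  data: Float32Array,
  dst: GPUBuffer, // assumed created with STORAGE | COPY_DST
): Promise<void> {
  // Mappable buffers may only also carry COPY usage, so a second
  // allocation is unavoidable, even on UMA.
  const staging = device.createBuffer({
    size: data.byteLength,
    usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
  });
  await staging.mapAsync(GPUMapMode.WRITE);
  new Float32Array(staging.getMappedRange()).set(data);
  staging.unmap();

  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(staging, 0, dst, 0, data.byteLength);
  device.queue.submit([encoder.finish()]);
}

// Today's writeBuffer() path: one call, but the implementation stages the
// bytes in a hidden buffer and schedules the copy itself.
function uploadViaWriteBuffer(device: GPUDevice, data: Float32Array, dst: GPUBuffer): void {
  device.queue.writeBuffer(dst, 0, data);
}
```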
Goals
- The "async" part of
GPUBuffer.mapAsync()would be valuable, because that allows the implementation to not have to stash any data due to the destination buffer being busy. - The "map" part of
GPUBuffer.mapAsync()would be valuable because it allows the array buffer to be backed directly by GPU memory, thereby potentially avoiding another copy on UMA systems. - The "queue" part of
GPUQueue.writeBuffer()would be valuable, because non-UMA systems would need to schedule an internal copy to the destination, and specifying the queue gives them a place to do that.
Proposal
I think the most natural solution to this would be:
- Give `mapAsync()` an extra `GPUQueue` argument. (`getMappedRange()` and `unmap()` will implicitly use this queue.) We could also say that the queue is optional, and if it's unspecified, the device's default queue will be used instead.
- Relax the requirement that the only other usage a mappable buffer can have is COPY.
That's it!
- On a UMA system, you'd be able to map the destination (storage) buffer directly. No copies, no extra allocations; it's living the UMA dream.
  - For reading, `mapAsync()` would just ignore its `GPUQueue` argument.
  - For writing, `mapAsync()` would use its `GPUQueue` argument to schedule a `clearBuffer()` command over the relevant region of the buffer. After the clear operation is complete, the map promise would be resolved.
- On a non-UMA system:
  - For reading, `mapAsync()` would schedule a copy from the source (storage) buffer to a temporary buffer using the specified `GPUQueue`, and the map operation would proceed just like normal on the temporary buffer. This is exactly what an author would have had to do themself.
  - For writing, `mapAsync()` would just stash the queue, map a temporary buffer, and wait for `unmap()` to be called. When `unmap()` is called, it would schedule a copy on the stashed queue from the temporary buffer to the destination buffer. This is, again, exactly what an author would have had to do themself.
It's important to note that this proposal doesn't restrict the amount of control a WebGPU author has. If an author wants to allocate their own map/copy buffer and explicitly copy the data to/from it on its way to its final destination (as they would do today), they can still do that, and no invisible under-the-hood temporary buffers would be allocated.
This proposal also has a natural path forward for read/write mapping.
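For illustration, a sketch of what an upload could look like under this proposal. The queue argument to mapAsync() is hypothetical (it does not exist in WebGPU today), so this won't type-check against current definitions:

```ts
// Hypothetical API: mapAsync() takes a GPUQueue, and the mappable buffer
// can also carry STORAGE usage, so it is the very buffer the shader uses.
async function uploadDirect(device: GPUDevice, data: Float32Array, buffer: GPUBuffer) {
  // On UMA this maps `buffer`'s own memory. On non-UMA the implementation
  // maps a temporary buffer instead, and unmap() schedules the copy into
  // `buffer` on the given queue.
  await buffer.mapAsync(GPUMapMode.WRITE, 0, data.byteLength, device.queue);
  new Float32Array(buffer.getMappedRange()).set(data);
  buffer.unmap();
}
```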
Overall I'm worried about modifying the buffer mapping mechanisms this late, when it was perhaps the single most difficult part of the API to find a good design for and reach consensus on. I think a UMA optional feature would make a
We discussed the slight inefficiencies that UMA has with the current mapping mechanism multiple times in F2F meetings, and https://github.com/gpuweb/gpuweb/issues/605 suggests a UMA feature could be added to optimize this later (later could be now). The proposal is interesting, but it has some issues.
Side note: there is also mappedAtCreation, which was added so that the initial upload of data into buffers can be made perfectly efficient on UMA systems (up to the copies necessary for process separation).
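For reference, a minimal sketch of that path (this is the real, shipping API; the helper name is illustrative):

```ts
// mappedAtCreation: write initial contents before the buffer is ever used,
// with no MAP_* usage required and no staging buffer in application code.
function createInitializedBuffer(device: GPUDevice, data: Float32Array): GPUBuffer {
  const buffer = device.createBuffer({
    size: data.byteLength, // must be a multiple of 4; Float32Array always is
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
    mappedAtCreation: true,
  });
  new Float32Array(buffer.getMappedRange()).set(data);
  buffer.unmap();
  return buffer;
}
```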
Multiple buffers per GPUBuffer
It breaks down the model that one WebGPU buffer == one underlying API buffer. This is a pretty useful thing to keep, because it makes it very clear to the developer what the memory cost of things is. That you can have temporary staging for mappedAtCreation, and potentially shmem wrapped in the ArrayBuffer given to JS, is already very difficult for developers to reason about in terms of cost.
Cost of consistency for MAP_WRITE
Currently MAP_WRITE buffers give you an ArrayBuffer that contains the current content of the buffer. Since the buffer can only be written by Javascript, no copies are ever needed to update the content of the buffer; it's just the ArrayBuffer wrapping shmem shared between the GPU and Web processes. If the GPU can write to the buffer, then we need to copy data from the UMA buffer to the shmem (or even worse, from VRAM to a readback buffer to shmem).
On the other hand, if we say that mapAsync(MAP_WRITE) always zeroes the buffer, then in most cases the CPU has to zero the buffer, since the UMA/readback buffer isn't shared with the Web process. Either there's a memset(0) or a memcpy from the UMA buffer to the shmem.
Consistency for MAP_READ
What happens when Javascript writes into a buffer that's mapped for reading? Assuming you are able to create an MTLBuffer from a shmem FD to reduce the number of copies as much as possible, the writes that Javascript did would all of a sudden become visible to the GPU, while in all other configurations Javascript writing to the buffer doesn't have any visible effect for the GPU.
Relatively small gains and a feature proposal
The gains from the proposal you suggested seem small: if you have a large amount of data to initialize buffers with, then you can use mappedAtCreation, which is the optimal path. If you need to modify part of a buffer while it's in use, then you have to schedule a copy, because mapping is an ownership transfer of the full buffer (I tried to figure out how to do sub-range mapping efficiently but gave up).
So the cases this helps are when you need to upload data to a buffer that's not being created, but also not currently in use by the GPU. This should be a fraction of the actual buffer transfers. It still might be worth speccing an optional feature, but not modifying the core buffer mapping spec.
The optional "UMA" feature could:
- Lift the restriction for `MAP_WRITE` to allow any other read-only usages.
- `MAP_READ` already allows all the write-only usages. But maybe more are added in the future, so the extension would also lift that? Or it allows it with any other usages, but assumes there is always a UMA -> shmem copy happening in the GPU process (so that JS writes are never made visible to the GPU).
I was going to have comments but @Kangz covered everything I was going to say and more.
It breaks down the model that one WebGPU buffer == one underlying API buffer.
This isn't true. This proposal requires scratch space, certainly, but so does writeBuffer(). It's no worse.
What happens when Javascript writes into the buffer mapped for reading?
This is a good point! I suppose this proposal only makes sense for read/write buffers (which we don't have today, but I think has a natural path forward).
It's relatively rare for an application to actually need read/write mapping. Sure, we could add it for this use case, but applications would still need to know which one to use and explicitly switch between them based on whether the adapter is UMA or not.
This isn't true. This proposal requires scratch space, certainly, but so does writeBuffer(). It's no worse.
writeBuffer is quite explicitly an implementation-managed ringbuffer. But it is not tied to a GPUBuffer; it's only GPUDevice extra memory. Plus its data doesn't need to be persistent. The implementation can destroy the ringbuffers when there is memory pressure, while extra backings for a GPUBuffer would have to stay; otherwise you could get an OOM trying to map the buffer.
I'm not saying I'm for or against any proposal here, only voicing that I agree what's in the API today does not fully satisfy workflows with dynamic data moving across host/device, and that this will be a performance issue in real-world usage. In compute workloads, getting data back from the device is a major part of the upload -> compute -> download flow, and until we have a GPUQueue.readBuffer (🙏 please!) this results in non-trivial complexity and bloat in user code.
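For reference, here's a minimal sketch of the readback dance being described, i.e. roughly what a hypothetical GPUQueue.readBuffer() would wrap (helper name illustrative):

```ts
// Current readback path: copy into a dedicated MAP_READ staging buffer,
// then map it once the queue has executed the copy.
async function readBack(
  device: GPUDevice,
  src: GPUBuffer, // assumed created with COPY_SRC
  size: number,   // multiple of 4
): Promise<ArrayBuffer> {
  const staging = device.createBuffer({
    size,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });
  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(src, 0, staging, 0, size);
  device.queue.submit([encoder.finish()]);

  // Resolves only after the submitted copy has completed.
  await staging.mapAsync(GPUMapMode.READ);
  const result = staging.getMappedRange().slice(0); // copy out before unmap
  staging.unmap();
  return result;
}
```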
But it is not tied to a GPUBuffer, it's only GPUDevice extra memory.
👍 IMO having the implementation manage the ringbuffer with writeBuffer/readBuffer and incurring a copy is acceptable if the alternative is managing exclusive upload or download buffers in user code (I found the spec detail that indicates a buffer cannot be both, resulting in user staging pools needing double the memory for bidirectional transfer). This way, multiple libraries trying to perform upload/download are not each keeping around large GPUBuffers for this purpose.
you could get an OOM trying to map the buffer.
This is no worse than the possibility of an OOM when trying to writeBuffer(), though...
Sure, that's a possibility as well, although implementations could stall to free staging space if they really wanted to.
The point here wasn't that writeBuffer can or cannot OOM; it is that you want to give developers a way to do transfers without the possibility of triggering an unfixable OOM. Buffer mapping can do that, since OOM only happens at buffer creation. If you choose to make staging/readback buffers transient, then you lose this control in the application, because you can OOM on mapAsync as well. It is possible to decide to do that, but we need to be cognizant of all the tradeoffs we are making. In this whole comment thread I suggest it is a bad idea for many reasons, including that OOM issue.
I agree with the concerns about managing temporary buffers for mapping expressed by @Kangz. Their lifetime is attached to mapping, and it's worse than the ring buffer we currently have for writes.
I also agree with @litherum that it would be good to be able to avoid copies on systems that can do that. An optional feature for UMA architectures seems like the right way to proceed. It would basically lift the restriction on usages for buffers, allowing MAP_READ+MAP_WRITE+anything else.
As for the queue argument for mapping, this correlates with https://github.com/gpuweb/gpuweb/issues/1977#issuecomment-884436273. It's probably needed.
I'm happy to help by writing an optional feature for UMA that allows MAP_READ + WRITE.
It would be pretty unfortunate if authors had to opt-in to avoid using 2x memory on UMA machines.
I don't think anyone is disagreeing about that, but if we're going to avoid it we're going to need a proposal that works. I don't think we're getting any closer to one.
WebGPU meeting minutes 2022-02-23
- KN: nobody satisfied with current state, but nobody has a better idea. Everyone's resigned to this fate except Myles. :)
- KG: one thing that has changed since it was first discussed - more common today than 2 years ago to get adapters that let you map CPU read and host read/device use - used to be UMA archs only, and some AMD cards - but has changed now.
- KN: right. Intel doesn't even have some of these options (host-coherent + device-coherent?). Think we could do this on Intel regardless.
- KG: if something we can't support - don't want to fragment the ecosystem by making you write 2 paths. If things have changed - still do need that.
- KN: we don't have solution for doing this underneath the hood of the application. Can do it with an extension. Would like to. Maybe we should do it for 1.0. Would need separate code path for application.
- KR: WebGL doesn't have the ability to optimize for this and performance in this area is basically fine. I think WebGPU will also perform fine in general without this optimization, and since applications will have to add a new code path to take advantage of it, think this should be pushed out to post-V1.
So here's the proposal for the extension: what I wrote above
The optional "UMA" feature could:
- Lift the restriction for MAP_WRITE to allow any other read-only usages.
- MAP_READ already allows all the write-only usages. But maybe more are added in the future so the extension would also lift that? Or it allows it with any other usages, but assumes there is always a UMA -> shmem copy happening in the GPU process (so that JS writes are never made visible to the GPU).
With the addition that if readonly ArrayBuffers become a thing, then we can lift all restrictions on MAP_READ (except MAP_WRITE? not sure), by making the ArrayBuffer returned by mappings for reading be readonly.
WebGPU meeting minutes 2022-03-16
- Myles has been writing lots of webgpu patches instead of thinking about this; can we defer a week?
- CW: UMA storage buffers - I made a proposal a couple lines long. We can have a UMA extension - enable map() with writable buffers with READ_ONLY usage. And vice versa.
- CW: If later we have read-only ArrayBuffers we can have other functionality.
- CW: would be nice if we were able to say - you can have any usage and it just works - but not possible on D3D, and isn't best thing to do. Lots of complexity, e.g. with discrete GPU and also cross-process. Can have a memory "thing" which spans all 3 items - GPU, GPU process, renderer process. Also need consistent behaviors on all systems. Writes to JS have to be visible to JS for readable buffers. Complicated.
- CW: that's why I think only way for proper UMA support is via an extension.
- MM: would this extension also be present on discrete cards? And extension would say, your writes might not be present if you read from it?
- CW: no, behavior should be consistent always, regardless of extension being enabled. That's the main goal.
- MM: so app needs: if (uma) { … } else { … }?
- CW: yes. App can get best behavior on UMA and desktop - there are cases today where you can take the optimal path: buffer mapped at creation, or updating a buffer in pipelined fashion during GPU execution.
- CW: case not handled: big buffer, need to change data after creation, but not always used by the GPU. Don't know when apps would do this. Useful to think of UMA extension because it helps that case. We should already be pretty optimal in most cases though.
- MM: think argument makes sense. Not 100% sure I agree. First statement about 2 buffer upload mechanisms is false though - there's a third, mappedAtCreation. That would work when streaming data from CPU to GPU. Not the other way around though.
- MM: backward direction is definitely less common. Need to do more research.
- MM: other thing - we should try to describe somewhere that mappedAtCreation's expected to be more performant than creating buffer and mapping it.
- CW: should be in non-normative text at least. Brandon made a best practices doc on uploading data with WebGPU. writeBuffer - but mappedAtCreation's pretty good, too.
- BJ: that doc's in flux - please suggest improvements.
- MM: committed to our repo?
- BJ: not yet. Not a good time.
- MM: link to it please?
- BJ: will do. https://github.com/toji/webgpu-best-practices/blob/main/buffer-uploads.md
- CW: think everyone wants to make UMA work amazingly well. But, amazingly hard while keeping consistent behavior from the JS side, and keeping D3D constraints in mind, and a single source for GPUBuffer, etc. Optimizations you want to do in the browser later too. Happy to discuss details with people. Wish we had a better story for UMA, but I can't find one.
- MM: believe you, just don't think we should say it's impossible.
- CW: also happy to discuss offline more. Maybe in office hours.
As discussed in the meeting, moving to post-V1 polish since the only proposal so far is an optional feature.
@Kangz: You appear to have a broken sentence here:
I think a UMA optional feature would make a
I think the gist was "UMA would make sense to put in an optional feature"
GPU Web 2023-06-07/08 (Pacific time)
- Recap the design constraints for this problem
- MM: wanted to touch base before going off and doing a bunch of engineering
- We're interested in UMA working well
- Interested in a potential solution where the same code would "do the right thing" on UMA and non-UMA
- This group posited that that was not possible
- I think it might be
- Want to nail down what the original objections were
- KR: from our side we need enga@ and cwallez@ present for the conversation. Would like to advance this on the Github issue or mailing list.
- Postpone for a week?
- KG: I can try to synthesize
- KG: on non-UMA archs you sometimes need 2 copies, and on UMA you can get to 1 copy. How to pipeline, prioritizing bandwidth/latency, is where Corentin and I ran aground trying to find a single API to do both.
- KG: My position - if you try to figure out the API for these things, you'll either prove us wrong or right, and that's great
- MM: that's reassuring. Think we're in a different situation now than 2021. Now we have 2 ways of getting data on the card. I'd be coming back with a 3rd way. Adding a 3rd way isn't great for the platform, but if an app cares about the tradeoffs, we'd have more options for them.
- Continue this next week.
The current restriction on buffers created with map flags also causes problems on NUMA (non-UMA) architectures. Small, frequently updated uniform buffers can be stored in system memory without significant impact on performance. In addition, with the advent of Resizable BAR and SAM, it is possible to write data directly to VRAM using the CPU (we can even write textures directly to VRAM and later change the access pattern from linear to swizzled, for better bandwidth).
GPU Web WG 2024-10-29/30 Mountain View F2F
- Interested: Mike Wyrzykowski
- AB: Issue bigger than just memory bandwidth. Also serious performance impact on barriers. (no one was taking notes here). End up with all-graphics–to–all-graphics barrier. 15% regression just from barriers.
- …
- AB: If we know the buffer is not scheduled for use, but it's mappable (in vulkan terms, which is basically all buffers on UMA), and any gpu writes are visible to the host, we know we can skip the upload buffer and write directly. For all of these conditions to be true, there is a hidden fast path where developers have to keep one buffer per frame in flight.
- CW: This is why it gets a star. Being in the browser makes it more complicated…
- JS: Does UMA mean zero-copy? Reduced copies?
- AB:
- CW: In WebGPU we have this idea we call "triply mapped buffers" which are visible to the GPU, the browser "GPU process", and the browser "content process" all at once. Possible but very experimental. On Windows may stress the memory system of the OS. On Vulkan not super clear how many devices support it and how well. So unfortunately can't guarantee mappable buffers get triply mapped.
- JS: If there's one copy from content process to GPU process and zero-copy after that, is that OK?
- AB: Zero copy ideal, but eliminating synchronization barriers really good. I've been profiling this mostly in native WebGPU so no browser processes involved.
- CW: Overall, yes please: want to avoid the expensive memory barriers. Very interested in having your input, and as an oracle of whether it actually fixes the memory barriers.
- MW: Many devices with low memory limits like 1.5GB for a web process, zero-copy helps a lot with this [reducing memory pressure].
- KG: whiteboard: UMA pool sizes:
  - None: Harsh Reality
  - Partial: uniforms, dynamic vert data
    - On discrete, have about 256MB memory you can DMA from CPU
  - Full: texture data uploads
    - But with ReBAR you can get large DMA regions.
- JS: Addresses barrier issues?
- CW: No. Think we need to collect all of the problems and try to find a solution. It won't be perfect.
- CW: Know in the past we said we didn't want to expose on discrete because it's pessimizing. But maybe not true. OK if the "UMA" API is slower on discrete.
- KG: Need data. If this "None" category is empty, then it becomes a very interesting thing to have some partial UMA with possible performance cliff past 256MB or whatever.
- CW: Not sure how …
- JS: But doesn't solve synchronization bubbles.
- CW: Since our mappable buffers can't be used for anything else, the user has to enqueue a copy. … applications can have rolling buffers
Hi all,
I've just submitted the proposal for the WebGPU optional feature buffer-map-extended-usages, which adds support for creating mappable buffers with any other buffer usages.
Motivation
- Reduce the memory footprint when transferring a large amount of data between the CPU and GPU.
- Eliminate the extra copy between the staging buffer and the destination buffer, and the related barriers.
New Features
- Allows creating a mappable GPUBuffer (one with `MAP_READ` or `MAP_WRITE` usage) with any other GPUBufferUsage flags, including the use of both `MAP_READ` and `MAP_WRITE`.
- Allows `READ|WRITE` as a valid value for the `mode` parameter in `GPUBuffer.mapAsync()` (see the sketch below).
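A sketch of what creation and mapping could look like with the proposed feature enabled (the feature name and the READ|WRITE map mode come from this proposal and are not in the current spec):

```ts
// Sketch, assuming the proposed "buffer-map-extended-usages" feature.
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter!.requestDevice({
  requiredFeatures: ["buffer-map-extended-usages" as GPUFeatureName],
});

// A single buffer that is both mappable and usable as a storage buffer.
const buffer = device.createBuffer({
  size: 1024,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.MAP_WRITE | GPUBufferUsage.STORAGE,
});

// READ|WRITE mapping: read current contents and modify them in place.
await buffer.mapAsync(GPUMapMode.READ | GPUMapMode.WRITE);
const view = new Float32Array(buffer.getMappedRange());
view[0] += 1; // no staging buffer, no extra copy on UMA
buffer.unmap();
```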
Implementation Details
Platform Requirements
buffer-map-extended-usages can be supported efficiently on backends with a UMA architecture and the conditions below.
| Backends | Preferred Requirements |
|---|---|
| D3D12 | D3D12_FEATURE_DATA_ARCHITECTURE.UMA && D3D12_FEATURE_DATA_ARCHITECTURE.CacheCoherentUMA |
| Metal | [MTLDevice hasUnifiedMemory] == true |
| Vulkan | VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT \| VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT \| VK_MEMORY_PROPERTY_HOST_COHERENT_BIT \| VK_MEMORY_PROPERTY_HOST_CACHED_BIT |
Notes
- D3D12: `D3D12_FEATURE_DATA_ARCHITECTURE.CacheCoherentUMA == true` is recommended according to this document.
- Vulkan: `HOST_VISIBLE_BIT` is required for access on the CPU; `DEVICE_LOCAL_BIT` is preferred for access on the GPU; `HOST_CACHED_BIT` is preferred because such memory types provide cached storage on the CPU; `HOST_COHERENT_BIT` is preferred because without it we must manually call `vkFlushMappedMemoryRanges()` when the CPU has finished writing data and `vkInvalidateMappedMemoryRanges()` when GPU data has been written back.

According to this document, "do not read back data from uncached memory on the CPU", so we require at least `DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_CACHED_BIT` for buffer-map-extended-usages, and `DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_CACHED_BIT | HOST_COHERENT_BIT` is preferred to guarantee the performance of CPU reads. `DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_CACHED_BIT | HOST_COHERENT_BIT` is only available on 39.85% of Vulkan devices, with no Mali GPUs supported, while the coverage of `DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_CACHED_BIT` is 56.32%.
Data Transmissions in Browser
| Map Mode | Buffer is read-only on GPU | Return value of getMappedRange() | Data in the returned array buffer after unmap() |
|---|---|---|---|
| MAP_READ | - | Buffer's current values | Discarded |
| MAP_WRITE | true | Buffer's current values | Stored in the GPUBuffer |
| MAP_WRITE | false | The default initialized data (zeros) or data written by the webpage during a previous mapping | Stored in the GPUBuffer |
| MAP_READ\|MAP_WRITE | - | Buffer's current values | Stored in the GPUBuffer |
Related Topics
MAP_WRITE + any other read-only GPUBufferUsage flags in WebGPU Core SPEC
On non-UMA architectures the above combination can also be implemented with an internal staging buffer and an implicit buffer-to-buffer copy inside the WebGPU implementation.
- I prefer not supporting it in the WebGPU core spec, as such a staging buffer could be neither managed nor even noticed by the developer, while I think WebGPU should give the developer more control over GPU resources.
MAP_WRITE + any other GPUBufferUsage flags with zero GPU copy as an optional WebGPU feature
The above combination can be supported on platforms that support buffer-map-extended-usages, or on ones with Resizable Base Address Register (ReBAR) enabled.
ReBAR is a PCIe capability that allows a PCIe device, such as a discrete graphics card, to negotiate the BAR size to optimize system resources. Without ReBAR the CPU can only access a small portion (256MB) of GPU memory at a time. With ReBAR the CPU can access the entire GPU memory (VRAM).
Below are the platform requirements that can explicitly take advantage of ReBAR:
| Backends | Requirements |
|---|---|
| D3D12 | Supports heap type D3D12_HEAP_TYPE_GPU_UPLOAD: D3D12_FEATURE_DATA_D3D12_OPTIONS16.GPUUploadHeapSupported == true (since Windows 11 24H2) |
| Metal | None |
| Vulkan | VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT \| VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT \| VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, and VkMemoryHeap.size is the whole GPU memory instead of 256MB |
- For a WebGPU implementation with a standalone GPU process this is fine, as we always need to copy data from the CPU to the GPU.
- For a single-process WebGPU implementation there is a problem: there is no CPU-side cache for ReBAR memory, so it will be very slow to read any data back from that mapped pointer.
DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_CACHED_BIT | HOST_COHERENT_BIT [= 15] is only available on 39.85% Vulkan devices with no Mali GPU supported, while the coverage of DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_COHERENT_BIT [= 7] is 56.32%.
That link is to DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_CACHED_BIT = 11.
56%, including 45 AMD, 20 NVIDIA, 43 Intel, 785 ARM, 1234 Qualcomm, 227 ImgTec.
DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_COHERENT_BIT = 7:
85%, including 849 AMD, 323 NVIDIA, 151 Intel, 802 ARM, 1168 Qualcomm, 236 ImgTec.
BTW, both have a few extra devices which used extension versions of the same capability bits.
Can you give some examples of how you expect users to use the new APIs? For when users want to support the fastest path for UMA/ReBAR systems, but still also support other systems. I'm having a hard time figuring out what the proposal means in practice.
I'm very excited for this capability though, as without dedicated transfer queues or the proposed changes, there's no way to upload to buffers without blocking the rendering queue.
For texture uploads, we still need something like VK_EXT_host_image_copy to not take up the device queue while uploading textures.
Can you give some examples of how you expect users to use the new APIs? For when users want to support the fastest path for UMA/ReBAR systems, but still also support other systems. I'm having a hard time figuring out what the proposal means in practice.
That's exactly what we'd like to discuss in the WG.
In my opinion WebGPU should provide more control over the underlying GPU resources, so we should support creating mappable buffers with any other usages as an optional feature, but that means we have to use different code paths to handle buffer uploads on different platforms.
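For illustration, a sketch of what those two code paths might look like in an application (the feature name is from this proposal; `dst` is assumed to be created with the usages each path needs):

```ts
// Sketch: pick the upload strategy based on the optional feature.
async function upload(device: GPUDevice, data: Float32Array, dst: GPUBuffer) {
  if (device.features.has("buffer-map-extended-usages")) {
    // UMA path: `dst` was created with MAP_WRITE | STORAGE; map it directly.
    await dst.mapAsync(GPUMapMode.WRITE);
    new Float32Array(dst.getMappedRange()).set(data);
    dst.unmap();
  } else {
    // Core path: let the implementation stage through its ring buffer.
    device.queue.writeBuffer(dst, 0, data);
  }
}
```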
I don't understand the requirement for HOST_CACHED_BIT. We need a fast zero-copy path to upload data from the CPU to the GPU. A developer would write data CPU -> GPU, so there's no need for CPU-side caching. Write combining will do everything we need.
Of course programmers have to be careful with write combining to avoid performance pitfalls, but graphics programmers have been used to that since the 20-year-old DirectX and OpenGL write-only buffer maps (those return a write-combined memory pointer).
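The discipline in question, sketched (illustrative helper): treat a write-mapped range as write-only and fill it with sequential bulk writes, never reading it back.

```ts
// Write-combined-friendly fill of a mapped buffer: bulk, forward-only writes.
function fillMapped(buffer: GPUBuffer, data: Float32Array): void {
  const dst = new Float32Array(buffer.getMappedRange());
  dst.set(data); // good: one sequential write
  // Avoid on write-combined memory: reads of the mapping, e.g.
  //   dst[0] += 1; // read-modify-write issues slow uncached CPU reads
}
```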
Mobile GPUs have heaps of this type: DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_COHERENT_BIT
There's no support for HOST_CACHED + HOST_COHERENT on many devices. Could we remove the cached requirement?
From the minutes (to be posted here soon):
- CW: problem with just coherent + non-cached is: we're hoping to "triply map" buffers. Create shmem in JS process, send to GPU process, use API-specific method to import that memory into GPU address space. VK_external_memory_host (_fd?). Also Metal and D3D12. If we do this, JS has an ArrayBuffer that becomes uncacheable. Seems weird to expose uncacheable memory to JS. [surprising performance properties]
Of course we could avoid triply-mapping this type of buffer on systems where we can't get HOST_CACHED. As Jiawei's investigation says these bits are only "preferred", not required.
Yes, the minus is that you need to understand the performance implications of write-combined memory. But my argument is that all low-level gfx programmers already understand it, since it's the same in all other APIs, including older ones that didn't support persistently mapped buffers. If WebGPU requires CPU cached + coherent, then this feature is pretty much unusable on Android, since there's not enough coverage.