Cannot upload/download to UMA storage buffer without an unnecessary copy and unnecessary memory use
Background
WebGPU currently has 2 buffer upload facilities: GPUBuffer.mapAsync() and GPUQueue.writeBuffer().
For GPUBuffer.mapAsync(), there is currently the restriction that a mappable buffer cannot be used as anything else, other than COPY. This means that, in order to be useful, an application has to allocate 2 buffers - one for mapping and one for using. And, if an application wants to round-trip data through a shader, they have to allocate 3 buffers - one for the upload, one for the download, and one for the shader. Therefore, in order to use mapAsync(), an application needs to double (or triple) their memory use and add one or two extra copy operations. On a UMA system, neither the extra allocation nor the copy is necessary, which means there's both a perf and a memory cost to using mapAsync() on those systems. What's more, because the application is explicitly writing this code, there's not really anything we can do to optimize out the extra buffer allocation / copy operation.
On the other hand, GPUQueue.writeBuffer() is associated with a particular point in the queue's timeline, and can therefore be called even when the destination buffer is in use by the GPU. This means that the implementation of writeBuffer() is required to copy the data to an intermediate invisible buffer under the hood, even on UMA systems, and then schedule a copy operation on the queue to move the data from the intermediate buffer to the final destination. This extra allocation and extra copy operation don't necessarily need to exist on UMA systems. (GPUQueue.writeBuffer() is a good API in general because of its simple semantics and ease of use, but it does have this drawback.)
It would be valuable if we could combine the best parts of GPUBuffer.mapAsync() and GPUQueue.writeBuffer() into something which doesn't require an extra allocation or copy on UMA systems. This kind of combination would have to be something that isn't UMA-specific, but would work on both UMA and non-UMA, and UMA systems would be able to avoid extra allocations/copies under-the-hood.
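To make the cost concrete, here is a minimal sketch of both paths as an application writes them today (TypeScript; the helper names are illustrative):

```ts
// Today's mapAsync() upload path: a dedicated staging buffer plus an
// explicit copy into the buffer the shader actually uses.
async function uploadViaMapAsync(
  device: GPUDevice,
  data: Float32Array,
  dst: GPUBuffer, // assumed created with STORAGE | COPY_DST
): Promise<void> {
  // Mappable buffers may only also carry COPY usage, so a second
  // allocation is unavoidable, even on UMA.
  const staging = device.createBuffer({
    size: data.byteLength,
    usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
  });
  await staging.mapAsync(GPUMapMode.WRITE);
  new Float32Array(staging.getMappedRange()).set(data);
  staging.unmap();

  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(staging, 0, dst, 0, data.byteLength);
  device.queue.submit([encoder.finish()]);
}

// Today's writeBuffer() path: one call, but the implementation stages the
// bytes in a hidden buffer and schedules the copy itself.
function uploadViaWriteBuffer(device: GPUDevice, data: Float32Array, dst: GPUBuffer): void {
  device.queue.writeBuffer(dst, 0, data);
}
```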
Goals
- The "async" part of
GPUBuffer.mapAsync()would be valuable, because that allows the implementation to not have to stash any data due to the destination buffer being busy. - The "map" part of
GPUBuffer.mapAsync()would be valuable because it allows the array buffer to be backed directly by GPU memory, thereby potentially avoiding another copy on UMA systems. - The "queue" part of
GPUQueue.writeBuffer()would be valuable, because non-UMA systems would need to schedule an internal copy to the destination, and specifying the queue gives them a place to do that.
Proposal
I think the most natural solution to this would be:
- Give `mapAsync()` an extra `GPUQueue` argument. (`getMappedRange()` and `unmap()` will implicitly use this queue.) We could also say that the queue is optional, and if it's unspecified, the device's default queue will be used instead.
- Relax the requirement that the only other usage a mappable buffer can have is COPY.
That's it!
- On a UMA system, you'd be able to map the destination (storage) buffer directly. No copies, no extra allocations; it's living the UMA dream.
  - For reading, `mapAsync()` would just ignore its `GPUQueue` argument.
  - For writing, `mapAsync()` would use its `GPUQueue` argument to schedule a `clearBuffer()` command over the relevant region of the buffer. After the clear operation is complete, the map promise would be resolved.
- On a non-UMA system:
  - For reading, `mapAsync()` would schedule a copy from the source (storage) buffer to a temporary buffer using the specified `GPUQueue`, and the map operation would proceed just like normal on the temporary buffer. This is exactly what an author would have had to do themself.
  - For writing, `mapAsync()` would just stash the queue, map a temporary buffer, and wait for `unmap()` to be called. When `unmap()` is called, it would schedule a copy on the stashed queue from the temporary buffer to the destination buffer. This is, again, exactly what an author would have had to do themself.
It's important to note that this proposal doesn't restrict the amount of control a WebGPU author has. If an author wants to allocate their own map/copy buffer and explicitly copy the data to/from it on its way to its final destination (as they would do today), they can still do that, and no invisible under-the-hood temporary buffers would be allocated.
This proposal also has a natural path forward for read/write mapping.
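For illustration, a sketch of what an upload could look like under this proposal. The queue argument to mapAsync() is hypothetical (it does not exist in WebGPU today), so this won't type-check against current definitions:

```ts
// Hypothetical API: mapAsync() takes a GPUQueue, and the mappable buffer
// can also carry STORAGE usage, so it is the very buffer the shader uses.
async function uploadDirect(device: GPUDevice, data: Float32Array, buffer: GPUBuffer) {
  // On UMA this maps `buffer`'s own memory. On non-UMA the implementation
  // maps a temporary buffer instead, and unmap() schedules the copy into
  // `buffer` on the given queue.
  await buffer.mapAsync(GPUMapMode.WRITE, 0, data.byteLength, device.queue);
  new Float32Array(buffer.getMappedRange()).set(data);
  buffer.unmap();
}
```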
Overall I'm worried about modifying the buffer mapping mechanisms this late, when it was perhaps the single most difficult part of the API to find a good design for and reach consensus on. I think a UMA optional feature would make a
We discussed the slight inefficiencies that UMA has with the current mapping mechanism multiple times in F2F meetings, and https://github.com/gpuweb/gpuweb/issues/605 suggests a UMA feature could be added to optimize this later (later could be now). The proposal is interesting, but it has some issues.
Side note: there is also mappedAtCreation, which was added so that the initial upload of data into buffers can be made perfectly efficient on UMA systems (up to the copies necessary for process separation).
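For reference, a minimal sketch of that path (this is the real, shipping API; the helper name is illustrative):

```ts
// mappedAtCreation: write initial contents before the buffer is ever used,
// with no MAP_* usage required and no staging buffer in application code.
function createInitializedBuffer(device: GPUDevice, data: Float32Array): GPUBuffer {
  const buffer = device.createBuffer({
    size: data.byteLength, // must be a multiple of 4; Float32Array always is
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
    mappedAtCreation: true,
  });
  new Float32Array(buffer.getMappedRange()).set(data);
  buffer.unmap();
  return buffer;
}
```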
Multiple buffers per GPUBuffer
It breaks down the model that one WebGPU buffer == one underlying API buffer. This is a pretty useful thing to keep, because it makes it very clear to the developer what the memory cost of things is. That you can have temporary staging for mappedAtCreation, and potentially shmem wrapped in the ArrayBuffer given to JS, is already very difficult for developers to reason about in terms of cost.
Cost of consistency for MAP_WRITE
Currently MAP_WRITE buffers give you an ArrayBuffer that contains the current content of the buffer. Since the buffer can only be written by Javascript, no copies are ever needed to update the content of the buffer; it's just the ArrayBuffer wrapping shmem shared between the GPU and Web processes. If the GPU can write to the buffer, then we need to copy data from the UMA buffer to the shmem (or even worse, from VRAM to a readback buffer to shmem).
On the other hand, if we say that mapAsync(MAP_WRITE) always zeroes the buffer, then in most cases the CPU has to zero the buffer, since the UMA/readback buffer isn't shared with the Web process. Either there's a memset(0) or a memcpy from the UMA buffer to the shmem.
Consistency for MAP_READ
What happens when Javascript writes into a buffer that's mapped for reading? Assuming you are able to create an MTLBuffer from a shmem FD to reduce the number of copies as much as possible, the writes that Javascript did would all of a sudden become visible to the GPU, while in all other configurations Javascript writing to the buffer doesn't have any visible effect for the GPU.
Relatively small gains and a feature proposal
The gains from the proposal you suggested seem small: if you have a large amount of data to initialize buffers with, then you can use mappedAtCreation, which is the optimal path. If you need to modify part of a buffer while it's in use, then you have to schedule a copy, because mapping is an ownership transfer of the full buffer (I tried to figure out how to do sub-range mapping efficiently but gave up).
So the cases this helps are when you need to upload data to a buffer that's not being created, but also not currently in use by the GPU. This should be a fraction of the actual buffer transfers. It still might be worth speccing an optional feature, but not modifying the core buffer mapping spec.
The optional "UMA" feature could:
- Lift the restriction for `MAP_WRITE` to allow any other read-only usages.
- `MAP_READ` already allows all the write-only usages. But maybe more are added in the future, so the extension would also lift that? Or it allows it with any other usages, but assumes there is always a UMA -> shmem copy happening in the GPU process (so that JS writes are never made visible to the GPU).
I was going to have comments but @Kangz covered everything I was going to say and more.
It breaks down the model that one WebGPU buffer == one underlying API buffer.
This isn't true. This proposal requires scratch space, certainly, but so does writeBuffer(). It's no worse.
What happens when Javascript writes into the buffer mapped for reading?
This is a good point! I suppose this proposal only makes sense for read/write buffers (which we don't have today, but I think has a natural path forward).
It's relatively rare for an application to actually need read/write mapping. Sure, we could add it for this use case, but applications would still need to know which one to use and explicitly switch between them based on whether the adapter is UMA or not.
This isn't true. This proposal requires scratch space, certainly, but so does writeBuffer(). It's no worse.
writeBuffer is quite explicitly an implementation-managed ringbuffer. But it is not tied to a GPUBuffer; it's only GPUDevice extra memory. Plus its data doesn't need to be persistent. The implementation can destroy the ringbuffers when there is memory pressure, while extra backings for a GPUBuffer would have to stay; otherwise you could get an OOM trying to map the buffer.
I'm not saying I'm for or against any proposal here, only voicing that I agree what's in the API today does not fully satisfy workflows with dynamic data moving across host/device, and that this will be a performance issue in real-world usage. In compute workloads, getting data back from the device is a major part of the upload -> compute -> download flow, and until we have a GPUQueue.readBuffer (🙏 please!) this results in non-trivial complexity and bloat in user code.
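For reference, here's a minimal sketch of the readback dance being described, i.e. roughly what a hypothetical GPUQueue.readBuffer() would wrap (helper name illustrative):

```ts
// Current readback path: copy into a dedicated MAP_READ staging buffer,
// then map it once the queue has executed the copy.
async function readBack(
  device: GPUDevice,
  src: GPUBuffer, // assumed created with COPY_SRC
  size: number,   // multiple of 4
): Promise<ArrayBuffer> {
  const staging = device.createBuffer({
    size,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });
  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(src, 0, staging, 0, size);
  device.queue.submit([encoder.finish()]);

  // Resolves only after the submitted copy has completed.
  await staging.mapAsync(GPUMapMode.READ);
  const result = staging.getMappedRange().slice(0); // copy out before unmap
  staging.unmap();
  return result;
}
```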
But it is not tied to a GPUBuffer, it's only GPUDevice extra memory.
👍 IMO having the implementation manage the ringbuffer with writeBuffer/readBuffer and incurring a copy is acceptable if the alternative is managing exclusive upload or download buffers in user code (I found the spec detail that indicates a buffer cannot be both, resulting in user staging pools needing double the memory for bidirectional transfer). This way, multiple libraries trying to perform upload/download are not each keeping around large GPUBuffers for this purpose.
you could get an OOM trying to map the buffer.
This is no worse than the possibility of an OOM when trying to writeBuffer(), though...
Sure, that's a possibility as well, although implementations could stall to free staging space if they really wanted to.
The point here wasn't that writeBuffer can or cannot OOM; it is that you want to give developers a way to do transfers without the possibility of triggering an unfixable OOM. Buffer mapping can do that, since OOM only happens at buffer creation. If you choose to make staging/readback buffers transient, then you lose this control in the application, because you can OOM on mapAsync as well. It is possible to decide to do that, but we need to be cognizant of all the tradeoffs we are making. In this whole comment thread I suggest it is a bad idea for many reasons, including that OOM issue.
I agree with the concerns about managing temporary buffers for mapping expressed by @Kangz. Their lifetime is attached to mapping, and it's worse than the ring buffer we currently have for writes.
I also agree with @litherum that it would be good to be able to avoid copies on systems that can do that. An optional feature for UMA architectures seems like the right way to proceed. It would basically lift the restriction on usages for buffers, allowing MAP_READ+MAP_WRITE+anything else.
As for the queue argument for mapping, this correlates with https://github.com/gpuweb/gpuweb/issues/1977#issuecomment-884436273. It's probably needed.
I'm happy to help by writing an optional feature for UMA that allows MAP_READ + WRITE.
It would be pretty unfortunate if authors had to opt-in to avoid using 2x memory on UMA machines.
I don't think anyone is disagreeing about that, but if we're going to avoid it we're going to need a proposal that works. I don't think we're getting any closer to one.
WebGPU meeting minutes 2022-02-23
- KN: nobody satisfied with current state, but nobody has a better idea. Everyone's resigned to this fate except Myles. :)
- KG: one thing that has changed since it was first discussed - more common today than 2 years ago to get adapters that let you map CPU read and host read/device use - used to be UMA archs only, and some AMD cards - but has changed now.
- KN: right. Intel doesn't even have some of these options (host-coherent + device-coherent?). Think we could do this on Intel regardless.
- KG: if something we can't support - don't want to fragment the ecosystem by making you write 2 paths. If things have changed - still do need that.
- KN: we don't have solution for doing this underneath the hood of the application. Can do it with an extension. Would like to. Maybe we should do it for 1.0. Would need separate code path for application.
- KR: WebGL doesn't have the ability to optimize for this and performance in this area is basically fine. I think WebGPU will also perform fine in general without this optimization, and since applications will have to add a new code path to take advantage of it, think this should be pushed out to post-V1.
So here's the proposal for the extension: what I wrote above
The optional "UMA" feature could:
- Lift the restriction for MAP_WRITE to allow any other read-only usages.
- MAP_READ already allows all the write-only usages. But maybe more are added in the future so the extension would also lift that? Or it allows it with any other usages, but assumes there is always a UMA -> shmem copy happening in the GPU process (so that JS writes are never made visible to the GPU).
With the addition that if readonly ArrayBuffers become a thing, then we can lift all restrictions on MAP_READ (except MAP_WRITE? not sure), by making the ArrayBuffer returned by mappings for reading be readonly.
WebGPU meeting minutes 2022-03-16
- Myles has been writing lots of webgpu patches instead of thinking about this; can we defer a week?
- CW: UMA storage buffers - I made a proposal a couple lines long. We can have a UMA extension - enable map() with writable buffers with READ_ONLY usage. And vice versa.
- CW: If later we have read-only ArrayBuffers we can have other functionality.
- CW: would be nice if we were able to say - you can have any usage and it just works - but not possible on D3D, and isn't best thing to do. Lots of complexity, e.g. with discrete GPU and also cross-process. Can have a memory "thing" which spans all 3 items - GPU, GPU process, renderer process. Also need consistent behaviors on all systems. Writes to JS have to be visible to JS for readable buffers. Complicated.
- CW: that's why I think only way for proper UMA support is via an extension.
- MM: would this extension also be present on discrete cards? And extension would say, your writes might not be present if you read from it?
- CW: no, behavior should be consistent always, regardless of extension being enabled. That's the main goal.
- MM: so app needs: if (uma) { … } else { … }?
- CW: yes. App can get best behavior on UMA and desktop - there are cases today where you can take the optimal path: buffer mapped at creation, or updating a buffer in pipelined fashion during GPU execution.
- CW: case not handled: big buffer, need to change data after creation, but not always used by the GPU. Don't know when apps would do this. Useful to think of UMA extension because it helps that case. We should already be pretty optimal in most cases though.
- MM: think argument makes sense. Not 100% sure I agree. First statement about 2 buffer upload mechanisms is false though - there's a third, mappedAtCreation. That would work when streaming data from CPU to GPU. Not the other way around though.
- MM: backward direction is definitely less common. Need to do more research.
- MM: other thing - we should try to describe somewhere that mappedAtCreation's expected to be more performant than creating buffer and mapping it.
- CW: should be in non-normative text at least. Brandon made a best practices doc on uploading data with WebGPU. writeBuffer - but mappedAtCreation's pretty good, too.
- BJ: that doc's in flux - please suggest improvements.
- MM: committed to our repo?
- BJ: not yet. Not a good time.
- MM: link to it please?
- BJ: will do. https://github.com/toji/webgpu-best-practices/blob/main/buffer-uploads.md
- CW: think everyone wants to make UMA work amazingly well. But, amazingly hard while keeping consistent behavior from the JS side, and keeping D3D constraints in mind, and a single source for GPUBuffer, etc. Optimizations you want to do in the browser later too. Happy to discuss details with people. Wish we had a better story for UMA, but I can't find one.
- MM: believe you, just don't think we should say it's impossible.
- CW: also happy to discuss offline more. Maybe in office hours.
As discussed in the meeting, moving to post-V1 polish since the only proposal so far is an optional feature.
@Kangz: You appear to have a broken sentence here:
I think a UMA optional feature would make a
I think the gist was "UMA would make sense to put in an optional feature"
GPU Web 2023-06-07/08 (Pacific time)
- Recap the design constraints for this problem
- MM: wanted to touch base before going off and doing a bunch of engineering
- We're interested in UMA working well
- Interested in a potential solution where the same code would "do the right thing" on UMA and non-UMA
- This group posited that that was not possible
- I think it might be
- Want to nail down what the original objections were
- KR: from our side we need enga@ and cwallez@ present for the conversation. Would like to advance this on the Github issue or mailing list.
- Postpone for a week?
- KG: I can try to synthesize
- KG: on non-UMA archs you sometimes need 2 copies, and on UMA you can get to 1 copy. How to pipeline, prioritizing bandwidth/latency, is where Corentin and I ran aground trying to find a single API to do both.
- KG: My position - if you try to figure out the API for these things, you'll either prove us wrong or right, and that's great
- MM: that's reassuring. Think we're in a different situation now than 2021. Now we have 2 ways of getting data on the card. I'd be coming back with a 3rd way. Adding a 3rd way isn't great for the platform, but if an app cares about the tradeoffs, we'd have more options for them.
- Continue this next week.
The current restriction on buffers created with map flags also causes problems on NUMA (non-UMA) architectures. Small, frequently updated uniform buffers can be stored in system memory without significant impact on performance. In addition, with the advent of Resizable BAR and SAM, it is possible to write data directly to VRAM using the CPU (we can even write textures directly to VRAM and later change the access pattern from linear to swizzled, for better bandwidth).
GPU Web WG 2024-10-29/30 Mountain View F2F
- Interested: Mike Wyrzykowski
- AB: Issue bigger than just memory bandwidth. Also serious performance impact on barriers. (no one was taking notes here). End up with all-graphics–to–all-graphics barrier. 15% regression just from barriers.
- …
- AB: If we know the buffer is not scheduled for use, but it's mappable (in vulkan terms, which is basically all buffers on UMA), and any gpu writes are visible to the host, we know we can skip the upload buffer and write directly. For all of these conditions to be true, there is a hidden fast path where developers have to keep one buffer per frame in flight.
- CW: This is why it gets a star. Being in the browser makes it more complicated…
- JS: Does UMA mean zero-copy? Reduced copies?
- AB:
- CW: In WebGPU we have this idea we call "triply mapped buffers" which are visible to the GPU, the browser "GPU process", and the browser "content process" all at once. Possible but very experimental. On Windows may stress the memory system of the OS. On Vulkan not super clear how many devices support it and how well. So unfortunately can't guarantee mappable buffers get triply mapped.
- JS: If there's one copy from content process to GPU process and zero-copy after that, is that OK?
- AB: Zero copy ideal, but eliminating synchronization barriers really good. I've been profiling this mostly in native WebGPU so no browser processes involved.
- CW: Overall, yes please: want to avoid the expensive memory barriers. Very interested in having your input, and as an oracle of whether it actually fixes the memory barriers.
- MW: Many devices with low memory limits like 1.5GB for a web process, zero-copy helps a lot with this [reducing memory pressure].
- KG: whiteboard: UMA pool sizes:
  - None: Harsh Reality
  - Partial: uniforms, dynamic vert data
    - On discrete, have about 256MB memory you can DMA from CPU
  - Full: texture data uploads
    - But with ReBAR you can get large DMA regions.
- JS: Addresses barrier issues?
- CW: No. Think we need to collect all of the problems and try to find a solution. It won't be perfect.
- CW: Know in the past we said we didn't want to expose on discrete because it's pessimizing. But maybe not true. OK if the "UMA" API is slower on discrete.
- KG: Need data. If this "None" category is empty, then it becomes a very interesting thing to have some partial UMA with possible performance cliff past 256MB or whatever.
- CW: Not sure how …
- JS: But doesn't solve synchronization bubbles.
- CW: Since our mappable buffers can't be used for anything else, the user has to enqueue a copy. … applications can have rolling buffers
Hi all,
I've just submitted the proposal for the WebGPU optional feature buffer-map-extended-usages, which adds support for creating mappable buffers with any other buffer usages.
Motivation
- Reduce the memory footprint when transferring a large amount of data between the CPU and GPU.
- Eliminate the extra copy between the staging buffer and the destination buffer, and the related barriers.
New Features
- Allows creating a mappable GPUBuffer (one with `MAP_READ` or `MAP_WRITE` usage) with any other GPUBufferUsage flags, including the use of both `MAP_READ` and `MAP_WRITE`.
- Allows `READ|WRITE` as a valid value for the `mode` parameter in `GPUBuffer.mapAsync()` (see the sketch below).
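A sketch of what creation and mapping could look like with the proposed feature enabled (the feature name and the READ|WRITE map mode come from this proposal and are not in the current spec):

```ts
// Sketch, assuming the proposed "buffer-map-extended-usages" feature.
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter!.requestDevice({
  requiredFeatures: ["buffer-map-extended-usages" as GPUFeatureName],
});

// A single buffer that is both mappable and usable as a storage buffer.
const buffer = device.createBuffer({
  size: 1024,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.MAP_WRITE | GPUBufferUsage.STORAGE,
});

// READ|WRITE mapping: read current contents and modify them in place.
await buffer.mapAsync(GPUMapMode.READ | GPUMapMode.WRITE);
const view = new Float32Array(buffer.getMappedRange());
view[0] += 1; // no staging buffer, no extra copy on UMA
buffer.unmap();
```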
Implementation Details
Platform Requirements
buffer-map-extended-usages can be supported efficiently on backends with a UMA architecture and the conditions below.
| Backends | Preferred Requirements |
|---|---|
| D3D12 | D3D12_FEATURE_DATA_ARCHITECTURE.UMA && D3D12_FEATURE_DATA_ARCHITECTURE.CacheCoherentUMA |
| Metal | [MTLDevice hasUnifiedMemory] == true |
| Vulkan | VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT \| VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT \| VK_MEMORY_PROPERTY_HOST_COHERENT_BIT \| VK_MEMORY_PROPERTY_HOST_CACHED_BIT |
Notes
- D3D12: `D3D12_FEATURE_DATA_ARCHITECTURE.CacheCoherentUMA == true` is recommended according to this document.
- Vulkan: `HOST_VISIBLE_BIT` is required for access on the CPU; `DEVICE_LOCAL_BIT` is preferred for access on the GPU; `HOST_CACHED_BIT` is preferred because such memory types provide cached storage on the CPU; `HOST_COHERENT_BIT` is preferred because without it we must manually call `vkFlushMappedMemoryRanges()` when the CPU has finished writing data and `vkInvalidateMappedMemoryRanges()` when GPU data has been written back.

According to this document, "do not read back data from uncached memory on the CPU", so we require at least `DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_CACHED_BIT` for buffer-map-extended-usages, and `DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_CACHED_BIT | HOST_COHERENT_BIT` is preferred to guarantee the performance of CPU reads. `DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_CACHED_BIT | HOST_COHERENT_BIT` is only available on 39.85% of Vulkan devices, with no Mali GPUs supported, while the coverage of `DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_CACHED_BIT` is 56.32%.
Data Transmissions in Browser
| Map Mode | Buffer is read-only on GPU | Return value of getMappedRange() | Data in the returned array buffer after unmap() |
|---|---|---|---|
| MAP_READ | - | Buffer's current values | Discarded |
| MAP_WRITE | true | Buffer's current values | Stored in the GPUBuffer |
| MAP_WRITE | false | The default initialized data (zeros) or data written by the webpage during a previous mapping | Stored in the GPUBuffer |
| MAP_READ\|MAP_WRITE | - | Buffer's current values | Stored in the GPUBuffer |
Related Topics
MAP_WRITE + any other read-only GPUBufferUsage flags in WebGPU Core SPEC
On non-UMA architectures the above combination can also be implemented with an internal staging buffer and an implicit buffer-to-buffer copy inside the WebGPU implementation.
- I prefer not supporting it in the WebGPU core spec, as such a staging buffer could be neither managed nor even noticed by the developer, while I think WebGPU should give the developer more control over GPU resources.
MAP_WRITE + any other GPUBufferUsage flags with zero GPU copy as an optional WebGPU feature
The above combination can be supported on platforms that support buffer-map-extended-usages, or on ones with Resizable Base Address Register (ReBAR) enabled.
ReBAR is a PCIe capability that allows a PCIe device, such as a discrete graphics card, to negotiate the BAR size to optimize system resources. Without ReBAR the CPU can only access a small portion (256MB) of GPU memory at a time. With ReBAR the CPU can access the entire GPU memory (VRAM).
Below are the platform requirements that can explicitly take advantage of ReBAR:
| Backends | Requirements |
|---|---|
| D3D12 | Supports heap type D3D12_HEAP_TYPE_GPU_UPLOAD: D3D12_FEATURE_DATA_D3D12_OPTIONS16.GPUUploadHeapSupported == true (since Windows 11 24H2) |
| Metal | None |
| Vulkan | VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT \| VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT \| VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, and VkMemoryHeap.size is the whole GPU memory instead of 256MB |
- For a WebGPU implementation with a standalone GPU process this is fine, as we always need to copy data from the CPU to the GPU.
- For a single-process WebGPU implementation there is a problem: there is no CPU-side cache for ReBAR memory, so it will be very slow to read any data back from that mapped pointer.
DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_CACHED_BIT | HOST_COHERENT_BIT [= 15] is only available on 39.85% Vulkan devices with no Mali GPU supported, while the coverage of DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_COHERENT_BIT [= 7] is 56.32%.
That link is to DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_CACHED_BIT = 11.
56%, including 45 AMD, 20 NVIDIA, 43 Intel, 785 ARM, 1234 Qualcomm, 227 ImgTec.
DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_COHERENT_BIT = 7:
85%, including 849 AMD, 323 NVIDIA, 151 Intel, 802 ARM, 1168 Qualcomm, 236 ImgTec.
BTW, both have a few extra devices which used extension versions of the same capability bits.
Can you give some examples of how you expect users to use the new APIs? For when users want to support the fastest path for UMA/ReBAR systems, but still also support other systems. I'm having a hard time figuring out what the proposal means in practice.
I'm very excited for this capability though, as without dedicated transfer queues or the proposed changes, there's no way to upload to buffers without blocking the rendering queue.
For texture uploads, we still need something like VK_EXT_host_image_copy to not take up the device queue while uploading textures.
Can you give some examples of how you expect users to use the new APIs? For when users want to support the fastest path for UMA/ReBAR systems, but still also support other systems. I'm having a hard time figuring out what the proposal means in practice.
That's exactly what we'd like to discuss in the WG.
In my opinion WebGPU should provide more control over the underlying GPU resources, so we should support creating mappable buffers with any other usages as an optional feature, but that means we have to use different code paths to handle buffer uploads on different platforms.
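For illustration, a sketch of what those two code paths might look like in an application (the feature name is from this proposal; `dst` is assumed to be created with the usages each path needs):

```ts
// Sketch: pick the upload strategy based on the optional feature.
async function upload(device: GPUDevice, data: Float32Array, dst: GPUBuffer) {
  if (device.features.has("buffer-map-extended-usages")) {
    // UMA path: `dst` was created with MAP_WRITE | STORAGE; map it directly.
    await dst.mapAsync(GPUMapMode.WRITE);
    new Float32Array(dst.getMappedRange()).set(data);
    dst.unmap();
  } else {
    // Core path: let the implementation stage through its ring buffer.
    device.queue.writeBuffer(dst, 0, data);
  }
}
```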
I don't understand the requirement for HOST_CACHED_BIT. We need a fast zero-copy path to upload data from the CPU to the GPU. A developer would write data CPU -> GPU, so there's no need for CPU-side caching. Write combining will do everything we need.
Of course programmers have to be careful with write combining to avoid performance pitfalls, but graphics programmers have been used to that since the 20-year-old DirectX and OpenGL write-only buffer maps (those return a write-combined memory pointer).
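The discipline in question, sketched (illustrative helper): treat a write-mapped range as write-only and fill it with sequential bulk writes, never reading it back.

```ts
// Write-combined-friendly fill of a mapped buffer: bulk, forward-only writes.
function fillMapped(buffer: GPUBuffer, data: Float32Array): void {
  const dst = new Float32Array(buffer.getMappedRange());
  dst.set(data); // good: one sequential write
  // Avoid on write-combined memory: reads of the mapping, e.g.
  //   dst[0] += 1; // read-modify-write issues slow uncached CPU reads
}
```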
Mobile GPUs have heaps of this type: DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_COHERENT_BIT
There's no support for HOST_CACHED + HOST_COHERENT on many devices. Could we remove the cached requirement?
From the minutes (to be posted here soon):
- CW: problem with just coherent + non-cached is: we're hoping to "triply map" buffers. Create shmem in JS process, send to GPU process, use API-specific method to import that memory into GPU address space. VK_external_memory_host (_fd?). Also Metal and D3D12. If we do this, JS has an ArrayBuffer that becomes uncacheable. Seems weird to expose uncacheable memory to JS. [surprising performance properties]
Of course we could avoid triply-mapping this type of buffer on systems where we can't get HOST_CACHED. As Jiawei's investigation says these bits are only "preferred", not required.
Yes, the minus is that you need to understand the performance implications of write-combined memory. But my argument is that all low-level gfx programmers already understand it, since it's the same in all other APIs, including older ones that didn't support persistently mapped buffers. If WebGPU requires CPU cached + coherent, then this feature is pretty much unusable on Android, since there's not enough coverage.