WebGPURenderer: Current UBO system has severe performance issues with many render items.
Description
Summary
When switching from WebGLRenderer to WebGPURenderer, I experience a significant drop in performance. The same scene, containing thousands of non-instanced meshes, runs smoothly at 60 FPS on WebGL but drops to 15 FPS on WebGPU, a 4x decrease in performance.
Expected Behavior
WebGPURenderer should provide comparable or better performance than WebGLRenderer, given its modern API and intended improvements over WebGL.
Current Behavior
Rendering 20,000 non-instanced basic cube meshes:
WebGLRenderer: ~60 FPS on Mac (Apple Silicon M1 Pro)
WebGPURenderer: ~15 FPS (4x slower)
No errors or warnings appear in the Chrome console.
Reproduction steps
- Create a Three.js scene with 20,000 Mesh instances.
- Use WebGLRenderer and observe smooth 60 FPS performance.
- Switch to WebGPURenderer by uncommenting the renderer swap.
- Observe FPS dropping significantly (down to 15 FPS).
Code
see live example below
Live example
https://jsfiddle.net/15zfestk/1/
Version
r173
Device
Desktop
Browser
Chrome
OS
MacOS
Some more context: https://discourse.threejs.org/t/why-webgpurenderer-performance-significantly-lower-than-webglrenderer/77629/9
The live example uses more or less the worst case setup for the renderer. Many objects which update their transformation every frame. Since the example uses no instancing or batching, existing performance issues in WebGPURenderer are exhibited.
Because every object has its own UBO for managing its object-scope uniforms, all UBOs must be bound and updated each frame, which seems to cause a considerable amount of overhead. The WebGL backend spends most of its time in the bindBufferBase(), bindBuffer() and bufferData() calls.
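To illustrate the pattern, here is a rough sketch of what a per-object UBO update loop amounts to in WebGL 2 terms (renderList, item.objectUBO and OBJECT_BINDING_POINT are hypothetical names, not the actual renderer internals):

```js
// Hypothetical per-object UBO handling: one bind + upload pair per render item.
for (const item of renderList) {
  gl.bindBuffer(gl.UNIFORM_BUFFER, item.objectUBO);
  gl.bufferData(gl.UNIFORM_BUFFER, item.uniformData, gl.DYNAMIC_DRAW); // re-upload every frame
  gl.bindBufferBase(gl.UNIFORM_BUFFER, OBJECT_BINDING_POINT, item.objectUBO);
  gl.drawElements(gl.TRIANGLES, item.indexCount, gl.UNSIGNED_SHORT, 0);
}
// With 20,000 render items this means 20,000 bind/upload pairs per frame.
```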
To further explain the major performance gap: this is how WebGLRenderer renders the scene when four cubes are configured:
[screenshot: WebGL API call trace]
There are no major state changes between the draw calls (except for some single uniform updates which are not displayed in the list). Compared to that, WebGPURenderer does the following:
[screenshot: WebGPU backend API call trace]
As you can see, there is a considerable number of state changes between each drawElements() command. Many scenes won't have an issue with this because the number of render objects is low. But the more render objects you have in a scene, the sooner WebGPURenderer becomes CPU limited.
#30562 fixes the VAO related issues, but the gains are unfortunately negligible compared to the UBO related overhead. I guess we need a different approach in the renderer to minimize these state changes.
@sunag @RenaudRohlinger @aardgoose Would a single UBO for all object-scope uniforms be a potential solution?
Nice catch with the VAO!
Related (I like the CommonUniformBuffer interface): https://github.com/mrdoob/three.js/pull/27388
Unless we implement a pool system, I don't think we can use a single UBO for all object-scope uniforms as a solution, since we'd be very limited in the number of meshes: with a typical 16KB max block size and each mat4 taking 64 bytes in std140, that limits us to about 256 meshes.
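For reference, both limits can be queried at runtime. A minimal sketch using standard WebGL 2 / WebGPU calls, assuming gl and device are an existing context and device:

```js
// WebGL 2: the spec guarantees at least 16384 bytes per uniform block.
const maxBlockSize = gl.getParameter(gl.MAX_UNIFORM_BLOCK_SIZE);

// WebGPU: the default limit is 65536 bytes.
const maxBindingSize = device.limits.maxUniformBufferBindingSize;

// e.g. 16384 bytes / 64 bytes per std140 mat4 = 256 matrices per bound block
```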
Good to know that. I hope we can revisit https://github.com/mrdoob/three.js/pull/27388 soon.
With a typical 16KB max block size
This is the limitation per draw call as guaranteed by the WebGL 2 specification. You can have a larger buffer bound and adjust the offset dynamically. I've shared many words about this and about scheduling in general, but reading this, I'm not sure they've been heard.
Can this issue be clarified: is it performance with the WebGL fallback backend that the OP has an issue with, the WebGPU backend, or both?
Re #27388, I'll revisit it in a few weeks' time. I recall looking at applying a similar mechanism to the WebGL fallback, but found that more complicated because of the different API styles rather than a buffer size issue, although I'd have to check.
Can this issue be clarified: is it performance with the WebGL fallback backend that the OP has an issue with, the WebGPU backend, or both?
Both backends have the performance issue.
You can have a larger buffer bound and adjust the offset dynamically.
If I understand the spec correctly, you can use gl.bindBufferRange() in WebGL 2 for that purpose.
https://registry.khronos.org/OpenGL-Refpages/es3.0/html/glBindBufferRange.xhtml
In WebGPU, it should be the dynamicOffsets parameter of setBindGroup().
https://www.w3.org/TR/webgpu/#gpubindingcommandsmixin-setbindgroup
It seems both APIs are not used in #27388 yet.
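For illustration, a minimal sketch of how the two calls could be used with a large shared buffer, assuming a 256-byte stride and hypothetical sharedUBO / bindGroup / objectIndex names (offsets must respect UNIFORM_BUFFER_OFFSET_ALIGNMENT in WebGL 2 and minUniformBufferOffsetAlignment in WebGPU):

```js
const stride = 256; // per-object slice, aligned

// WebGL 2: bind only this object's slice of the shared buffer.
gl.bindBufferRange(gl.UNIFORM_BUFFER, bindingPoint, sharedUBO, objectIndex * stride, stride);

// WebGPU: the same idea via dynamic offsets; requires hasDynamicOffset: true
// in the corresponding bind group layout entry.
pass.setBindGroup(0, bindGroup, [objectIndex * stride]);
```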
@mrdoob @Mugen87 My apologies for the delay. I'm currently dealing with a health issue, but I will look into it as soon as I recover.
#27388 only applies to WebGPU; the dynamicOffsets parameter isn't really useful in the current renderer AFAICS. The offset in createBindGroup is all that is required to use a single buffer.
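That would look roughly like this: one bind group per object, each pointing at a different static offset into the same buffer (sharedUniformBuffer, bindGroupLayout and objectIndex are illustrative names):

```js
const bindGroup = device.createBindGroup({
  layout: bindGroupLayout,
  entries: [{
    binding: 0,
    // GPUBufferBinding: the same buffer for every object, different static offset
    resource: { buffer: sharedUniformBuffer, offset: objectIndex * 256, size: 256 },
  }],
});
```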
The issue with WebGL is that the buffer updates and draw calls are interleaved and executed in a single pass, whereas the WebGPU renderer updates the arrayBuffer and queues the draw calls for later execution. This allows the single buffer update to be inserted before the queued draw calls are executed.
For a WebGL solution you need two passes through the render list:
- Pass 1: update the intermediate arrayBuffers only.
- Write the GL buffer once.
- Pass 2: issue the draw calls.
This doesn't match the current code structure.
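A rough sketch of that two-pass structure, assuming a single shared buffer, a fixed aligned per-object stride, and that only the world matrix is stored per object (all names hypothetical):

```js
const stride = 256; // bytes per object, a multiple of UNIFORM_BUFFER_OFFSET_ALIGNMENT
const cpuData = new Float32Array((renderList.length * stride) / 4);

// Pass 1: update the intermediate CPU-side array only.
renderList.forEach((item, i) => {
  cpuData.set(item.object.matrixWorld.elements, (i * stride) / 4);
});

// One upload for all objects instead of one bufferData() per object.
gl.bindBuffer(gl.UNIFORM_BUFFER, sharedUBO);
gl.bufferData(gl.UNIFORM_BUFFER, cpuData, gl.DYNAMIC_DRAW);

// Pass 2: draw calls, each binding its slice of the shared buffer.
renderList.forEach((item, i) => {
  gl.bindBufferRange(gl.UNIFORM_BUFFER, bindingPoint, sharedUBO, i * stride, stride);
  gl.drawElements(gl.TRIANGLES, item.indexCount, gl.UNSIGNED_SHORT, 0);
});
```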
I don't know if this is related to this point or if it is a separate topic. Since r173 I have noticed a frame drop (WebGPURenderer): suddenly the frame rate drops from 120 fps to 30 fps. Since I haven't changed anything in the app itself, just the three.js release from r172 to r173 and now r174, I keep noticing this. There is no error message, which makes the analysis more difficult. The app runs at 120 fps and suddenly it drops to 30 fps; sporadically it peaks back to 120 fps. Since I'm not allocating any new buffers or geometries, that's strange. Because up until r172 it always ran at 120 fps, something must have happened from r172 to r173.
@Spiri0 I noticed a similar issue on Windows after r172. It was caused by a memory leak and should be fixed with https://github.com/mrdoob/three.js/pull/30647
@RenaudRohlinger Your change seems to make the frame drop less frequent. I was curious and applied it in my apps, but the frame drop still occurs. An interesting phenomenon.
Both times I just started the app and did nothing else. But in the first picture you can see a constant 120 fps, which was always the normal case.
In the second screenshot, three.js seems to fall into something like a safe mode with 30 fps. There are no error messages. Even stranger: as soon as I open the console, it suddenly jumps to 120 fps. As you can see, the render loops are very fast, less than a millisecond, so three.js and the app are very much in the green zone in terms of performance.
It's definitely an improvement, because now it jumps from 30 fps to 120 fps instead of from 120 fps to 30 fps. I've tested opening the console several times, and so far with your change it always triggers the jump from the faulty 30 fps to 120 fps. That's very good, because it proves that your change is on the right track.
Thanks @Spiri0! Could you try this in a different web browser with good WebGPU support (Chrome Canary, Chrome Beta, Edge) and potentially on a different device, to confirm whether this is more on the three.js side rather than in how it interacts with your browser/GPU?
Also, knowing your GPU model would greatly help.
Good point. I use Slimjet because Chrome on Linux had limited WebGPU support for some time. But with the current version of Chrome it's working normally again, so Slimjet is the cause. I tested it again with maximum WebGPU limits, and with the latest Chrome it runs at 120 fps again. That's reassuring. I have an AMD Radeon RX 7900 XTX. What I do runs at 120 fps even if I push three.js to the maximum limits of my GPU. This means that three.js has already implemented WebGPU very well.
I'm also encountering a similar issue. Are there any future plans or optimization goals for this problem?
Are there any future plans or optimization goals for this problem?
I have plans to work on it this month; after solving this issue I will check it out.
I update large quantities of uniforms with the WebGPURenderer without any problems. However, I do this manually with storage buffers. With these, I can selectively update thousands of uniforms very efficiently on the CPU side. I've implemented something like this privately (#27388). I believe this issue is about this topic.
Are there any recent developments? The current performance of WebGPU remains the biggest hurdle for our migration.
Are there any recent developments? The current performance of WebGPU remains the biggest hurdle for our migration.
@jellychen I would just like to ask for your patience, repetitive questions do not solve anything, and this is not the only task we have, we have many more demands than people working on them.
Are there any recent developments? The current performance of WebGPU remains the biggest hurdle for our migration.
@jellychen You can hire me, and I'll solve your problem. It's already possible to handle it with the WebGPURenderer. But rather than a serious job request, I'd rather point out that this could basically be implemented user-side with what's available. Just on a deeper threejs level.
I'm really sorry for my lack of patience. We have some projects that are currently stuck at this point. Hopefully, we can get through this smoothly.
@greggman Hi again 👋 . I would like to ask for advice regarding this issue.
Context: In WebGLRenderer, we did not use UBOs in our material system. Uniforms were individually updated via uniform[1234][fi][v]() or uniformMatrix[234]fv(). In WebGPURenderer, we have introduced UBO usage since it is required for WebGPU anyway and we wanted to use the same approach for the WebGL backend as well.
Currently we have shared UBOs for certain data (like camera uniforms) but separate UBOs for each object to maintain the object-scope uniform data (e.g. the world matrix). It turned out that with many render objects, this approach performs far worse than directly updating individual uniforms via e.g. uniform[1234][fi][v](). The performance difference is so large that it's obvious we are not using UBOs as intended.
How would you organize object-scoped data with UBOs? Is the idea to have a single large buffer holding the data and to just bind a specific range for each draw call (as suggested in https://github.com/mrdoob/three.js/issues/30560#issuecomment-2675082155)? Or are UBOs in general a bad choice for fine-grained data that changes per render call? I've heard of developers complaining about UBO usage because of that particular issue. In at least one documented case, the developer abandoned UBOs and went back to individual uniform updates since they were faster overall (see https://github.com/mrdoob/three.js/issues/13700#issuecomment-376859354), even faster than having a single large UBO.
Unfortunately, that is not an option we can use with WebGPU so we need an optimized solution with UBOs.
Here are two links that show how well Three.js WebGPU works. These are my prototype apps from over a year ago. I selectively update specific areas within the SBOs (from the CPU side). Three.js uses exactly the right WebGPU tools to handle this efficiently. Don't worry about the licensing stuff; it's just there to point out that decompilation is not permitted.
https://the-mars-project-app.site/local_map_viewer_v3/ https://the-mars-project-app.site/local_map_viewer_v4/
Use WASD, the arrow keys, and Space for movement. I don't have collision detection in these old examples. The apps load new areas as you move around, and this data and its associated metadata then need to be efficiently sent to the GPU from the CPU. Doing this with traditional uniforms and textures would be cumbersome. One must be aware that WebGPU is a different world, and in many things there is no 1:1 analogy to WebGL. The two examples are intentionally kept somewhat simple, as I now also use more structured surfaces and advanced techniques. I understand the desire to do things in the familiar way one is used to from WebGL. But if WebGPU were designed to be analogous to WebGL, then WebGPU wouldn't have been necessary, because WebGL could simply have been developed further. These two examples demonstrate that Three.js WebGPU can very efficiently transfer larger amounts of data from the CPU to the GPU each frame. Furthermore, the structs allow for clean clustering. I admit this isn't as straightforward as using simple uniforms in WebGL, but I couldn't have achieved this with WebGL; it is only possible with the WebGPU tools in Three.js WebGPU. From my perspective, there is no issue here.
From my perspective, there is no issue here.
Unfortunately, there is. It is definitely an issue when developers migrate to WebGPURenderer and experience noticeable performance degradation in their apps. It is completely valid to have many individual render items in your scene and the renderer must handle this in a performant fashion. The current UBO implementation does not do that.
I would appreciate if this issue is not misused to demonstrate how performant certain apps have been implemented with WebGPU. Let's focus on how to correctly use UBOs with many individual render objects and different per-object uniform data.
I don't know if I can give three.js specific advice.
For WebGPU there are some experiments here.
The short version was: try to allocate one or more larger buffers for all UBOs, update all objects, and do one upload per buffer (instead of one per object). There are examples of using device.queue.writeBuffer to update the buffers, and of buffer.mapAsync. On an M1 Mac they both appear to perform about the same. The M1 Mac is UMA (Unified Memory Architecture), so I'd expect different results on a PC with an AMD or NVIDIA GPU. I don't have access to my PC right now so I can't check.
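A minimal sketch of that single-upload approach with device.queue.writeBuffer, assuming a fixed 256-byte per-object stride and a hypothetical worldMatrix array per object:

```js
const stride = 256; // respect device.limits.minUniformBufferOffsetAlignment
const sharedBuffer = device.createBuffer({
  size: objectCount * stride,
  usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
});

const staging = new Float32Array((objectCount * stride) / 4);

function updateAll(objects) {
  // Update all objects in CPU memory first...
  objects.forEach((obj, i) => {
    staging.set(obj.worldMatrix, (i * stride) / 4);
  });
  // ...then do one upload per buffer instead of one per object.
  device.queue.writeBuffer(sharedBuffer, 0, staging);
}
```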
Otherwise:
- Putting vertex data in one buffer per object is better than one buffer per attribute, as it's a single call to setVertexBuffer.
- Separating uniforms into global, per-material, and per-object is probably a good idea.
In my tests, WebGPU was significantly faster than WebGL
M1 Mac
| technique | result |
|---|---|
| WebGL using gl.uniform | 6000 cubes at 120fps |
| WebGL using UBOs in one large buffer | 3000 cubes at 120fps |
| WebGPU using one large buffer w/WriteBuffer | 15000 cubes at 120fps |
| WebGPU using 2 large mapped buffers | 14000 cubes at 120fps |
It's disappointing that WebGL is slow with UBOs in Chrome. Checking Firefox I got 5000 cubes using gl.uniform and 4500 using UBOs so maybe there is room for improvement in ANGLE in Chrome for UBOs.
From my perspective, there is no issue here.
Unfortunately, there is.
Then I apologize. I tend to take alternative paths. I'm doing exactly what you want to do with the UBOs, just with SBOs.
I should add, there are other optimizations you can make, but I don't know which ones are appropriate for three.js
There's Brandon's example of doing frustum culling on the GPU. I'm told that most AAA game engines use this type of technique. But, game engines are often not generic. They design around the engine rather than support every possible use case.
Another thing you can potentially do is use storage buffers. Storage buffers can be larger and more flexible than uniform buffers. So, for example, if every object to be drawn uses the same struct, you can make an array of those structs:

```wgsl
@group(0) @binding(0) var<storage> perObjectData: array<PerObjectData>;
```
You can then select the per-object data in the array by a single index, for example instance_index, which you can pass in via drawXXX. Example:
```js
// pseudo code
// draw 1 object and set `instance_index` to `someCubeInstanceIndex`
pass.setVertexBuffer(0, cubeVertexBuffer);
pass.draw(numCubeVertices, 1, 0, someCubeInstanceIndex);

// draw 1 object and set `instance_index` to `someSphereInstanceIndex`
pass.setVertexBuffer(0, sphereVertexBuffer);
pass.draw(numSphereVertices, 1, 0, someSphereInstanceIndex);

// draw 1 object and set `instance_index` to `someSuzanneInstanceIndex`
pass.setVertexBuffer(0, suzanneVertexBuffer);
pass.draw(numSuzanneVertices, 1, 0, someSuzanneInstanceIndex);
...
```
In WGSL:

```wgsl
// pseudo code
struct PerObjectData {
  ...
};

struct Interop {
  @builtin(position) pos: vec4f,
  @location(0) @interpolate(flat, either) instance_index: u32,
  ...
};

@group(0) @binding(0) var<storage> perObjectData: array<PerObjectData>;

@vertex fn vs(@builtin(instance_index) instance_index: u32) -> Interop {
  let objData = perObjectData[instance_index];
  ...
  return Interop(pos, instance_index, ...);
}

@fragment fn fs(v: Interop) -> @location(0) vec4f {
  let objData = perObjectData[v.instance_index];
  ...
  return color;
}
```
Again, many games do this kind of thing, but they do it by requiring all objects to use the same struct and/or the same pipeline. Even without the same pipeline you can make a struct that is the union of all needed properties. This gets more useful with bindless, since you can bind all the textures needed for all objects at once. Bindless support is probably at least a year away though.
After some more investigation I think we can use #27388 as a starting point. However, to make this work with WebGL as well, we need to refactor the renderer a bit.
- After preparing the render lists and before processing the render items, the nodes must be built for each render item. This automatically defines the related uniforms, which allows the engine to organize the object-scope uniforms into single buffers. So we either need an additional iteration over the render lists for building the nodes, or we do this right when the render item is pushed into the render list.
- @greggman I don't fully understand the size restriction yet. Does the 64KB in WebGPU restrict the overall size of the buffer or just the portion you can bind? Is there a similar size restriction for WebGL 2? I understand there is MAX_UNIFORM_BLOCK_SIZE, and this does indeed just restrict the amount of memory you can bind, not the buffer itself.
- When the buffers are prepared, they are updated once via writeBuffer() in WebGPU and bufferData() in WebGL before rendering starts.
- Then we use GPUBufferBinding in WebGPU for describing the bind group entry for createBindGroup(). This allows defining an offset into the large shared UBO so we process the relevant data for the draw command. The same is true for bindBufferRange() in WebGL 2.
Does that sound good as a first step to optimize the UBO usage?