RFC: WebGPURenderer prototype single uniform buffer update / pass
Prototype mechanism to reduce the number of writeBuffer() calls by using a single large buffer for all object uniform groups, updated once before the render pass is submitted, as is done in some other WebGPU engines.
All examples run correctly with this PR. The effect is greatest with large numbers of rendered objects. The biggest change is in GPU thread time, which drops sharply when testing with the webgpu_sprites example: from ~5 ms/frame with per-object buffers to ~2.5 ms with the single buffer in my brief testing.
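To illustrate the idea (this is a hypothetical sketch, not the PR's actual code): per-object uniform data is packed into one large CPU-side array at 256-byte-aligned offsets, so the whole range can be uploaded with a single `queue.writeBuffer()` call and each object bound via a dynamic offset. The `packUniforms` helper and its inputs are invented for illustration.

```javascript
// Hypothetical sketch: pack per-object uniform data into one large
// CPU-side buffer at 256-byte-aligned offsets, so a single
// queue.writeBuffer() call can upload everything for the frame.
const BLOCK_ALIGN = 256; // minUniformBufferOffsetAlignment on most adapters

function packUniforms(objects) {
  // Each object contributes `byteLength` bytes, rounded up to the alignment.
  const offsets = [];
  let cursor = 0;
  for (const obj of objects) {
    offsets.push(cursor);
    cursor += Math.ceil(obj.byteLength / BLOCK_ALIGN) * BLOCK_ALIGN;
  }
  const cpuBuffer = new Uint8Array(cursor);
  objects.forEach((obj, i) => cpuBuffer.set(obj.data, offsets[i]));
  return { cpuBuffer, offsets }; // offsets double as dynamic bind offsets
}

// Per frame: one upload instead of one writeBuffer() per object, e.g.
//   device.queue.writeBuffer(gpuBuffer, 0, cpuBuffer);
// then draw each object with setBindGroup(0, bindGroup, [offsets[i]]).
```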
No attempt has been made yet to:
- synchronize buffer updating and reading
- allow buffer resizing or detect buffer overflow
- recover buffer space when objects are deleted
Reducing the number of writeBuffer() calls is also part of https://github.com/mrdoob/three.js/pull/27134. Once that is configured, many calls will be reduced: instead of being updated per object, they will be updated per frame.
I like your idea, but I wonder if it wouldn't be better to have this configured in Node and adjusted at setup()?
Hi @aardgoose
Would you mind fixing the conflicts? I was thinking about merging this PR soon.
I'll take a look tomorrow.
Awesome! Is it ready for review @aardgoose? Can you promote it from Draft to PR maybe?
@RenaudRohlinger will do.
We might want to select specific uniform groups to be managed in this way, which is now possible as the buffer is passed through the NodeBuilder.
An obvious next stage is to look at reclaiming unused buffers, but we need a deallocation mechanism first, when a material is disposed of.
Added per-extent-size lists (extents being multiples of the block size) for buffers freed when objects are removed from the scene graph. These lists are used for new allocations in preference to free space at the end of the buffer.
Block size is typically 256 bytes (https://web3dsurvey.com/webgpu/limits/minStorageBufferOffsetAlignment).
Added a reworked example with continuous removal and addition of objects, plus stats demonstrating buffer use. It only exercises blocks of 256 B or less.
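A minimal sketch of the free-list scheme described above (names and structure are my own, not the PR's): freed regions are kept in per-extent lists, where the extent is the allocation size rounded up to the 256-byte block size, and are reused before the buffer grows at the tail.

```javascript
// Hypothetical sketch of the per-extent free-list allocator: reuse freed
// regions of matching extent before taking new space at the buffer tail.
const BLOCK_SIZE = 256;

class BufferAllocator {
  constructor() {
    this.end = 0;               // next unused offset at the buffer tail
    this.freeLists = new Map(); // extent (bytes) -> array of free offsets
  }

  allocate(byteLength) {
    const extent = Math.ceil(byteLength / BLOCK_SIZE) * BLOCK_SIZE;
    const list = this.freeLists.get(extent);
    if (list && list.length > 0) return list.pop(); // reuse freed space first
    const offset = this.end;    // otherwise grow at the end of the buffer
    this.end += extent;
    return offset;
  }

  free(offset, byteLength) {
    const extent = Math.ceil(byteLength / BLOCK_SIZE) * BLOCK_SIZE;
    if (!this.freeLists.has(extent)) this.freeLists.set(extent, []);
    this.freeLists.get(extent).push(offset);
  }
}
```

Note that space freed at one extent size is only reused for allocations of the same extent, which keeps the lists trivial at the cost of some fragmentation.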
📦 Bundle size
Full ESM build, minified and gzipped.
Filesize dev | Filesize PR | Diff
---|---|---
685.1 kB (169.6 kB) | 685.1 kB (169.6 kB) | +0 B
🌳 Bundle size after tree-shaking
Minimal build including a renderer, camera, empty scene, and dependencies.
Filesize dev | Filesize PR | Diff
---|---|---
462 kB (111.4 kB) | 462 kB (111.4 kB) | +0 B
I've been conducting performance benchmarks and believe this PR could significantly enhance the webgpu_performances.html example, particularly within the WebGL backend. It could potentially boost performance from around 30fps to over 120fps.
Due to the force-push, I'm unable to check out the PR myself. If possible, could you give it a try?
Additionally, to address the performance issues in webgpu_performances.html, I'm considering using gl.bindBufferRange and gl.bufferSubData instead of gl.bufferData( gl.UNIFORM_BUFFER, data, gl.DYNAMIC_DRAW ). This won't solve everything on its own, but it should improve the overall UBO strategy in the WebGL backend.
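A sketch of what that change might look like (hypothetical helper, not backend code): the UBO's storage is allocated once up front, then each update writes only the changed range with `bufferSubData()` and exposes the object's slice with `bindBufferRange()`, avoiding the per-call reallocation implied by `bufferData()`.

```javascript
// Hypothetical sketch: update a persistent UBO in place and bind a
// sub-range of it, instead of re-allocating storage with bufferData().
function updateAndBindUBO(gl, ubo, data, bindingPoint, offset, size) {
  gl.bindBuffer(gl.UNIFORM_BUFFER, ubo);
  // Upload only the changed range; storage was allocated once up front.
  gl.bufferSubData(gl.UNIFORM_BUFFER, offset, data);
  // Expose just this object's slice of the UBO to the shader.
  gl.bindBufferRange(gl.UNIFORM_BUFFER, bindingPoint, ubo, offset, size);
}
```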
While I'm still investigating the exact cause of the performance drop in WebGL, I'm fairly confident this PR addresses a major bottleneck. The issue seems to stem from overwhelming the GPU with hundreds of buffer uploads, or at least CPU-GPU data transfers, which then causes 5-6 dropped frames out of every 6 in the rAF loop. Although this PR is more of a feature that happens to work as a workaround, it should help significantly. In the long term, implementing a caching system in the UBO logic that prevents unnecessary uploads with more precise ranges might be the real solution to the WebGLBackend performance issues.
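The dirty-range caching idea could work roughly like this (an illustrative sketch with invented names, not an existing three.js API): changed byte ranges of the shared buffer are recorded during the frame and merged before upload, so only the spans that actually changed are transferred.

```javascript
// Hypothetical sketch of dirty-range tracking: record changed byte ranges
// and merge overlapping/adjacent ones into minimal upload spans.
class DirtyRanges {
  constructor() { this.ranges = []; }

  mark(start, end) { this.ranges.push([start, end]); }

  // Returns merged spans and resets; each span would be one
  // bufferSubData()/writeBuffer() call instead of one per object.
  flush() {
    const sorted = this.ranges.sort((a, b) => a[0] - b[0]);
    const merged = [];
    for (const [s, e] of sorted) {
      const last = merged[merged.length - 1];
      if (last && s <= last[1]) last[1] = Math.max(last[1], e);
      else merged.push([s, e]);
    }
    this.ranges = [];
    return merged;
  }
}
```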
/cc @sunag @Mugen87
We need to check whether the WebGLBackend still makes redundant calls. Last time I looked at this, the WebGLRenderer had more state comparators, so it only sends commands that have actually changed to the WebGL state.
I haven't had time to implement UniformGroup on all nodes yet. Without that, we won't be able to achieve optimal performance, because the model's matrix groups get mixed in with the material's, causing unnecessary overhead in both backends. Once that's done, I think we'll be able to implement buffer sharing more safely.
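To make the motivation concrete, here is a toy version-tracking illustration (invented for this comment; it mirrors the spirit of three.js's UniformsGroup versioning but is not its actual code): uniforms that change per object and per material live in separate groups, so each group is re-uploaded only when its own version bumps, instead of dragging unrelated data along.

```javascript
// Hypothetical illustration: separate uniform groups per update frequency,
// each re-uploaded only when its own version counter has advanced.
class UniformGroupSketch {
  constructor(name) {
    this.name = name;
    this.version = 0;   // bumped whenever this group's data changes
    this.uploaded = -1; // version last sent to the GPU
  }
  markDirty() { this.version++; }
  needsUpload() { return this.uploaded < this.version; }
  upload() { this.uploaded = this.version; } // would call writeBuffer() here
}

const objectGroup = new UniformGroupSketch('object');     // model matrix etc.
const materialGroup = new UniformGroupSketch('material'); // color, roughness etc.

objectGroup.upload();   // initial upload of both groups
materialGroup.upload();
objectGroup.markDirty(); // only the object's matrices changed this frame
// Now only objectGroup needs re-uploading; materialGroup is untouched.
```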