WebGPURenderer: Workgroup Arrays and Barrier Support
Description
Add the ability to create workgroup and private arrays within compute shaders, which can be used to accelerate compute operations. Ideally, this could serve as the basis for providing pre-written compute operations that are fast and useful out of the box ( bitonic sort, prefix sum ). This would probably be less useful to the average end user, though those targeting only WebGPU devices may find some benefit in using this functionality.
If requested, I can try to provide samples of some of this functionality, such as porting the existing WebGPU bitonic sort sample or doing something with spatial hashing and prefix sums, though this will likely require the ability to query the values of local_invocation_id and workgroup_id within TSL.
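For concreteness, the scan pattern a workgroup array would accelerate can be sketched on the CPU. This is an illustrative plain-JavaScript reference of the Hillis–Steele inclusive prefix sum, not code from this PR; on the GPU, the two buffers would live in workgroup memory with a barrier between passes:

```javascript
// Inclusive prefix sum using the Hillis–Steele pattern: on each pass,
// every element adds the value `stride` positions behind it. On the GPU,
// each pass would run across one workgroup, with `src` and `dst` held in
// workgroup-scoped arrays and a workgroupBarrier() between passes.
function inclusivePrefixSum( input ) {

	let src = input.slice();
	let dst = new Array( input.length );

	for ( let stride = 1; stride < input.length; stride *= 2 ) {

		for ( let i = 0; i < input.length; i ++ ) {

			// Elements within `stride` of the start are copied unchanged.
			dst[ i ] = i >= stride ? src[ i - stride ] + src[ i ] : src[ i ];

		}

		// Double-buffer swap; this is why a single shared array isn't enough.
		[ src, dst ] = [ dst, src ];

	}

	return src;

}
```

The double-buffered structure is exactly what makes a workgroup-scoped array attractive: all of the intermediate traffic stays in fast shared memory instead of round-tripping through storage buffers.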
- [x] Node implementation
- [x] Storage Buffer Sample Fix
- [ ] Bespoke Workgroup Array Sample
📦 Bundle size
Full ESM build, minified and gzipped.
| | Before | After | Diff |
|---|---|---|---|
| WebGL | 685.24 kB (169.64 kB) | 685.24 kB (169.64 kB) | +0 B (+0 B) |
| WebGPU | 826.41 kB (221.63 kB) | 827.96 kB (222.09 kB) | +1.55 kB (+466 B) |
| WebGPU Nodes | 825.99 kB (221.54 kB) | 827.54 kB (222 kB) | +1.55 kB (+461 B) |
🌳 Bundle size after tree-shaking
Minimal build including a renderer, camera, empty scene, and dependencies.
| | Before | After | Diff |
|---|---|---|---|
| WebGL | 462.02 kB (111.48 kB) | 462.02 kB (111.48 kB) | +0 B (+0 B) |
| WebGPU | 525.5 kB (141.64 kB) | 526.17 kB (141.84 kB) | +671 B (+193 B) |
| WebGPU Nodes | 482.15 kB (131.46 kB) | 482.83 kB (131.65 kB) | +671 B (+192 B) |
I'm also not really a fan of the ScopedArrayNode name. I'm willing to take any suggestions for a name that would make more sense ( ComputeArrayNode, ComputeLocalArrayNode, ComputeAccess, etc. ).
I'm not a big fan of using "Scope" in a node name either. How about WorkgroupInfoNode? It clearly indicates workgroup-level data and aligns with WebGPU and WGSL terminology. Or maybe DomainArrayNode?
For example:
export const workgroupArray = ( type, count ) => nodeObject( new WorkgroupInfoNode( 'Workgroup', type, count ) );
export const privateArray = ( type, count ) => nodeObject( new WorkgroupInfoNode( 'Private', type, count ) );
By the way, are you planning on trying your WebGPU SPH Simulation with TSL @cmhhelgeson?
Naming:
I'll change the name to WorkgroupInfoNode. I'll also remove privateArray for now; I don't really see its utility when WGSLNodeBuilder already constructs all code within one function body. However, I'll leave the scope property in case there are other potential workgroup-local variable types ( var<private>, for instance ).
Future Plans
My current plan is:
- Get current pull requests merged in
- Finish protoplanet port ( people seem to like ports of old samples, and InstancedPointsNodeMaterial needs to be fixed )
- Finish extant pull requests that can be finished (post-processing, arrayCamera, etc)
- Do required reading on subgroups, physics, workgroup sync, maybe atomics for a week to get myself back up to speed.
- Move onto new/more creative/more demanding uses of compute like SPH, Instanced Points FLIP, Spatial Hash Collisions with Workgroup or Subgroup Sync, etc.
So TL;DR: Yes.
Would it be complicated to have an example using these features in this PR?
Shouldn't be too complicated; I can write one using invocationLocalIndex.
Moved to draft until samples are created.
@sunag The WebGPUBackend side of the Storage Buffer sample is now fixed with the addition of a single workgroupBarrier() call, which prevents the shared data from being read and written at the same time. This is separate from the addition of the new sample that will complete this pull request.
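To illustrate why a single barrier is enough, here is a rough CPU model of a workgroup reversing an array held in shared memory (illustrative plain JavaScript, not how the backend actually schedules invocations): the barrier splits each invocation into a read phase and a write phase, so no invocation writes while another might still be reading.

```javascript
// Each "invocation" i of a workgroup of size n reads shared[ n - 1 - i ]
// and writes shared[ i ], reversing the array in place.

// Without a barrier, later invocations read elements that earlier
// invocations have already overwritten.
function reverseWithoutBarrier( shared ) {

	const n = shared.length;
	for ( let i = 0; i < n; i ++ ) shared[ i ] = shared[ n - 1 - i ];
	return shared;

}

// With a barrier, every invocation finishes its read before any invocation
// writes; the workgroupBarrier() call marks the boundary between the phases.
function reverseWithBarrier( shared ) {

	const n = shared.length;
	const loaded = [];

	for ( let i = 0; i < n; i ++ ) loaded[ i ] = shared[ n - 1 - i ]; // read phase

	// --- workgroupBarrier() ---

	for ( let i = 0; i < n; i ++ ) shared[ i ] = loaded[ i ]; // write phase
	return shared;

}
```

With `[ 1, 2, 3, 4 ]`, the unsynchronized version produces `[ 4, 3, 3, 4 ]` because the second half of the invocations read already-overwritten values, while the barriered version yields the expected `[ 4, 3, 2, 1 ]`.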
https://github.com/user-attachments/assets/1b567f93-40d6-41be-90fe-7835e4461abc
Just wanted to give a brief update since this took longer than originally expected. Two things have happened:
- Moving for a new job, so haven't had the time to give PRs proper attention.
- Sort seems to work under certain conditions but there's some weirdness with the uniforms that I haven't been able to figure out yet. I'll add more detail below once I've figured out what the exact issue is.
Sort is now working:
https://github.com/user-attachments/assets/dc1662a5-1b68-41c9-afd4-9bbcaf312c1b
@cmhhelgeson Looks great! I will review and merge it soon, thanks
Right now the example mixes local (workgroup) and global swaps, but it might be worth considering completing all the local sorting first before moving on to global sorting.
This could better reflect how bitonic sort is typically optimized for parallel processing:
- Phase 1: perform all local (workgroup) swaps (flip and disperse) within each group.
- Phase 2: once the local sorting is done, proceed to global sorting across workgroups to finalize the order.
This approach could make the example easier to understand by clearly separating the local and global phases, making the sorting process more educational. Just a thought!
Maybe I'm misunderstanding your suggestion, but the example already does this. It will perform only local swaps until the span of a swap necessitates that it be performed globally. The purpose of the computeAlgo function is to ensure that the correct swap function is executed for a given span length. It's not mixing global and local swaps at random.
EDIT: For instance, in the debug panel of the reference implementation, whose code I've ported into TSL, Next Step will always be a Local step until the Next Swap Span exceeds workgroup_size * 2: https://webgpu.github.io/webgpu-samples/sample/bitonicSort
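The local-vs-global decision described above can be sketched on the CPU. This is an illustrative plain-JavaScript reference of the flip/disperse bitonic sort, not the sample's actual code; the `steps` bookkeeping mirrors how a step stays local until its span exceeds workgroupSize * 2:

```javascript
// Reference bitonic sort ( length must be a power of two ). `flip` compares
// mirrored pairs within each block; `disperse` compares pairs half a span
// apart. On the GPU, steps whose span fits within workgroupSize * 2 can run
// as local swaps on a workgroup array; larger spans need global swaps
// through storage buffers.
function bitonicSort( data, workgroupSize = 2 ) {

	const n = data.length;
	const steps = []; // records 'local' or 'global' per step, for illustration

	const swap = ( i, j ) => {

		if ( data[ i ] > data[ j ] ) [ data[ i ], data[ j ] ] = [ data[ j ], data[ i ] ];

	};

	for ( let block = 2; block <= n; block *= 2 ) {

		// Flip step: mirrored comparisons within each block.
		steps.push( block <= workgroupSize * 2 ? 'local' : 'global' );
		for ( let start = 0; start < n; start += block ) {

			for ( let i = 0; i < block / 2; i ++ ) swap( start + i, start + block - 1 - i );

		}

		// Disperse steps: halve the span until adjacent pairs are compared.
		for ( let span = block / 2; span >= 2; span /= 2 ) {

			steps.push( span <= workgroupSize * 2 ? 'local' : 'global' );
			for ( let start = 0; start < n; start += span ) {

				for ( let i = 0; i < span / 2; i ++ ) swap( start + i, start + i + span / 2 );

			}

		}

	}

	return { data, steps };

}
```

With eight elements and a workgroup size of 2, only the final flip (span 8) exceeds workgroupSize * 2 and would need a global swap; every other step stays workgroup-local.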
Oh I see, the two panels were misleading me. Then all good! I guess having just the left example could be easier to understand? Since the local already swaps to global when needed, demonstrating both features.
Maybe we could color-code local and global swaps somehow.
EDIT: There are probably further performance improvements that could come down the line with the implementation of switch statements, but for now, I'll try to find a way to indicate the locality of a sort in an interesting visual manner before closing this out.
@RenaudRohlinger The example has been updated to more clearly demonstrate when a local sort or when a global sort is occurring.
Probably the best example of a sorting algorithm I've ever seen!
Is there anything else that needs to be done here?