three.js icon indicating copy to clipboard operation
three.js copied to clipboard

WebGPURenderer: Workgroup Arrays and Barrier Support

Open cmhhelgeson opened this issue 1 year ago • 8 comments

Description

Add ability to create workgroup and private arrays within compute shaders, which can be used to accelerate compute operations. Ideally could be used for providing pre-written compute operations that are fast and useful out of the box ( bitonic sort, prefix sum ). This would probably be less useful to the end user, though those only targeting WebGPU devices may find some benefit out of using this functionality.

If requested, I can try to provide samples for some of this functionality, like porting the existing WebGPU Bitonic sort sample or doing something with spatial hashing and prefix sums, though this will likely require the ability to query the value of local_invocation_id and workgroup_id within TSL.

  • [x] Node implementation
  • [x] Storage Buffer Sample Fix
  • [ ] Bespoke Workgroup Array Sample

cmhhelgeson avatar Aug 20 '24 23:08 cmhhelgeson

šŸ“¦ Bundle size

Full ESM build, minified and gzipped.

Before After Diff
WebGL 685.24
169.64
685.24
169.64
+0 B
+0 B
WebGPU 826.41
221.63
827.96
222.09
+1.55 kB
+466 B
WebGPU Nodes 825.99
221.54
827.54
222
+1.55 kB
+461 B

🌳 Bundle size after tree-shaking

Minimal build including a renderer, camera, empty scene, and dependencies.

Before After Diff
WebGL 462.02
111.48
462.02
111.48
+0 B
+0 B
WebGPU 525.5
141.64
526.17
141.84
+671 B
+193 B
WebGPU Nodes 482.15
131.46
482.83
131.65
+671 B
+192 B

github-actions[bot] avatar Aug 20 '24 23:08 github-actions[bot]

I'm also not really a fan of the ScopedArrayNode name. Willing to take any suggestions on a name that would make more sense ( ComputeArrayNode, ComputeLocalArrayNode, ComputeAccess, etc)

cmhhelgeson avatar Aug 20 '24 23:08 cmhhelgeson

I’m not a big fan of using "Scope" in a node name either. How about WorkgroupInfoNode? It clearly indicates workgroup-level data and aligns with WebGPU and WGSL terminology. Or maybe DomainArrayNode?

For example:


export const workgroupArray = ( type, count ) => nodeObject( new WorkgroupInfoNode( 'Workgroup', type, count ) );
export const privateArray = ( type, count ) => nodeObject( new WorkgroupInfoNode( 'Private', type, count ) );

By the way are you planning on trying your WebGPU SPH Simulation with TSL @cmhhelgeson? šŸ˜„

RenaudRohlinger avatar Aug 22 '24 11:08 RenaudRohlinger

I’m not a big fan of using "Scope" in a node name either. How about WorkgroupInfoNode? It clearly indicates workgroup-level data and aligns with WebGPU and WGSL terminology. Or maybe DomainArrayNode?

For example:

export const workgroupArray = ( type, count ) => nodeObject( new WorkgroupInfoNode( 'Workgroup', type, count ) );
export const privateArray = ( type, count ) => nodeObject( new WorkgroupInfoNode( 'Private', type, count ) );

By the way are you planning on trying your WebGPU SPH Simulation with TSL @cmhhelgeson? šŸ˜„

Naming:

I'll change the name to WorkgroupInfoNode. I'll also remove privateArray for now. I don't really see it's utility when WGSLNodeBuilder already constructs all code within one function body. However, I'll leave the scope property just in case there are other potential workgroup local variable types ( var, etc ). Maybe down the line, we can decide whether we want to rename the class if we create a separate class for workgroup variables holding a single value.

Future Plans

My current plan is:

  1. Get current pull requests merged in
  2. Finish protoplanet port ( people seem to like ports of old samples and InstancedPointsNodeMaterial needs to be fixed)
  3. Finish extant pull requests that can be finished (post-processing, arrayCamera, etc)
  4. Do required reading on subgroups, physics, workgroup sync, maybe atomics for a week to get myself back up to speed.
  5. Move onto new/more creative/more demanding uses of compute like SPH, Instanced Points FLIP, Spatial Hash Collisions with Workgroup or Subgroup Sync, etc.

So TLDR: Yes 😊

cmhhelgeson avatar Aug 22 '24 15:08 cmhhelgeson

Would it be complicated to have an example using these features in this PR?

sunag avatar Aug 24 '24 01:08 sunag

Would it be complicated to have an example using these features in this PR?

Shouldn't be too complicated, I can write one using invocationLocalIndex.

cmhhelgeson avatar Aug 24 '24 02:08 cmhhelgeson

Moved to draft until samples are created.

cmhhelgeson avatar Aug 26 '24 16:08 cmhhelgeson

@sunag The WebGPUBackend side of the Storage buffer sample is now fixed with the addition of a single workgroupBarrier() call. This call prevents data from being accessed and written to at the same time. This is separate from the addition of a new sample that will complete this pull request.

https://github.com/user-attachments/assets/1b567f93-40d6-41be-90fe-7835e4461abc

cmhhelgeson avatar Aug 27 '24 22:08 cmhhelgeson

Just wanted to give a brief update since this took longer than originally expected. Two things have happened:

  1. Moving for a new job, so haven't had the time to give PRs proper attention.
  2. Sort seems to work under certain conditions but there's some weirdness with the uniforms that I haven't been able to figure out yet. I'll add more detail below once I've figured out what the exact issue is.

cmhhelgeson avatar Sep 03 '24 02:09 cmhhelgeson

Sort is now working:

https://github.com/user-attachments/assets/dc1662a5-1b68-41c9-afd4-9bbcaf312c1b

cmhhelgeson avatar Sep 04 '24 02:09 cmhhelgeson

@cmhhelgeson Looks great! I will review and merge it soon, thanks

sunag avatar Sep 06 '24 03:09 sunag

Right now the example mixes local (workgroup) and global swaps, but it might be worth considering completing all the local sorting first before moving on to global sorting.

This could better reflect how bitonic sort is typically optimized for parallel processing:

Phase 1: Perform all local (workgroup) swaps (flip and disperse) within each group. Phase 2: Once the local sorting is done, proceed to global sorting across workgroups to finalize the order.

This approach could make the example easier to understand by clearly separating the local and global phases, making the sorting process more educational. Just a thought!

RenaudRohlinger avatar Sep 06 '24 13:09 RenaudRohlinger

Right now the example mixes local (workgroup) and global swaps, but it might be worth considering completing all the local sorting first before moving on to global sorting.

This could better reflect how bitonic sort is typically optimized for parallel processing:

Phase 1: Perform all local (workgroup) swaps (flip and disperse) within each group. Phase 2: Once the local sorting is done, proceed to global sorting across workgroups to finalize the order.

This approach could make the example easier to understand by clearly separating the local and global phases, making the sorting process more educational. Just a thought!

Maybe I'm misunderstanding your suggestion, but the example already does this. It will peform only local swaps until the span of a swap necessitates that the swap be performed globally. The purpose of the computeAlgo function is to ensure that the correct swap function is executed given the span length. It's not mixing global and local swaps at random.

EDIT: For instance, in the debug panel of the reference implementation, whose code I've ported into TSL, Next Step will always be a Local step until the Next Swap Span exceeds workgroup_size * 2: https://webgpu.github.io/webgpu-samples/sample/bitonicSort

cmhhelgeson avatar Sep 06 '24 13:09 cmhhelgeson

Oh I see, the two panels were misleading me. Then all good! I guess having just the left example could be easier to understand? Since the local already swaps to global when needed, demonstrating both features.

RenaudRohlinger avatar Sep 06 '24 13:09 RenaudRohlinger

Oh I see, the two panels were misleading me. Then all good! I guess having just the left example could be easier to understand? Since the local already swaps to global when needed, demonstrating both

Maybe we could color code local amd global swaps somehow.

EDIT: There are probably further performance improvements that could come down the line with the implementation of switch statements, but for now, I'll try to determine a way to make to indicate the locality of a sort in an interesting visual manner before closing this out.

cmhhelgeson avatar Sep 06 '24 13:09 cmhhelgeson

@RenaudRohlinger The example has been updated to more clearly demonstrate when a local sort or when a global sort is occurring.

cmhhelgeson avatar Sep 08 '24 23:09 cmhhelgeson

Probably the best example of a sorting algorithm I've ever seen! šŸ˜„

RenaudRohlinger avatar Sep 09 '24 01:09 RenaudRohlinger

Is there anything else that needs to be done here?

cmhhelgeson avatar Sep 10 '24 16:09 cmhhelgeson