compute-shader-101
Sort experiment
This branch contains an experiment in sorting. It is not intended to be merged, but having a draft PR gives the branch a stable identifier.
The tip contains an implementation mostly adapted from FidelityFX sort, but with a version of warp-local multi-split (WLMS) inspired by Onesweep. In all cases, subgroup operations have been replaced by workgroup shared memory. There are numerous checkpoints, including a mostly working version without WLMS that is closer to the original FidelityFX. Note, however, that this checkpoint exhibits failures consistent with a missing barrier. The tip appears to pass correctness tests, but none of this has been carefully validated.
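For readers unfamiliar with the multi-split step, here is a minimal sequential CPU sketch of what it computes for one tile of keys: each key's stable rank among the keys that share its digit, plus a per-digit histogram. This is not the shader code from the branch. The function names, the 4-bit digit default, and the single-tile framing are all assumptions made for illustration. On the GPU the ranks are computed in parallel (in this branch through workgroup shared memory rather than subgroup operations), but the result should match this model.

```python
def multisplit_ranks(keys, shift, bits=4):
    """Sequential model of multi-split for one tile (hypothetical helper).

    Returns (ranks, histogram): ranks[i] is the number of earlier keys
    in the tile with the same digit as keys[i]; histogram[d] is the
    total number of keys whose digit is d.
    """
    radix = 1 << bits
    mask = radix - 1
    histogram = [0] * radix
    ranks = []
    for key in keys:
        digit = (key >> shift) & mask
        ranks.append(histogram[digit])  # count of same-digit keys seen so far
        histogram[digit] += 1
    return ranks, histogram


def scatter_tile(keys, shift, bits=4):
    """Stable reorder of one tile by the digit at `shift`."""
    ranks, histogram = multisplit_ranks(keys, shift, bits)
    # An exclusive prefix sum of the histogram gives each digit's base offset.
    offsets = []
    total = 0
    for count in histogram:
        offsets.append(total)
        total += count
    mask = (1 << bits) - 1
    out = [0] * len(keys)
    for key, rank in zip(keys, ranks):
        digit = (key >> shift) & mask
        out[offsets[digit] + rank] = key
    return out


def radix_sort(keys, key_bits=32, bits=4):
    """Least-significant-digit radix sort built from repeated stable scatters."""
    for shift in range(0, key_bits, bits):
        keys = scatter_tile(keys, shift, bits)
    return keys
```

Because each pass is stable, repeating the scatter from the least significant digit upward yields a fully sorted array. In a parallel version, every write to shared memory that another thread later reads must be separated from that read by a workgroup barrier; omitting one produces the kind of intermittent failure described above.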
Sort throughput is approximately 1 G elements/s on an M1 Max.