Nabla
Nabla copied to clipboard
Single Dispatch Radix Sort
Description
Port ext::RadixSort to the New API and promote it to core at the same time.
Description of the related problem
Take example from the video::CScanner and implement a video::CRadixSorter<key_uint_t,reference_uint_t> which performs the all radix sort steps in a single dispatch.
Solution proposal
Override the default scanning shader and add more work to the first and last workgroup.
First prototype can do 1 dispatch per radix.
Additional context
Atomic spinlock synchronisation of the scan works like this...
Previous workgroups atomically increment by 1 a unique per-level counter at offset localWorkgroupIndexInThisLevel/WORKGROUP_SIZE when they finish computation (every such offset is assigned to a higher level workgroup).
The next level workgroup spins on the counter assigned to it, until it reaches a value equal to the total number of preceeding workgroups feeding data to this workgroup.
Extending this to arbitrary number of scans with work on the original elements inbetween, just requires careful allocation of synchronisation counters.
To tackle in 2022, we don't need this yet.
Cannot be done without UB, only the Upsweep Reduction Phase can be single dispatch, downsweep needs multiple