VkRadixSort
Compared to cub radix sort
According to my tests against CUB's device radix sort, this implementation takes about 3 times as long as CUB for 16<<20 (about 16.8 million) uint32_t elements: roughly 4 ms versus 1.3 ms on an RTX 4090.
As far as I know, CUB uses decoupled look-back to speed up the scan step. Would there be any interest in making this more efficient by switching to that state-of-the-art scan algorithm?
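To make the suggestion concrete, here is a minimal CPU-side sketch of the decoupled look-back idea, written in C++ with one `std::thread` standing in for each GPU workgroup. It is not taken from CUB or from this repository; the function name `lookback_inclusive_scan`, the flag names, and the 64-bit packing of flag plus value are all my own assumptions for illustration. Each tile publishes its local aggregate right away, then walks backwards over its predecessors, summing aggregates until it reaches a tile that has already published a full inclusive prefix.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical per-tile status: a flag in the upper 32 bits, a value in the lower 32.
enum : uint64_t { FLAG_INVALID = 0, FLAG_AGGREGATE = 1, FLAG_PREFIX = 2 };

static uint64_t pack(uint64_t flag, uint32_t value) { return (flag << 32) | value; }
static uint64_t flag_of(uint64_t w) { return w >> 32; }
static uint32_t value_of(uint64_t w) { return static_cast<uint32_t>(w); }

// In-place inclusive prefix sum (mod 2^32) using one thread per tile.
// Illustration only: keep the number of tiles small, since each tile gets a thread.
void lookback_inclusive_scan(std::vector<uint32_t>& data, size_t tile_size) {
    assert(tile_size > 0);
    const size_t num_tiles = (data.size() + tile_size - 1) / tile_size;
    std::vector<std::atomic<uint64_t>> status(num_tiles);
    for (auto& s : status) s.store(pack(FLAG_INVALID, 0));

    auto worker = [&](size_t t) {
        const size_t begin = t * tile_size;
        const size_t end = std::min(begin + tile_size, data.size());

        // 1. Reduce the tile locally.
        uint32_t aggregate = 0;
        for (size_t i = begin; i < end; ++i) aggregate += data[i];

        // 2. Publish immediately; tile 0's aggregate is already its inclusive prefix.
        status[t].store(pack(t == 0 ? FLAG_PREFIX : FLAG_AGGREGATE, aggregate),
                        std::memory_order_release);

        // 3. Look back: add predecessor aggregates until an inclusive prefix is found.
        uint32_t exclusive = 0;
        if (t > 0) {
            for (size_t p = t; p-- > 0;) {
                uint64_t w;
                while (flag_of(w = status[p].load(std::memory_order_acquire)) == FLAG_INVALID)
                    std::this_thread::yield();  // predecessor has not published yet
                exclusive += value_of(w);
                if (flag_of(w) == FLAG_PREFIX) break;
            }
            // Upgrade our own status so later tiles can stop their look-back here.
            status[t].store(pack(FLAG_PREFIX, exclusive + aggregate),
                            std::memory_order_release);
        }

        // 4. Scan the tile, seeded with the exclusive prefix of all earlier tiles.
        uint32_t running = exclusive;
        for (size_t i = begin; i < end; ++i) {
            running += data[i];
            data[i] = running;
        }
    };

    std::vector<std::thread> threads;
    for (size_t t = 0; t < num_tiles; ++t) threads.emplace_back(worker, t);
    for (auto& th : threads) th.join();
}
```

Calling `lookback_inclusive_scan(v, 3)` on a vector holding 1 through 10 scans it in four tiles. The point of the scheme is that the whole scan reads and writes the data in a single pass, with tiles only waiting on a short chain of predecessors. A GPU version would additionally need forward-progress guarantees between workgroups and suitable memory-scope semantics, which this sketch does not address.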