VkRadixSort

Compared to cub radix sort

Open LRLVEC opened this issue 1 year ago • 2 comments

In my tests against CUB's device radix sort, this implementation is about 3 times slower when sorting 16<<20 (about 16.8 million) uint32_t elements: roughly 4 ms versus 1.3 ms on an RTX 4090.

As far as I know, CUB uses decoupled look-back to speed up the scan operation. Is there any interest in making this more efficient by switching to the state-of-the-art scan algorithm?

LRLVEC, Jan 19 '24 16:01
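For readers unfamiliar with the decoupled look-back scan mentioned in the issue, here is a minimal CPU-side sketch of the idea in C++. It is not CUB's code and not this repository's shader: worker threads stand in for GPU thread blocks, and the name `lookback_inclusive_scan`, the tile size, and the worker count are all hypothetical. Each tile publishes its local sum right away, then walks backwards over its predecessors' descriptors until it finds one that already holds a full inclusive prefix, so the whole scan finishes in a single pass over the data.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Tile status: nothing published yet, local aggregate only, or full inclusive prefix.
enum : std::uint64_t { kInvalid = 0, kAggregate = 1, kPrefix = 2 };

// Pack status and value into one word so both become visible in a single atomic store.
inline std::uint64_t pack(std::uint64_t status, std::uint32_t value) {
    return (status << 32) | value;
}

// Hypothetical helper: inclusive prefix sum (mod 2^32) computed tile by tile.
std::vector<std::uint32_t> lookback_inclusive_scan(const std::vector<std::uint32_t>& in,
                                                   std::size_t tile_size,
                                                   unsigned num_workers) {
    assert(tile_size > 0 && num_workers > 0);
    const std::size_t n = in.size();
    const std::size_t num_tiles = (n + tile_size - 1) / tile_size;
    std::vector<std::uint32_t> out(n);
    std::vector<std::atomic<std::uint64_t>> desc(num_tiles);
    for (auto& d : desc) d.store(pack(kInvalid, 0), std::memory_order_relaxed);
    std::atomic<std::size_t> next_tile{0};

    auto worker = [&] {
        for (;;) {
            // Tiles are claimed in increasing order, so every predecessor of a
            // claimed tile is already owned by some worker and cannot starve.
            const std::size_t t = next_tile.fetch_add(1, std::memory_order_relaxed);
            if (t >= num_tiles) return;
            const std::size_t begin = t * tile_size;
            const std::size_t end = std::min(n, begin + tile_size);

            // Step 1: reduce the tile locally.
            std::uint32_t aggregate = 0;
            for (std::size_t i = begin; i < end; ++i) aggregate += in[i];

            // Step 2: publish before waiting on anyone. Tile 0 has no
            // predecessors, so its aggregate is already its inclusive prefix.
            desc[t].store(pack(t == 0 ? kPrefix : kAggregate, aggregate),
                          std::memory_order_release);

            // Step 3: look back, summing aggregates until a full prefix is found.
            std::uint32_t exclusive = 0;
            for (std::size_t p = t; p-- > 0;) {
                std::uint64_t d;
                while (((d = desc[p].load(std::memory_order_acquire)) >> 32) == kInvalid)
                    std::this_thread::yield();
                exclusive += static_cast<std::uint32_t>(d);
                if ((d >> 32) == kPrefix) break;
            }

            // Step 4: upgrade the descriptor so later tiles can stop here.
            if (t != 0)
                desc[t].store(pack(kPrefix, exclusive + aggregate),
                              std::memory_order_release);

            // Step 5: scan the tile, seeded with the exclusive prefix.
            std::uint32_t running = exclusive;
            for (std::size_t i = begin; i < end; ++i) {
                running += in[i];
                out[i] = running;
            }
        }
    };

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < num_workers; ++w) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return out;
}
```

For example, `lookback_inclusive_scan(keys, 64, 4)` scans `keys` in tiles of 64 elements with 4 workers. A GPU version would additionally need memory fences, a dynamically assigned tile index, and a parallel look-back window per block, none of which this sketch attempts to model.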