compute
The performance of radix sort is lower than CUB or hipPRIM
Can we do some work to improve the performance of radix_sort_by_key()? In my tests it takes about 11 ms to sort 1M elements, while rocmPRIM (OpenCL) and CUB (CUDA) take ~1.15 ms per 1M elements.
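For anyone who wants a host-side reference point when reproducing numbers like these, here is a minimal CPU sketch of an LSD radix sort by key with a simple timer. It is not Boost.Compute's radix_sort_by_key(), nor what rocPRIM or CUB do internally; the names radix_sort_by_key_ref and time_sort_ms are hypothetical, and 32-bit unsigned keys and values with 8-bit digits are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// CPU reference only (hypothetical helper, not a Boost.Compute API):
// LSD radix sort of 32-bit keys, 8 bits per pass, moving values with keys.
void radix_sort_by_key_ref(std::vector<std::uint32_t>& keys,
                           std::vector<std::uint32_t>& values)
{
    assert(keys.size() == values.size());
    const std::size_t n = keys.size();
    std::vector<std::uint32_t> tmp_keys(n), tmp_values(n);

    for (int shift = 0; shift < 32; shift += 8) {
        // Histogram of the current 8-bit digit.
        std::array<std::size_t, 256> offsets{};
        for (std::size_t i = 0; i < n; ++i)
            ++offsets[(keys[i] >> shift) & 0xFFu];

        // Exclusive prefix sum: counts become output start positions.
        std::size_t sum = 0;
        for (std::size_t& o : offsets) {
            const std::size_t count = o;
            o = sum;
            sum += count;
        }

        // Stable scatter of key/value pairs into the temporary buffers.
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t dst = offsets[(keys[i] >> shift) & 0xFFu]++;
            tmp_keys[dst] = keys[i];
            tmp_values[dst] = values[i];
        }
        keys.swap(tmp_keys);
        values.swap(tmp_values);
    }
}

// Times one sort of n random keys (fixed seed) and returns milliseconds.
// Only the sort itself is timed, not the data generation.
double time_sort_ms(std::size_t n)
{
    std::mt19937 rng(42);
    std::vector<std::uint32_t> keys(n), values(n);
    for (std::size_t i = 0; i < n; ++i) {
        keys[i] = static_cast<std::uint32_t>(rng());
        values[i] = static_cast<std::uint32_t>(i);
    }
    const auto t0 = std::chrono::steady_clock::now();
    radix_sort_by_key_ref(keys, values);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling time_sort_ms(1000000) gives a per-1M-element time on the host. When timing the GPU versions, the equivalent measurement should exclude host-to-device copies and wait for the queue to finish before stopping the clock, otherwise the numbers are not comparable.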
Yeah, the performance of this algorithm has not been improved for some time. We should check how rocPRIM does it and try similar things.
Btw, rocPRIM is not implemented in OpenCL. It's HIP and HC (AMD's C++ AMP).