[FEA]: Implement GPU friendly single-threaded sorting algorithms
Is this a duplicate?
- [x] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct
Area
libcu++
Is your feature request related to a problem? Please describe.
We currently do not expose cuda::std:: sorting algorithms, because they are commonly implemented through recursion and that might blow up the GPU stack.
We need to spend time to implement a proper GPU friendly, single-threaded implementation to expose
-
sort -
stable_sort -
nth_element -
inplace_merge
Describe the solution you'd like
There are a lot of nice sorting networks available, that are already in the libc++ implementation.
We should investigate if we can use some larger ones to properly implement a non-recursive sort on device
Note that we do not want to have implementation divergence between host and device
Also this is regarding serial algorithms. For parallel algorithms we will use the CUB backend
Describe alternatives you've considered
No response
Additional context
No response
Also this is regarding serial algorithms.
I think it's also likely that other threads in the same warp are also running a serial sort on their (different) data. So I think we are also looking for a sort algorithm that minimizes divergence among neighboring threads. Ideally, the control flow would not depend on the data being sorted, which brings us to sorting networks again.
That's how I imagine this:
template <int ItemsPerThread>
__global__ void kernel(int* data) {
data += (blockIdx.x * blockDim.x + threadIdx.x) * ItemsPerThread;
int items[ItemsPerThread];
for (int i = 0; i < ItemsPerThread; i++)
items[i] = data[i];
cuda::std::sort(items, items + ItemsPerThread); // each thread sorts it's own local data
...
}
Fun fact the current implementation already utilizes sorting networks of size 3, 4 and 5
I was looking for cuda::std::sort in this PR https://github.com/NVIDIA/cccl/pull/6585. I ended up using cuda::std::partial_sort on the full range for a statistical test. Would be great to replace this with proper sort.
Note that we do not want to have implementation divergence between host and device
This should not be a hard requirement. An optimal sequential sort may very well be different on a CPU vs GPU thread.