cccl [FEA]: Implement GPU friendly single-threaded sorting algorithms

Is this a duplicate?

[x] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct

Area

libcu++

Is your feature request related to a problem? Please describe.

We currently do not expose cuda::std:: sorting algorithms, because they are commonly implemented through recursion and that might blow up the GPU stack.

We need to spend time to implement a proper GPU friendly, single-threaded implementation to expose

sort
stable_sort
nth_element
inplace_merge

Describe the solution you'd like

There are a lot of nice sorting networks available, that are already in the libc++ implementation.

We should investigate if we can use some larger ones to properly implement a non-recursive sort on device

Note that we do not want to have implementation divergence between host and device

Also this is regarding serial algorithms. For parallel algorithms we will use the CUB backend

Describe alternatives you've considered

No response

Additional context

No response

Nov 12 '25 10:11 miscco

Also this is regarding serial algorithms.

I think it's also likely that other threads in the same warp are also running a serial sort on their (different) data. So I think we are also looking for a sort algorithm that minimizes divergence among neighboring threads. Ideally, the control flow would not depend on the data being sorted, which brings us to sorting networks again.

That's how I imagine this:

template <int ItemsPerThread>
__global__ void kernel(int* data) {
  data += (blockIdx.x * blockDim.x + threadIdx.x) * ItemsPerThread;
  int items[ItemsPerThread];
  for (int i = 0; i < ItemsPerThread; i++)
    items[i] = data[i];
  cuda::std::sort(items, items + ItemsPerThread); // each thread sorts it's own local data
  ...
}

Nov 12 '25 10:11 bernhardmgruber

Fun fact the current implementation already utilizes sorting networks of size 3, 4 and 5

Nov 12 '25 10:11 miscco

I was looking for cuda::std::sort in this PR https://github.com/NVIDIA/cccl/pull/6585. I ended up using cuda::std::partial_sort on the full range for a statistical test. Would be great to replace this with proper sort.

Nov 12 '25 11:11 RAMitchell

Note that we do not want to have implementation divergence between host and device

This should not be a hard requirement. An optimal sequential sort may very well be different on a CPU vs GPU thread.

Nov 12 '25 15:11 jrhemstad