cccl [BUG]: Either add a public facing header for `<algorithm>` or move `<algorithm>` implementations into a detail or experimental namespace

Today:

Many <algorithm> implementations exist in cuda::std::, which is a public facing namespace.
They can be included with <cuda/std/__algorithm_> or <cuda/std/__algorithm/${ALGO}.h> (ALGO=fill, ALGO=equal), which are not public facing headers.
<cuda/std/algorithm>, a public facing header, does not exist.
<cuda/algorithm> is a public facing header and exposes cuda::fill_bytes, cuda::copy_bytes, etc, but does not include any of the <algorithm> implementations in cuda::std::.
Some cuda::std:: <algorithm> facilities are used in other public facing headers. This is weird and bad. For example, if you include <cuda/std/array>, you get cuda::std::equal.

We cannot have it both ways. Pick one of the following two options:

If the cuda::std:: <algorithm> implementations that exist today are not ready for public exposure, then put them in a detail or experimental namespace.
If the cuda::std:: <algorithm> implementations are ready for public exposure, then provide public headers for them, e.g. <cuda/std/algorithm>. Remember, we have had partial header implementations in <cuda/std/*> in the past.
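To make the two options concrete, here is a minimal sketch in plain C++. Everything in it is hypothetical: `mylib` stands in for `cuda::std`, and `detail::equal_impl` is an illustrative name, not an existing CCCL symbol. The first part shows a container using an algorithm that lives only in a detail namespace; the second shows the same implementation deliberately exported under a public name.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical library; the names below are assumptions, not CCCL's layout.
namespace mylib {
namespace detail {
// Option 1: the implementation is only reachable through a detail namespace.
template <class It1, class It2>
bool equal_impl(It1 first1, It1 last1, It2 first2) {
  for (; first1 != last1; ++first1, ++first2) {
    if (!(*first1 == *first2)) {
      return false;
    }
  }
  return true;
}
} // namespace detail

// A public container may use the detail implementation without
// making any algorithm name part of the public surface.
template <class T, std::size_t N>
struct array {
  T elems[N];
};

template <class T, std::size_t N>
bool operator==(const array<T, N>& lhs, const array<T, N>& rhs) {
  return detail::equal_impl(lhs.elems, lhs.elems + N, rhs.elems);
}

// Option 2: the algorithm is intentionally public. In a real library this
// declaration would live in a public header.
template <class It1, class It2>
bool equal(It1 first1, It1 last1, It2 first2) {
  return detail::equal_impl(first1, last1, first2);
}
} // namespace mylib

// Small usage check for the sketch above.
inline void example_usage() {
  mylib::array<int, 3> a{{1, 2, 3}};
  mylib::array<int, 3> b{{1, 2, 3}};
  assert(a == b);                                       // uses detail only
  assert(mylib::equal(a.elems, a.elems + 3, b.elems));  // public entry point
}
```

With only the first part in place, including the container header gives users no public algorithm name to depend on. Adding the second part is an explicit decision to support it.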

Nov 11 '25 19:11 brycelelbach

FYI, we have the last piece here: #3741. The holdup is the recursive implementation of the sorting algorithms, which leads to terrible code on a GPU. We have ideas on how to fix it. Thrust has a serial radix sort implementation. We could also just do a simple insertion sort. We just never found the time to push it over the finish line. @miscco is working on it.
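For reference, the "simple insertion sort" idea can be sketched as below. This is plain C++ with a hypothetical `insertion_sort` helper, not code from #3741 or from Thrust. It uses only loops, so there is no recursion and stack usage stays constant. A device version would presumably also need `__host__ __device__` annotations, which are left out here so the sketch compiles with any C++ compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper: iterative, stable insertion sort over a raw range.
// O(n^2) comparisons, so it is only reasonable for small inputs.
template <class T, class Compare>
void insertion_sort(T* first, std::size_t n, Compare comp) {
  for (std::size_t i = 1; i < n; ++i) {
    T key = std::move(first[i]);
    std::size_t j = i;
    // Shift elements that compare greater than key one slot to the right.
    // Equal elements are not shifted, which keeps the sort stable.
    while (j > 0 && comp(key, first[j - 1])) {
      first[j] = std::move(first[j - 1]);
      --j;
    }
    first[j] = std::move(key);
  }
}

// Small usage check for the sketch above.
inline void example_usage() {
  int values[] = {5, 2, 4, 1, 3};
  insertion_sort(values, 5, [](int a, int b) { return a < b; });
  for (std::size_t i = 1; i < 5; ++i) {
    assert(values[i - 1] <= values[i]);
  }
}
```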

Nov 11 '25 20:11 bernhardmgruber

I would disagree that we must publicly provide a partially implemented feature just because we use parts of it somewhere.

I agree that we should have <cuda/std/algorithm>; it is, as always, just a question of time constraints.

The sorting algorithms the C++ standard libraries provide and that we inherit from libc++ are relatively hostile to GPU execution.

That means we would have to rewrite one of the most complex and performance sensitive algorithms out there. This is neither trivial nor worth it right now.

I am strongly considering just leaving all sort related algorithms unimplemented and keeping that as an open issue.

I also really want to write something about our stable but not stable header architecture.

It is incredibly annoying to include 200k LoC for just cuda::std::max. We should be able to allow our users to include <cuda/std/__algorithm/meow.h>. Cursed be the extensionless headers.

Nov 12 '25 09:11 miscco

I have updated #3741 to not expose the sorting algorithms but publicize everything else.

Nov 12 '25 10:11 miscco