Allow iterators in cub::DeviceRadixSort
Currently, cub::DeviceRadixSort only supports operating on pointers:
template <typename KeyT, typename ValueT>
static CUB_RUNTIME_FUNCTION cudaError_t SortPairs(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    const KeyT *d_keys_in,
    KeyT *d_keys_out,
    const ValueT *d_values_in,
    ValueT *d_values_out,
    int num_items,
    int begin_bit = 0,
    int end_bit = sizeof(KeyT) * 8,
    cudaStream_t stream = 0,
    bool debug_synchronous = false)
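For context, this pointer-based overload follows CUB's usual two-phase call pattern. A minimal sketch, assuming the device pointers are already allocated:

#include <cub/cub.cuh>

void  *d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
// First call: d_temp_storage is nullptr, so only temp_storage_bytes is computed.
cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                d_keys_in, d_keys_out,
                                d_values_in, d_values_out, num_items);
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Second call: the sort actually runs.
cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                d_keys_in, d_keys_out,
                                d_values_in, d_values_out, num_items);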
It would be good if d_values_in could be an iterator.
One use case is https://github.com/pytorch/pytorch/pull/53841. In that PR, we are working on a sorting problem where the input keys are random numbers and the input values are 0, 1, 2, 3, ..., N. Currently, we have to allocate and fill a memory buffer just to hold 0, 1, 2, ..., N, which is not optimal. It would be nice if we could do something like:
cub::CountingInputIterator<int> iter(0);
cub::DeviceRadixSort::SortPairs(..., /*d_values_in=*/iter, /*d_values_out=*/buffer, ...);
Have you tried the thrust::sort functions? They allow you to do what you are asking for. For example: https://thrust.github.io/doc/group__sorting_gaec4e3610a36062ee3e3d16607ce5ad80.html Just curious whether you have any timing comparisons between cub and thrust for your use case; I guess thrust is probably going to call cub anyway. If you are worried about temporary memory allocation when using thrust, you can use the policy argument, as shown here: https://github.com/NVIDIA/thrust/blob/main/examples/cuda/custom_temporary_allocation.cu
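A minimal sketch of the thrust-based approach, assuming the indices are materialised with thrust::sequence (names here are illustrative):

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

thrust::device_vector<float> keys = /* ... random keys ... */;
thrust::device_vector<int> values(keys.size());
thrust::sequence(values.begin(), values.end());               // fills 0, 1, 2, ..., N-1
thrust::sort_by_key(keys.begin(), keys.end(), values.begin());

Note that this still materialises the values buffer; thrust::sort_by_key sorts in place, so it cannot take a read-only counting iterator for the values.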
@RaulPPelaez Yes, I am aware of thrust::sort. I was actually using thrust::sort and want to migrate to cub. I don't have performance numbers, but thrust's sort performs some device-host synchronizations, which is what motivates this migration.
I agree that this should be done if possible and can look into prioritizing this after I finish updating our benchmarking infrastructure.
Pinging @dumerrill and @canonizer since they know the radix sorting code best -- are y'all aware of any reasons that this wouldn't work? Other device algorithms support iterators, so I'm curious if there's a reason DeviceRadixSort is different.
The main difference is that DeviceRadixSort is a multi-pass algorithm. One could account for taking an arbitrary iterator, but the implementation would require special treatment.
Currently, DeviceRadixSort internally uses a DoubleBuffer with two pointer members, swapping the two buffers with each sorting pass. To account for an iterator, the first sorting pass would take the iterator as input and write the materialised results to memory; from then on, we can continue with the usual DoubleBuffer logic.
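For reference, a sketch of the existing DoubleBuffer-based overload this comment describes (the buffer names are placeholders for pre-allocated device arrays):

#include <cub/cub.cuh>

cub::DoubleBuffer<float> d_keys(d_key_buf, d_key_alt_buf);       // two ptr members, swapped each pass
cub::DoubleBuffer<int>   d_values(d_value_buf, d_value_alt_buf);

void  *d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, d_keys, d_values, num_items);
cudaMalloc(&d_temp_storage, temp_storage_bytes);
cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, d_keys, d_values, num_items);
// After the sort, d_keys.Current() / d_values.Current() point at the sorted data.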
Two options to implement this:
- Materialise the input iterator prior to the sort (IIRC, this is what thrust::sort does): less efficient but easy to implement (see the sketch after this list).
- Give the first sorting pass special treatment: more implementation effort and longer compilation times (since the first sorting pass gets a different template specialisation), but more efficient, saving N memory writes.
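A minimal sketch of the first option under today's API, assuming a pre-allocated d_values_in scratch buffer (thrust::sequence stands in for whatever fills it):

#include <thrust/execution_policy.h>
#include <thrust/sequence.h>

// Materialise the would-be counting iterator into scratch memory...
thrust::sequence(thrust::device, d_values_in, d_values_in + num_items);  // writes 0, 1, ..., num_items-1
// ...then run the pointer-based SortPairs exactly as in the two-phase pattern above.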
Thanks -- that's a good point.
The double buffer approach has some other issues, too -- the current implementation casts away the const-ness of the input and then writes to the const input memory as part of the double buffering. We should fix that when addressing this issue, since the fix would reuse the same code path.
I am working on this in https://github.com/NVIDIA/cub/pull/374