BlockRadixSort needs overloads that take the problem size and correctly set the padding value for unused inputs
I am trying cub::BlockRadixSort with PyTorch. It is getting good performance, but I find it hard to use:
For example, if I want to sort 1023 elements, then I would use 256 threads with 4 elements per thread, for a total of 1024 elements. So I have to add one padding element.
Assuming that cub::BlockRadixSort is stable (please confirm and document it!), if the key type is int, then I can simply pad with INT_MAX; after the sort, the padding value remains in the last position and I can discard it.
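A minimal sketch of that padding approach (the kernel name is made up, and it assumes the 256-thread, 4-items-per-thread configuration from above, launched as a single block):

#include <cub/cub.cuh>
#include <climits>

// Sorts 1023 ints with one block by padding the 1024th slot with INT_MAX.
// Launch with <<<1, 256>>>.
__global__ void sort1023(int* data)
{
    constexpr int BLOCK_THREADS    = 256;
    constexpr int ITEMS_PER_THREAD = 4;
    constexpr int NUM_ELEMENTS     = 1023;

    using BlockRadixSort = cub::BlockRadixSort<int, BLOCK_THREADS, ITEMS_PER_THREAD>;
    __shared__ typename BlockRadixSort::TempStorage temp_storage;

    // Load a blocked arrangement, filling the one unused slot with INT_MAX
    int items[ITEMS_PER_THREAD];
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
    {
        int idx = threadIdx.x * ITEMS_PER_THREAD + i;
        items[i] = (idx < NUM_ELEMENTS) ? data[idx] : INT_MAX;
    }

    BlockRadixSort(temp_storage).Sort(items);

    // The INT_MAX pad sorts to the last slot, so we simply don't write it back
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
    {
        int idx = threadIdx.x * ITEMS_PER_THREAD + i;
        if (idx < NUM_ELEMENTS)
            data[idx] = items[i];
    }
}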
But if the key type is float, it becomes more complicated. I have to pad with a nan to make the padding value stay at the end, because after the bit transformation the largest key is a nan. According to IEEE 754 there are multiple different nans, and I need to make sure that my padding nan is no smaller than any nan in the user's input. Doing so requires me to rely on the implementation detail of how cub transforms bits, and it might not be forward compatible if cub changes how bits are transformed.
A PyTorch program to demonstrate this problem is:
>>> a = torch.full((4096,), math.nan, device='cuda')
>>> a.sort()
torch.return_types.sort(
values=tensor([nan, nan, nan, ..., nan, nan, nan], device='cuda:0'),
indices=tensor([ 0, 1, 2, ..., 4093, 4094, 4095], device='cuda:0'))
>>> a.view(torch.int) # reinterpret bits as int
tensor([2143289344, 2143289344, 2143289344, ..., 2143289344, 2143289344,
2143289344], device='cuda:0', dtype=torch.int32)
>>> a.view(torch.int)[0].fill_(2147483647)
tensor(2147483647, device='cuda:0', dtype=torch.int32)
>>> a.view(torch.int)
tensor([2147483647, 2143289344, 2143289344, ..., 2143289344, 2143289344,
2143289344], device='cuda:0', dtype=torch.int32)
>>> a.sort()
torch.return_types.sort(
values=tensor([nan, nan, nan, ..., nan, nan, nan], device='cuda:0'),
indices=tensor([ 1, 2, 3, ..., 4094, 4095, 0], device='cuda:0'))
Both 2143289344 and 2147483647 are nans when reinterpreting their bits as float, but 2147483647 is a larger nan.
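For reference, 2143289344 is 0x7FC00000 (the canonical quiet nan with an empty payload) and 2147483647 is 0x7FFFFFFF (a nan with every payload bit set). The usual order-preserving bit transform that radix sorts apply to floats explains why the latter ends up last; cub's actual implementation lives in cub::Traits and may differ in detail, so this host-side snippet is only for illustration:

#include <cstdio>
#include <cstring>

// The usual order-preserving transform for IEEE-754 float keys:
// negative values get all bits flipped, non-negative values get the sign bit set.
unsigned int float_to_radix_key(float f)
{
    unsigned int bits;
    memcpy(&bits, &f, sizeof(bits));
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}

int main()
{
    float quiet_nan, payload_nan;
    unsigned int a = 2143289344u;   // 0x7FC00000, canonical quiet nan
    unsigned int b = 2147483647u;   // 0x7FFFFFFF, nan with all payload bits set
    memcpy(&quiet_nan, &a, sizeof(float));
    memcpy(&payload_nan, &b, sizeof(float));
    printf("%08x %08x\n", float_to_radix_key(quiet_nan), float_to_radix_key(payload_nan));
    // prints "ffc00000 ffffffff": the all-ones-payload nan becomes the largest key
}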
To make it easier to find the correct padding values, I suggest the following change. We can either:
- Add an API that provides the largest and lowest padding values for radix sort and radix sort descending, something like cub::limits<float>::largest_radix(); or
- Modify the behavior of cub's sorting to make sure that it is stable with respect to different nans, that is:
  sort([2143289344, 2147483647]) == [2143289344, 2147483647] != sort([2147483647, 2143289344]) == [2147483647, 2143289344]
cc: @ngimel
BlockRadixSort is stable, and the docs should be updated soon to say so. Note that this stability treats different NaN bit representations as distinct values, so different NaNs will be reordered relative to each other.
> Add an API that provides the largest and lowest padding values for radix sort and radix sort descending.
See cub::Traits<T>::LOWEST_KEY and cub::Traits<T>::MAX_KEY. These are used for this purpose in the cub::DeviceRadixSort implementation:
static const UnsignedBits LOWEST_KEY = Traits<KeyT>::LOWEST_KEY;
static const UnsignedBits MAX_KEY = Traits<KeyT>::MAX_KEY;
UnsignedBits default_key = (IS_DESCENDING) ? LOWEST_KEY : MAX_KEY;
https://github.com/NVIDIA/cub/blob/main/cub/agent/agent_radix_sort_downsweep.cuh#L114-L115
https://github.com/NVIDIA/cub/blob/main/cub/agent/agent_radix_sort_downsweep.cuh#L504
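If you need the padding value as a T (for instance to pass it to BlockLoad::Load), you can bit-cast those constants back into the key type. An untested sketch, assuming (as the downsweep code above suggests) that LOWEST_KEY / MAX_KEY are the raw bit patterns of the most extreme keys; radix_sort_pad_value is just a placeholder name:

#include <cub/cub.cuh>
#include <cstring>

// Bit-cast cub::Traits<T>::MAX_KEY (or LOWEST_KEY for descending sorts)
// back into the key type, e.g. to use as BlockLoad's out-of-bounds default.
template <typename T>
__host__ __device__ T radix_sort_pad_value(bool descending = false)
{
    using Bits = typename cub::Traits<T>::UnsignedBits;
    Bits bits  = descending ? Bits(cub::Traits<T>::LOWEST_KEY)
                            : Bits(cub::Traits<T>::MAX_KEY);
    T value;
    memcpy(&value, &bits, sizeof(T));
    return value;
}

// radix_sort_pad_value<int>() yields INT_MAX; radix_sort_pad_value<float>()
// yields a nan bit pattern that no input key sorts after.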
That should work for now, let me know if it's not sufficient.
I think that ideally cub::BlockRadixSort should provide overloads that take the problem size and handle the padding automatically. Users shouldn't have to worry about these low-level details.
I use this snippet in my code to sort values in shared or global memory, with automatic padding:
#include <cub/cub.cuh>
#include <limits>

template <int BLOCK_THREADS, int ITEMS_PER_THREAD, typename T>
__inline__ __device__
void blockSort(T* mem, const int numElements)
{
    // Specialize BlockLoad, BlockStore, and BlockRadixSort collective types
    typedef cub::BlockLoad<T, BLOCK_THREADS, ITEMS_PER_THREAD, cub::BLOCK_LOAD_TRANSPOSE> BlockLoad;
    typedef cub::BlockStore<T, BLOCK_THREADS, ITEMS_PER_THREAD, cub::BLOCK_STORE_TRANSPOSE> BlockStore;
    typedef cub::BlockRadixSort<T, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSort;

    // Allocate type-safe, repurposable shared memory for collectives
    __shared__ union {
        typename BlockLoad::TempStorage load;
        typename BlockStore::TempStorage store;
        typename BlockRadixSort::TempStorage sort;
    } temp_storage;

    // Load the unsorted items, padding out-of-bounds slots with the maximum value of T
    T items[ITEMS_PER_THREAD];
    BlockLoad(temp_storage.load).Load(mem, items, numElements, std::numeric_limits<T>::max());
    __syncthreads();

    // Collectively sort the items
    BlockRadixSort(temp_storage.sort).Sort(items);
    __syncthreads();

    // Store the sorted items
    BlockStore(temp_storage.store).Store(mem, items, numElements);
    __syncthreads();
}
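For reference, a call site might look like this (the kernel, the segmentOffsets array, and the 256 x 4 configuration are made up for illustration; launch with 256 threads per block):

// Hypothetical kernel: each block sorts one contiguous segment of data in place.
__global__ void sortSegments(float* data, const int* segmentOffsets)
{
    const int begin = segmentOffsets[blockIdx.x];
    const int end   = segmentOffsets[blockIdx.x + 1];
    // Each segment must fit into BLOCK_THREADS * ITEMS_PER_THREAD slots (see below)
    blockSort<256, 4>(data + begin, end - begin);
}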
That seems to work quite nicely, though there are two problems:
- You have to make sure that each thread owns enough elements, i.e. BLOCK_THREADS * ITEMS_PER_THREAD >= numElements. If you don't know the maximum number of elements you want to sort at compile time, then this doesn't work.
- You always sort BLOCK_THREADS * ITEMS_PER_THREAD values, so this might be inefficient if you only have very few values to sort.
@allisonvacanti Is there an easy way to sort a range of arbitrary length within one block? The only way I have so far is to set ITEMS_PER_THREAD to a large value, to make sure all ranges are less than BLOCK_THREADS * ITEMS_PER_THREAD elements. But obviously that has its limits and is inefficient if you have ranges of varying length in different blocks.
Hello, @yboetz! In the following PR, I've introduced a new agent for sorting variable-length arrays at the block scope. It performs a radix-based sort just like the one we use in segmented sorting. Here's the usage example; it also addresses the case of a partially utilized tile.
If that suits your needs, we could expose this algorithm as a block-scope facility (something like cub::BlockSort).
I don't quite follow the example, but what you describe is exactly what I was thinking :). I quite like the block-scope facilities of CUB, but they are sometimes a bit limiting as you need to provide the number of elements at compile time. So it would be quite useful if you could expose an overload of cub::BlockSort that can handle arrays of arbitrary length.
> See cub::Traits<T>::LOWEST_KEY and cub::Traits<T>::MAX_KEY. These are used for this purpose in the cub::DeviceRadixSort implementation:

Perhaps a minor point, but it would be a lot nicer if there were constants of type T, since that's what you need to pass to BlockLoad::Load.
> If that suits your needs, we could expose this algorithm as a block-scope facility (something like cub::BlockSort).

This will be very much appreciated. Currently the agent implementation is a bit too complicated to use.
I have added a routine here: https://github.com/chenxuhao/GraphAIBench/blob/41ec6d85ff9af5acee1980919aff0f8bdd10b3e5/include/sort.cuh#L7
Please let me know if this makes sense to you.
@chenxuhao, I've attached a fixed version of your routine here. Currently, we are not ready to expose it though.
> @chenxuhao, I've attached a fixed version of your routine here. Currently, we are not ready to expose it though.
thanks a lot!
> @chenxuhao, I've attached a fixed version of your routine here. Currently, we are not ready to expose it though.
Also, I wonder if it is possible to create a similar routine, say warp_sort, which does warp-wise sorting instead of CTA-wise? Basically, each warp would work on sorting its own array.
If this is not currently available in CUB, then I suppose what I have to do is create an agent that does this warp-wise?
Hello, @chenxuhao! Does cub::WarpMergeSort work for you? I don't think you need an agent for that. Agents are used to share the implementation of device-scope algorithms in CUB.
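A minimal usage sketch (the kernel name and sizes are made up and untested; each warp of the block sorts its own fixed-size chunk of 128 keys):

#include <cub/cub.cuh>

struct CustomLess
{
    __device__ bool operator()(int lhs, int rhs) const { return lhs < rhs; }
};

// Made-up example: each warp sorts its own chunk of 32 * 4 = 128 keys.
// Launch with 128 threads per block (4 warps).
__global__ void warpSortKernel(int* data)
{
    constexpr int WARP_THREADS     = 32;
    constexpr int ITEMS_PER_THREAD = 4;
    constexpr int WARPS_PER_BLOCK  = 4;

    using WarpMergeSortT = cub::WarpMergeSort<int, ITEMS_PER_THREAD, WARP_THREADS>;
    __shared__ typename WarpMergeSortT::TempStorage temp_storage[WARPS_PER_BLOCK];

    const int warp_id = threadIdx.x / WARP_THREADS;
    const int lane    = threadIdx.x % WARP_THREADS;
    const int base    = (blockIdx.x * WARPS_PER_BLOCK + warp_id) * WARP_THREADS * ITEMS_PER_THREAD;

    // Blocked arrangement: each thread holds ITEMS_PER_THREAD consecutive keys
    int keys[ITEMS_PER_THREAD];
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        keys[i] = data[base + lane * ITEMS_PER_THREAD + i];

    WarpMergeSortT(temp_storage[warp_id]).Sort(keys, CustomLess{});

    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        data[base + lane * ITEMS_PER_THREAD + i] = keys[i];
}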
> Hello, @chenxuhao! Does cub::WarpMergeSort work for you? I don't think you need an agent for that. Agents are used to share the implementation of device-scope algorithms in CUB.
Thanks Georgy. I need warp-level sorting for variable-length arrays, but cub::WarpMergeSort can only do fixed-length sorting, right? This is the same reason we use this agent for CTA-level sorting: BlockMergeSort and BlockRadixSort can only do fixed-length sorting.