BlockRadixSort needs overloads that take the problem size and correctly set the padding value for unused inputs
I am trying cub::BlockRadixSort with PyTorch. It is getting good performance, but I find it hard to use:
For example, if I want to sort 1023 elements, then I would use 256 threads with 4 elements per thread, for a total of 1024 elements. So I have to add one padding element.
Assuming that cub::BlockRadixSort is stable (please confirm and document it!), if the key type is int, then I can simply pad with INT_MAX; after the sort, the padding value remains in the last position and I can discard it.
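A minimal sketch of that padding approach (the kernel name is made up, and it assumes the 256-thread, 4-items-per-thread configuration from above, launched as a single block):

#include <cub/cub.cuh>
#include <climits>

// Sorts 1023 ints with one block by padding the 1024th slot with INT_MAX.
// Launch with <<<1, 256>>>.
__global__ void sort1023(int* data)
{
    constexpr int BLOCK_THREADS    = 256;
    constexpr int ITEMS_PER_THREAD = 4;
    constexpr int NUM_ELEMENTS     = 1023;

    using BlockRadixSort = cub::BlockRadixSort<int, BLOCK_THREADS, ITEMS_PER_THREAD>;
    __shared__ typename BlockRadixSort::TempStorage temp_storage;

    // Load a blocked arrangement, filling the one unused slot with INT_MAX
    int items[ITEMS_PER_THREAD];
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
    {
        int idx = threadIdx.x * ITEMS_PER_THREAD + i;
        items[i] = (idx < NUM_ELEMENTS) ? data[idx] : INT_MAX;
    }

    BlockRadixSort(temp_storage).Sort(items);

    // The INT_MAX pad sorts to the last slot, so we simply don't write it back
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
    {
        int idx = threadIdx.x * ITEMS_PER_THREAD + i;
        if (idx < NUM_ELEMENTS)
            data[idx] = items[i];
    }
}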
But if the key type is float, it becomes more complicated. I have to pad with a nan to make the padding value stay at the end, because after the bit transformation the largest key is a nan. According to IEEE 754 there are multiple different nans, and I need to make sure that my padding nan is no smaller than any nan in the user's input. Doing so requires me to rely on the implementation detail of how cub transforms bits, and it might not be forward compatible if cub changes how bits are transformed.
A PyTorch program to demonstrate this problem is:
>>> a = torch.full((4096,), math.nan, device='cuda')
>>> a.sort()
torch.return_types.sort(
values=tensor([nan, nan, nan, ..., nan, nan, nan], device='cuda:0'),
indices=tensor([ 0, 1, 2, ..., 4093, 4094, 4095], device='cuda:0'))
>>> a.view(torch.int) # reinterpret bits as int
tensor([2143289344, 2143289344, 2143289344, ..., 2143289344, 2143289344,
2143289344], device='cuda:0', dtype=torch.int32)
>>> a.view(torch.int)[0].fill_(2147483647)
tensor(2147483647, device='cuda:0', dtype=torch.int32)
>>> a.view(torch.int)
tensor([2147483647, 2143289344, 2143289344, ..., 2143289344, 2143289344,
2143289344], device='cuda:0', dtype=torch.int32)
>>> a.sort()
torch.return_types.sort(
values=tensor([nan, nan, nan, ..., nan, nan, nan], device='cuda:0'),
indices=tensor([ 1, 2, 3, ..., 4094, 4095, 0], device='cuda:0'))
Both 2143289344 and 2147483647 are nans when reinterpreting their bits as float, but 2147483647 is a larger nan.
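For reference, 2143289344 is 0x7FC00000 (the canonical quiet nan with an empty payload) and 2147483647 is 0x7FFFFFFF (a nan with every payload bit set). The usual order-preserving bit transform that radix sorts apply to floats explains why the latter ends up last; cub's actual implementation lives in cub::Traits and may differ in detail, so this host-side snippet is only for illustration:

#include <cstdio>
#include <cstring>

// The usual order-preserving transform for IEEE-754 float keys:
// negative values get all bits flipped, non-negative values get the sign bit set.
unsigned int float_to_radix_key(float f)
{
    unsigned int bits;
    memcpy(&bits, &f, sizeof(bits));
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}

int main()
{
    float quiet_nan, payload_nan;
    unsigned int a = 2143289344u;   // 0x7FC00000, canonical quiet nan
    unsigned int b = 2147483647u;   // 0x7FFFFFFF, nan with all payload bits set
    memcpy(&quiet_nan, &a, sizeof(float));
    memcpy(&payload_nan, &b, sizeof(float));
    printf("%08x %08x\n", float_to_radix_key(quiet_nan), float_to_radix_key(payload_nan));
    // prints "ffc00000 ffffffff": the all-ones-payload nan becomes the largest key
}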
To make it easier to find the correct padding values, I suggest the following change. We can either:
- Add an API that provides the largest and lowest padding values for radix sort and radix sort descending, something like cub::limits<float>::largest_radix(); or
- Modify the behavior of cub's sorting to make sure that it is stable with respect to different nans, that is:
  sort([2143289344, 2147483647]) == [2143289344, 2147483647] != sort([2147483647, 2143289344]) == [2147483647, 2143289344]
cc: @ngimel
BlockRadixSort is stable, and the docs should be updated soon to say so. Note that this stability treats different NaN bit representations as distinct values, so different NaNs will be reordered relative to each other.
> Add an API that provides the largest and lowest padding values for radix sort and radix sort descending.
See cub::Traits<T>::LOWEST_KEY and cub::Traits<T>::MAX_KEY. These are used for this purpose in the cub::DeviceRadixSort implementation:
static const UnsignedBits LOWEST_KEY = Traits<KeyT>::LOWEST_KEY;
static const UnsignedBits MAX_KEY = Traits<KeyT>::MAX_KEY;
UnsignedBits default_key = (IS_DESCENDING) ? LOWEST_KEY : MAX_KEY;
https://github.com/NVIDIA/cub/blob/main/cub/agent/agent_radix_sort_downsweep.cuh#L114-L115
https://github.com/NVIDIA/cub/blob/main/cub/agent/agent_radix_sort_downsweep.cuh#L504
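If you need the padding value as a T (for instance to pass it to BlockLoad::Load), you can bit-cast those constants back into the key type. An untested sketch, assuming (as the downsweep code above suggests) that LOWEST_KEY / MAX_KEY are the raw bit patterns of the most extreme keys; radix_sort_pad_value is just a placeholder name:

#include <cub/cub.cuh>
#include <cstring>

// Bit-cast cub::Traits<T>::MAX_KEY (or LOWEST_KEY for descending sorts)
// back into the key type, e.g. to use as BlockLoad's out-of-bounds default.
template <typename T>
__host__ __device__ T radix_sort_pad_value(bool descending = false)
{
    using Bits = typename cub::Traits<T>::UnsignedBits;
    Bits bits  = descending ? Bits(cub::Traits<T>::LOWEST_KEY)
                            : Bits(cub::Traits<T>::MAX_KEY);
    T value;
    memcpy(&value, &bits, sizeof(T));
    return value;
}

// radix_sort_pad_value<int>() yields INT_MAX; radix_sort_pad_value<float>()
// yields a nan bit pattern that no input key sorts after.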
That should work for now, let me know if it's not sufficient.
I think that ideally cub::BlockRadixSort should provide overloads that take the problem size and handle the padding automatically. Users shouldn't have to worry about these low-level details.
I use this snippet in my code to sort values in shared or global memory, with automatic padding:
#include <cub/cub.cuh>
#include <limits>

template <int BLOCK_THREADS, int ITEMS_PER_THREAD, typename T>
__inline__ __device__
void blockSort(T* mem, const int numElements)
{
    // Specialize BlockLoad, BlockStore, and BlockRadixSort collective types
    typedef cub::BlockLoad<T, BLOCK_THREADS, ITEMS_PER_THREAD, cub::BLOCK_LOAD_TRANSPOSE> BlockLoad;
    typedef cub::BlockStore<T, BLOCK_THREADS, ITEMS_PER_THREAD, cub::BLOCK_STORE_TRANSPOSE> BlockStore;
    typedef cub::BlockRadixSort<T, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSort;

    // Allocate type-safe, repurposable shared memory for collectives
    __shared__ union {
        typename BlockLoad::TempStorage load;
        typename BlockStore::TempStorage store;
        typename BlockRadixSort::TempStorage sort;
    } temp_storage;

    // Load the unsorted items, padding out-of-bounds slots with the maximum value of T
    T items[ITEMS_PER_THREAD];
    BlockLoad(temp_storage.load).Load(mem, items, numElements, std::numeric_limits<T>::max());
    __syncthreads();

    // Collectively sort the items
    BlockRadixSort(temp_storage.sort).Sort(items);
    __syncthreads();

    // Store the sorted items
    BlockStore(temp_storage.store).Store(mem, items, numElements);
    __syncthreads();
}
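For reference, a call site might look like this (the kernel, the segmentOffsets array, and the 256 x 4 configuration are made up for illustration; launch with 256 threads per block):

// Hypothetical kernel: each block sorts one contiguous segment of data in place.
__global__ void sortSegments(float* data, const int* segmentOffsets)
{
    const int begin = segmentOffsets[blockIdx.x];
    const int end   = segmentOffsets[blockIdx.x + 1];
    // Each segment must fit into BLOCK_THREADS * ITEMS_PER_THREAD slots (see below)
    blockSort<256, 4>(data + begin, end - begin);
}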
That seems to work quite nicely, though there are two problems:
- You have to make sure that each thread owns enough elements, i.e. BLOCK_THREADS * ITEMS_PER_THREAD >= numElements. If you don't know the maximum number of elements you want to sort at compile time, then this doesn't work.
- You always sort BLOCK_THREADS * ITEMS_PER_THREAD values, so this might be inefficient if you only have very few values to sort.
@allisonvacanti Is there an easy way to sort a range of arbitrary length within one block? The only way I have so far is to set ITEMS_PER_THREAD to a large value, to make sure all ranges are less than BLOCK_THREADS * ITEMS_PER_THREAD elements. But obviously that has its limits and is inefficient if you have ranges of varying length in different blocks.
Hello, @yboetz! In the following PR, I've introduced a new agent for sorting variable-length arrays at the block scope. It performs a radix-based sort just like the one we use in segmented sorting. Here's the usage example; it also addresses the case of a partially utilized tile.
If that suits your needs, we could expose this algorithm as a block-scope facility (something like cub::BlockSort).
I don't quite follow the example, but what you describe is exactly what I was thinking :). I quite like the block-scope facilities of CUB, but they are sometimes a bit limiting as you need to provide the number of elements at compile time. So it would be quite useful if you could expose an overload of cub::BlockSort that can handle arrays of arbitrary length.
> See cub::Traits<T>::LOWEST_KEY and cub::Traits<T>::MAX_KEY. These are used for this purpose in the cub::DeviceRadixSort implementation:

Perhaps a minor point, but it would be a lot nicer if there were constants of type T, since that's what you need to pass to BlockLoad::Load.
> If that suits your needs, we could expose this algorithm as a block-scope facility (something like cub::BlockSort).

This will be very much appreciated. Currently the agent implementation is a bit too complicated to use.
I have added a routine here: https://github.com/chenxuhao/GraphAIBench/blob/41ec6d85ff9af5acee1980919aff0f8bdd10b3e5/include/sort.cuh#L7
Please let me know if this makes sense to you.
@chenxuhao, I've attached a fixed version of your routine here. Currently, we are not ready to expose it though.
> @chenxuhao, I've attached a fixed version of your routine here. Currently, we are not ready to expose it though.
thanks a lot!
> @chenxuhao, I've attached a fixed version of your routine here. Currently, we are not ready to expose it though.
Also, I wonder if it is possible to create a similar routine, say warp_sort, which does warp-wise sorting instead of CTA-wise? Basically, each warp would work on sorting its own array.
If this is not currently available in CUB, then I suppose what I have to do is create an agent that does this warp-wise?
Hello, @chenxuhao! Does cub::WarpMergeSort work for you? I don't think you need an agent for that. Agents are used to share the implementation of device-scope algorithms in CUB.
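A minimal usage sketch (the kernel name and sizes are made up and untested; each warp of the block sorts its own fixed-size chunk of 128 keys):

#include <cub/cub.cuh>

struct CustomLess
{
    __device__ bool operator()(int lhs, int rhs) const { return lhs < rhs; }
};

// Made-up example: each warp sorts its own chunk of 32 * 4 = 128 keys.
// Launch with 128 threads per block (4 warps).
__global__ void warpSortKernel(int* data)
{
    constexpr int WARP_THREADS     = 32;
    constexpr int ITEMS_PER_THREAD = 4;
    constexpr int WARPS_PER_BLOCK  = 4;

    using WarpMergeSortT = cub::WarpMergeSort<int, ITEMS_PER_THREAD, WARP_THREADS>;
    __shared__ typename WarpMergeSortT::TempStorage temp_storage[WARPS_PER_BLOCK];

    const int warp_id = threadIdx.x / WARP_THREADS;
    const int lane    = threadIdx.x % WARP_THREADS;
    const int base    = (blockIdx.x * WARPS_PER_BLOCK + warp_id) * WARP_THREADS * ITEMS_PER_THREAD;

    // Blocked arrangement: each thread holds ITEMS_PER_THREAD consecutive keys
    int keys[ITEMS_PER_THREAD];
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        keys[i] = data[base + lane * ITEMS_PER_THREAD + i];

    WarpMergeSortT(temp_storage[warp_id]).Sort(keys, CustomLess{});

    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        data[base + lane * ITEMS_PER_THREAD + i] = keys[i];
}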
> Hello, @chenxuhao! Does cub::WarpMergeSort work for you? I don't think you need an agent for that. Agents are used to share the implementation of device-scope algorithms in CUB.
Thanks Georgy. I need warp-level sorting for variable-length arrays, but cub::WarpMergeSort can only do fixed-length sorting, right? This is the same reason we use this agent for CTA-level sorting: BlockMergeSort and BlockRadixSort can only do fixed-length sorting.