thrust
thrust copied to clipboard
Excessive loop unrolling and forced inlining causes performance issues in sort with custom comparators
In the libcudf component of RAPIDS we have a sort API that calls thrust::sort
and thrust::stable_sort
using a custom comparator for columns of data.
Reference libcudf calling sort/stable_sort: https://github.com/rapidsai/cudf/blob/branch-0.17/cpp/src/sort/sort_impl.cuh
As we add more column data types, the row-comparator gets a little more complex and the compile time and code size has increased dramatically. The problem appears to be the aggressive inlining of calls to the comparator as can be seen in this simple godbolt example: https://godbolt.org/z/hhachG (Note: I used std::sqrt()
here only to illustrate how many times the comparator is inlined).
Tracing through the source I found some #pragma unroll
statements in thrust/system/cuda/sort.h
like the following: https://github.com/NVIDIA/thrust/blob/main/thrust/system/cuda/detail/sort.h#L111-L116
#pragma unroll
for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
{
bool p = (keys2_beg < keys2_end) &&
((keys1_beg >= keys1_end) ||
compare_op(key2,key1));
...
I believe the ITEMS_PER_THREAD
value is ~10 normally but there are a couple other unrolls including this one: https://github.com/NVIDIA/thrust/blob/main/thrust/system/cuda/detail/sort.h#L353-L354
#pragma unroll
for (int coop = 2; coop <= BLOCK_THREADS; coop *= 2)
where BLOCK_THREADS
I believe is > 100 (e.g. 128, 256, or 512 as far as I can tell).
In general, small comparators here have no issue and these #pragma unroll
statements likely provide a performance boost. But as the comparator size increases, the code size increases and performance actually starts to suffer.
Using the godbolt example above, I created 10 different programs each with slightly bigger comparators (just adding more std::sqrt()
calls). I measured compile time, file size and execution time with original sort.h
against a modified sort.h
with some of unroll statements disabled (on just for-loops that called the comparator).
The execution time was measured using nsys
and captured the call to thrust::sort
followed by a call to cudaStreamSynchronize(0)
.
nvtxRangePushA("mysort");
thrust::sort( thrust::device, d_vin.begin(), d_vin.end(), comparator{});
cudaStreamSynchronize(0);
nvtxRangePop();
Note that simple comparators will run faster (bottom-left of the last graph above) but even just a few extra statements in the comparator can cause it to run slower with unrolled for-loops.
Thanks for the detailed write up! I'll be expanding our benchmarking / performance regression suite over the next few releases. It sounds like we should benchmark the sort algorithms with a variety of comparators so we can tune this and monitor for regressions afterwards.
We need to benchmark removing __forceinline__
and #pragma unroll
from everywhere in CUB.
For posterity, I want to document that there is an easy way to work around this problem.
You can annotate your custom comparator with a __noinline__
which will circumvent CUB/Thrust's overuse of unrolling/forceinling.
From the original example,
struct comparator {
__noinline__ __device__ bool operator()(double lhs, double rhs)
{
return std::sqrt(lhs+rhs) < 1.0;
}
};
https://godbolt.org/z/Txe4Penc8
We should still explore more robust options for allowing users to configure and control this behavior, but I think this should unblock many who run into similar problems.