thrust icon indicating copy to clipboard operation
thrust copied to clipboard

Excessive loop unrolling and forced inlining causes performance issues in sort with custom comparators

Open davidwendt opened this issue 3 years ago • 3 comments

In the libcudf component of RAPIDS we have a sort API that calls thrust::sort and thrust::stable_sort using a custom comparator for columns of data. Reference libcudf calling sort/stable_sort: https://github.com/rapidsai/cudf/blob/branch-0.17/cpp/src/sort/sort_impl.cuh

As we add more column data types, the row-comparator gets a little more complex and the compile time and code size has increased dramatically. The problem appears to be the aggressive inlining of calls to the comparator as can be seen in this simple godbolt example: https://godbolt.org/z/hhachG (Note: I used std::sqrt() here only to illustrate how many times the comparator is inlined).

Tracing through the source I found some #pragma unroll statements in thrust/system/cuda/sort.h like the following: https://github.com/NVIDIA/thrust/blob/main/thrust/system/cuda/detail/sort.h#L111-L116

#pragma unroll
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
    {
      bool p = (keys2_beg < keys2_end) &&
               ((keys1_beg >= keys1_end) ||
                compare_op(key2,key1));
...

I believe the ITEMS_PER_THREAD value is ~10 normally but there are a couple other unrolls including this one: https://github.com/NVIDIA/thrust/blob/main/thrust/system/cuda/detail/sort.h#L353-L354

#pragma unroll
        for (int coop = 2; coop <= BLOCK_THREADS; coop *= 2)

where BLOCK_THREADS I believe is > 100 (e.g. 128, 256, or 512 as far as I can tell).

In general, small comparators here have no issue and these #pragma unroll statements likely provide a performance boost. But as the comparator size increases, the code size increases and performance actually starts to suffer.

Using the godbolt example above, I created 10 different programs each with slightly bigger comparators (just adding more std::sqrt() calls). I measured compile time, file size and execution time with original sort.h against a modified sort.h with some of unroll statements disabled (on just for-loops that called the comparator).

unrolled_compile_time

unrolled_file_size

unrolled_measured

The execution time was measured using nsys and captured the call to thrust::sort followed by a call to cudaStreamSynchronize(0).

  nvtxRangePushA("mysort");
  thrust::sort( thrust::device, d_vin.begin(), d_vin.end(), comparator{});
  cudaStreamSynchronize(0);
  nvtxRangePop(); 

Note that simple comparators will run faster (bottom-left of the last graph above) but even just a few extra statements in the comparator can cause it to run slower with unrolled for-loops.

davidwendt avatar Nov 20 '20 21:11 davidwendt

Thanks for the detailed write up! I'll be expanding our benchmarking / performance regression suite over the next few releases. It sounds like we should benchmark the sort algorithms with a variety of comparators so we can tune this and monitor for regressions afterwards.

alliepiper avatar Nov 20 '20 21:11 alliepiper

We need to benchmark removing __forceinline__ and #pragma unroll from everywhere in CUB.

brycelelbach avatar Dec 04 '20 22:12 brycelelbach

For posterity, I want to document that there is an easy way to work around this problem.

You can annotate your custom comparator with a __noinline__ which will circumvent CUB/Thrust's overuse of unrolling/forceinling.

From the original example,

struct comparator {
  __noinline__ __device__ bool operator()(double lhs, double rhs)
  {
    return std::sqrt(lhs+rhs) < 1.0;
  }
};

https://godbolt.org/z/Txe4Penc8

We should still explore more robust options for allowing users to configure and control this behavior, but I think this should unblock many who run into similar problems.

jrhemstad avatar Jun 23 '22 17:06 jrhemstad