Refine fallback kernel for segmented sort

Open gevtushenko opened this issue 3 years ago • 1 comments

Currently, cub::DeviceSegmentedSort has a fallback kernel, that apply different algorithms for different segment sizes. In particular, medium-size segments are sorted by merge sort. If segment doesn't fit into registers, it's sorted by in-shared-memory block radix sort. Otherwise, expensive in-global-memory radix sort is applied.

The suggestion was made, that we have to add small-segment branch and potentially experiment with refining the granularity at which we assign threads to segments at fallback kernel.

Jul 27 '22 23:07 gevtushenko

Here is analysis that kicked up this issue:

TEMS_PER_MEDIUM_THREAD impacts performance in a couple of ways.

if problem size is less than WARP_SIZE * ITEMS_PER_MEDIUM_THREAD the algorithm uses a register based merge sort algorithm. Once we exceed that we switch to the large policy. At the current thresholds it seems like we might switch too early. For example you see a huge jump when we cross over: 50 350 0.007516 50 400 0.017879
ITEMS_PER_MEDIUM_THREAD controls how many items each thread has. The merge sort is in-register and the full register set get's sorted even if the segment size per thread is smaller than ITEMS_PER_MEDIUM_THREAD. Having this number smaller makes the merge sort faster but also impacts the cross over point at which we choose to switch to radix sort.

Thus we have 2 competing goals with this item. 1) increase the cross over point to avoid an expensive radix sort on small problem sizes 2) decrease items per thread to increase merge sort size. The current design does not let us optimize both of these since one variable controls each decision.

Could this be fixed by adding runtime checks when num_items < ITEMS_PER_TILE to reduce the number of merge steps?

Jul 27 '22 23:07 luitjens