Add missing hard algorithms
We're missing these algorithms (which look hard):
`partial_sort`, `partial_sort_copy`, `includes`, `nth_element`
Forwarded from http://code.google.com/p/thrust/issues/detail?id=423
NVC++ needs implementations of these algorithms.
Implementing a `partial_sort` algorithm would be one route for RAPIDS libcudf to support "top K" queries as described here: https://github.com/dask-contrib/dask-sql/issues/818. The current implementation of `nsmallest` / `nlargest` performs a full sort of indices, slices those indices to the desired subset, and gathers the results. A partial sort could theoretically reduce the work of the full sort, though achieving that work reduction while still computing in parallel is probably hard.
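For concreteness, here is a minimal Thrust sketch of that full-sort route (the function name and signature are mine, not libcudf's actual API). A `partial_sort` primitive would replace the `sort_by_key` call:

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/gather.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// Hypothetical nsmallest: full sort of indices by key, slice to the
// first k, gather the payload rows.
template <typename Key, typename Payload>
thrust::device_vector<Payload> nsmallest(
    thrust::device_vector<Key> keys,  // by value: sorted in place
    thrust::device_vector<Payload> const& payload, int k)
{
  thrust::device_vector<int> idx(keys.size());
  thrust::sequence(idx.begin(), idx.end());
  thrust::sort_by_key(keys.begin(), keys.end(), idx.begin());  // the full sort
  thrust::device_vector<Payload> out(k);
  thrust::gather(idx.begin(), idx.begin() + k,   // slice of sorted indices
                 payload.begin(), out.begin());  // gather the results
  return out;
}
```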
Also cc-ing @mythrocks for interest in nth_element.
I implemented a version of nth_element a while ago (paper, code) whose primitives would also be relevant for partial_sort. Some notes on how I would approach this:
- Optionally, take a random sample from the input data and sort it (or just use uniform buckets based on the most significant bits, as in radix sort)
- Compute a histogram of the input, with bucket boundaries taken from the sorted sample (or a subset thereof). For integers, radix sort probably already provides the relevant kernel.
- Compute a prefix sum over the histogram counts to get the rank of each bucket boundary in the sorted order of the input.
- Identify the bucket that contains the nth smallest element, collect from the input either just its elements (nth_element) or the elements from it and all smaller buckets (partial_sort), and run the same algorithm again recursively on only these much smaller buckets; a sketch of one level of this loop follows this list. Falling back to sorting all remaining elements will probably become more efficient after only a few levels of recursion.
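To make the outer loop concrete, here is a sketch of the scheme for nth_element built entirely from stock Thrust calls, one level per loop iteration. A real implementation would fuse these steps into the custom kernels discussed below; all names, tuning constants, and the strided (rather than random) sampling are illustrative assumptions:

```cuda
#include <climits>
#include <thrust/binary_search.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/sort.h>

// Predicate selecting the elements of the chosen bucket, i.e. [lo, hi).
struct InBucket {
  long long lo, hi;
  __host__ __device__ bool operator()(int x) const {
    return lo <= (long long)x && (long long)x < hi;
  }
};

int nth_smallest(thrust::device_vector<int> in, int n)  // in: consumed
{
  using namespace thrust::placeholders;
  const int kSamples = 1024;     // splitter count: a tuning assumption
  const int kCutoff  = 1 << 16;  // below this, sorting outright likely wins

  while ((int)in.size() > kCutoff) {
    // 1. Take a sample (strided here; a real implementation should
    //    randomize) and sort it to obtain the splitters.
    int stride = (int)in.size() / kSamples;
    auto strided = thrust::make_permutation_iterator(
        in.begin(), thrust::make_transform_iterator(
                        thrust::make_counting_iterator(0), _1 * stride));
    thrust::device_vector<int> splitters(strided, strided + kSamples);
    thrust::sort(splitters.begin(), splitters.end());

    // 2. Classify every element into a bucket via vectorized binary search.
    thrust::device_vector<int> bucket(in.size());
    thrust::upper_bound(splitters.begin(), splitters.end(),
                        in.begin(), in.end(), bucket.begin());

    // 3. Sorting the bucket ids and searching for the bucket boundaries
    //    yields the *cumulative* histogram (histogram + prefix sum in one).
    thrust::sort(bucket.begin(), bucket.end());
    thrust::device_vector<int> cum(kSamples + 1);
    thrust::upper_bound(bucket.begin(), bucket.end(),
                        thrust::make_counting_iterator(0),
                        thrust::make_counting_iterator(kSamples + 1),
                        cum.begin());

    // 4. The nth smallest element lives in the first bucket whose
    //    cumulative count exceeds n; keep only that bucket and adjust n.
    thrust::host_vector<int> h_cum = cum;
    int b = (int)(thrust::upper_bound(h_cum.begin(), h_cum.end(), n) -
                  h_cum.begin());
    long long lo = b == 0        ? LLONG_MIN : (long long)splitters[b - 1];
    long long hi = b == kSamples ? LLONG_MAX : (long long)splitters[b];
    int before   = b == 0 ? 0 : h_cum[b - 1];

    thrust::device_vector<int> next(h_cum[b] - before);
    thrust::copy_if(in.begin(), in.end(), next.begin(), InBucket{lo, hi});
    if (next.size() == in.size()) break;  // no progress (e.g. all-equal keys)
    n -= before;
    in.swap(next);
  }
  // Fall back to a full sort once the remainder is small (or we stalled).
  thrust::sort(in.begin(), in.end());
  return in[n];
}
```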
On the individual kernels:
- For determining the histogram bucket (if you don't use the radix-sort approach), storing the splitters as a search tree in Eytzinger layout in shared memory lets you implement a branch-free classification function like in super-scalar sample sort; see the first sketch after this list. Theoretically, the classification can even be unrolled to interleave multiple comparisons, but that didn't show major benefits in my past experiments.
- If determining the histogram bucket is costly, we can memoize the bucket index of each element as a byte and reuse it in the subsequent collection kernel.
- Doing a two-level histogram accumulation (first in thread blocks, then across the grid), followed by a prefix sum over each entry of the histogram, allows us to compute the output range for each histogram bucket and each thread block in parallel. This again helps the subsequent collection kernel, which then only needs shared-memory atomics; see the second sketch after this list.
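A sketch of the branch-free classification and the byte memoization from the first two points, assuming integer keys and a power-of-two bucket count (all names are mine):

```cuda
// NUM_BUCKETS must be a power of two (and <= 256 so a bucket id fits in a
// byte); the NUM_BUCKETS - 1 splitters are stored at indices
// 1..NUM_BUCKETS-1 in breadth-first (Eytzinger) order, index 0 unused.
template <int NUM_BUCKETS>
__device__ __forceinline__ int classify(int x, const int* tree)
{
  int j = 1;  // start at the root
  #pragma unroll
  for (int n = NUM_BUCKETS; n > 1; n >>= 1)
    j = 2 * j + (x > tree[j]);  // go right iff x is above the splitter
  return j - NUM_BUCKETS;       // leaves map to bucket ids [0, NUM_BUCKETS)
}

template <int NUM_BUCKETS>
__global__ void classify_kernel(const int* in, int num_items,
                                const int* splitter_tree,  // Eytzinger layout
                                unsigned char* bucket_of)  // memoized ids
{
  __shared__ int tree[NUM_BUCKETS];
  for (int i = threadIdx.x; i < NUM_BUCKETS; i += blockDim.x)
    tree[i] = splitter_tree[i];
  __syncthreads();

  // Grid-stride loop; each bucket id is stored as one byte so the
  // subsequent collection kernel can reuse it, as suggested above.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num_items;
       i += gridDim.x * blockDim.x)
    bucket_of[i] = (unsigned char)classify<NUM_BUCKETS>(in[i], tree);
}
```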
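And a sketch of the two-level histogram from the last point. Writing the per-block counts bucket-major means a single exclusive prefix sum over the whole array yields the global output offset of every (bucket, block) pair; again, all names are illustrative:

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>

template <int NUM_BUCKETS>
__global__ void block_histogram(const unsigned char* bucket_of, int num_items,
                                int* counts)  // size NUM_BUCKETS * gridDim.x
{
  __shared__ int hist[NUM_BUCKETS];
  for (int b = threadIdx.x; b < NUM_BUCKETS; b += blockDim.x) hist[b] = 0;
  __syncthreads();

  // First level: per-block histogram using shared-memory atomics only.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num_items;
       i += gridDim.x * blockDim.x)
    atomicAdd(&hist[bucket_of[i]], 1);
  __syncthreads();

  // Second level: publish the block's counts in bucket-major order, so
  // entry [b * gridDim.x + block] holds bucket b's count in that block.
  for (int b = threadIdx.x; b < NUM_BUCKETS; b += blockDim.x)
    counts[b * gridDim.x + blockIdx.x] = hist[b];
}

// One exclusive scan over the bucket-major counts then gives the global
// write offset of every (bucket, block) pair for the collection kernel:
//   thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());
```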
For arbitrarily complex user-provided types or comparators, most of the optimizations here will probably not work well, since they rely heavily on fast shared memory. The randomized nature of the algorithm is also a bit of a challenge. Leaning even further into randomization, we can pick only the two splitters from the sample that enclose the quantile we are looking for, and copy only the elements between them via a tripartition; see the sketch below. The resulting algorithm may be much faster, but it also has a worse worst-case complexity. We could of course always fall back to a full sorting algorithm if the recursion doesn't shrink the input sufficiently.
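A rough sketch of that two-splitter narrowing step, written as two Thrust passes for clarity (a fused tripartition kernel would do it in one pass); the names and the retry contract are assumptions:

```cuda
#include <thrust/copy.h>
#include <thrust/count.h>
#include <thrust/device_vector.h>

struct Below {
  int lo;
  __host__ __device__ bool operator()(int x) const { return x < lo; }
};
struct Between {
  int lo, hi;
  __host__ __device__ bool operator()(int x) const { return lo <= x && x < hi; }
};

// Narrow the search to the elements between the two splitters. On return,
// `ok` says whether [lo, hi) really enclosed rank n; if not, the caller
// must retry with wider splitters or fall back to a full sort, as noted
// above. `n` is adjusted to the rank within the returned range.
thrust::device_vector<int> narrow(const thrust::device_vector<int>& in,
                                  int lo, int hi, int& n, bool& ok)
{
  int below  = (int)thrust::count_if(in.begin(), in.end(), Below{lo});
  int middle = (int)thrust::count_if(in.begin(), in.end(), Between{lo, hi});
  ok = below <= n && n < below + middle;

  thrust::device_vector<int> mid(middle);  // middle part of the tripartition
  thrust::copy_if(in.begin(), in.end(), mid.begin(), Between{lo, hi});
  n -= below;
  return mid;
}
```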