Enabling reduce_then_scan for "Set" family of scan APIs
Enabling reduce then scan algorithm to be used for set family APIs:
set_intersection
set_union
set_difference
set_symmetric_difference
Adds high-level helpers required to convert set family to use reduce_then_scan on GPU with size 32 subgroups.
Also makes set consistent with the other scan algorithms to select algorithm in the __parallel_* level of function call.
This provides significant performance improvements, especially at small sizes of n. At larger sizes of n, this is an improvement over the existing algorithm, but not by a lot.
There is still significant room for improvement here, this algorithm is very inefficient.
Three future things to try (from easiest to hardest) to improve the set family within the reduce then scan alg:
- We could explore our alternative
__pstl_lower_bound,__shars_lower_boundto see if it provides benefit. - We could propagate a "hint" minimum search index between blocks (this provides a lower bound and better "blocks" the second set, in addition to the blocking we do inherently on the first set)
- We could propagate a similar "hint" minimum search index subgroup from one iteration to the next. (This prevents redundant computation for consecutive iterations for a subgroup. Since the sets are ordered, subsequent searches will never go backward as long as the element we are searching's index in the first increases, which it does).
Right now, we are wasting lots of memory accesses and comparisons on sections of the second set which should be able to ruled out from knowledge we have access to at the time we are making the search (in the "Reduce" predicate). We also lose the "blocking" cache advantage of reduce_then_scan, because the second set always searches from the start of the list.
I realize I never expanded the set of reviewers from a small group, though I know we agreed not to attempt to spend too much effort on perfecting this approach (for performance) before merging. I've now added a few more for visibility.