oneDPL
oneDPL copied to clipboard
Tune vectorized binary_search, lower_bound, and upper_bound algorithms
This PR introduces some additional tuning for the vectorized search algorithms (binary_search, lower_bound, and upper_bound) where a series of searches across a provided set of search keys are performed on a single input buffer.
The improvements are explained below:
- In the current implementations,
__pstl_lower_bound
/__pstl_upper_bound
have been passed a 64-bit size type to compute indicies in its binary search. For inputs with a large number of search keys, this is a bottleneck in performance and rooflines have shown that this kernel is bound by 64-bit arithmetic. This PR introduces a 2-kernel solution where a 32-bit unsigned type is used for index calculation if there are less than2^32
search keys with a fallback to 64-bit offset calculation otherwise. This results in a performance improvement across these algorithms of~1.3x
on GPU Series Max 1550 for large inputs. - The second improvement is in the implementation of
__pstl_lower_bound
. Simplifying the loop body by checking descending powers-of-2 as opposed to continuously dividing the input size by2
reduces the amount of total arithmetic in the compiled code and results in a performance improvement of10-15%
for large inputs on top of the changes in1
. This improvement requires the size type to be unsigned and is implemented as a new utility function:__shars_lower_bound
.