oneDPL icon indicating copy to clipboard operation
oneDPL copied to clipboard

Tune vectorized binary_search, lower_bound, and upper_bound algorithms

Open mmichel11 opened this issue 11 months ago • 3 comments

This PR introduces some additional tuning for the vectorized search algorithms (binary_search, lower_bound, and upper_bound) where a series of searches across a provided set of search keys are performed on a single input buffer.

The improvements are explained below:

  1. In the current implementations, __pstl_lower_bound / __pstl_upper_bound have been passed a 64-bit size type to compute indicies in its binary search. For inputs with a large number of search keys, this is a bottleneck in performance and rooflines have shown that this kernel is bound by 64-bit arithmetic. This PR introduces a 2-kernel solution where a 32-bit unsigned type is used for index calculation if there are less than 2^32 search keys with a fallback to 64-bit offset calculation otherwise. This results in a performance improvement across these algorithms of ~1.3x on GPU Series Max 1550 for large inputs.
  2. The second improvement is in the implementation of __pstl_lower_bound. Simplifying the loop body by checking descending powers-of-2 as opposed to continuously dividing the input size by 2 reduces the amount of total arithmetic in the compiled code and results in a performance improvement of 10-15% for large inputs on top of the changes in 1. This improvement requires the size type to be unsigned and is implemented as a new utility function: __shars_lower_bound.

mmichel11 avatar Mar 13 '24 21:03 mmichel11