oneDPL
Replace SYCL backend `reduce_by_segment` implementation with reduce-then-scan call
Summary
This PR implements a SYCL backend `reduce_by_segment` by using higher-level calls to reduce-then-scan, along with new specialty functors, to achieve a segmented reduction. This PR is an initial step in porting the implementation to reduce-then-scan, with optimization likely to follow. Future efforts may include additional modifications to the reduce-then-scan kernels.
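To illustrate the general idea of expressing a segmented reduction through a scan-style operator, here is a minimal sequential sketch in plain C++. It is not the oneDPL implementation or its functors: `FlaggedValue`, `combine`, and `reduce_by_segment_reference` are hypothetical names chosen for this example, and addition on `int` stands in for a user-supplied binary operator.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper type (not a oneDPL identifier): a partial result
// tagged with whether the range it covers begins a new segment.
struct FlaggedValue {
    bool starts_segment;
    int value;
};

// Associative combine used by segmented scans: when the right operand
// starts a new segment, the accumulation from the left is discarded.
inline FlaggedValue combine(const FlaggedValue& lhs, const FlaggedValue& rhs) {
    if (rhs.starts_segment) {
        return rhs;
    }
    return FlaggedValue{lhs.starts_segment, lhs.value + rhs.value};
}

// Sequential reference: consecutive equal keys form one segment. Returns one
// key per segment and the sum of that segment's values.
std::pair<std::vector<int>, std::vector<int>>
reduce_by_segment_reference(const std::vector<int>& keys,
                            const std::vector<int>& values) {
    std::vector<int> out_keys;
    std::vector<int> out_values;
    const std::size_t n = keys.size();
    if (n == 0) {
        return {out_keys, out_values};
    }

    // The first element always starts a segment.
    FlaggedValue running{true, values[0]};
    for (std::size_t i = 1; i < n; ++i) {
        const bool is_head = keys[i] != keys[i - 1];
        if (is_head) {
            // The previous element closed a segment: emit its key and total.
            out_keys.push_back(keys[i - 1]);
            out_values.push_back(running.value);
        }
        running = combine(running, FlaggedValue{is_head, values[i]});
    }

    // The last element always closes a segment.
    out_keys.push_back(keys[n - 1]);
    out_values.push_back(running.value);
    return {out_keys, out_values};
}
```

Because `combine` is associative, the same operator can in principle be applied to per-work-item partial results in a parallel reduce phase and then in a scan phase, which is the property a reduce-then-scan formulation relies on. The first and last elements need explicit handling here, since they start and close segments unconditionally.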
Performance improves for all input sizes: on GPU Series Max 1550, we see 3-5x speedups for small inputs and about 1.25x for very large inputs. Please contact me if you would like to see performance data.
Description of changes
- The SYCL `reduce_by_segment` implementation that was previously handwritten is replaced by a higher-level call to our reduce-then-scan kernels. Several new callback functors for the reduce-then-scan kernel have been added to achieve this operation.
- `reduce_by_segment.pass` was encountering linker crashes because the large number of compiled test cases grew past the maximum size of the binary's data region. SYCL testing has been trimmed down with regard to USM device and shared testing, which resolves this issue. Instead of running each test with both a device and a shared USM allocation, every other test switches the USM type.
- `ONEDPL_WORKAROUND_FOR_IGPU_64BIT_REDUCTION` has been removed, as the SYCL implementation has been replaced and we are no longer impacted by this issue.
- The legacy `reduce_by_segment` implementation is used as a fallback when the sub-group size, device, and trivial copyability constraints cannot be satisfied.
Future work
Future efforts on `reduce_by_segment` may build on top of this implementation and the reduce-then-scan kernels to better handle first and last element cases.