
Replace SYCL backend `reduce_by_segment` implementation with reduce-then-scan call

Open mmichel11 opened this issue 4 months ago • 0 comments

Summary

This PR reimplements the SYCL backend's reduce_by_segment as a higher-level call into the reduce-then-scan kernels, using new special-purpose functors to perform the segmented reduction. It is an initial step in porting the implementation to reduce-then-scan; optimizations are likely to follow, and future work may include further changes to the reduce-then-scan kernels themselves.
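For readers unfamiliar with the idea, a segmented reduction can be expressed as a scan over (flag, value) pairs with an associative combine that resets at segment boundaries. The following is a minimal serial sketch in standard C++, not the kernel code from this PR: `FlaggedValue`, `combine`, and `reduce_by_segment_sketch` are hypothetical names chosen for illustration, and the loop stands in for what a parallel scan would compute.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical scan element: a partial result plus a flag recording whether
// a segment head lies inside the range the partial result covers.
template <typename T>
struct FlaggedValue {
    bool head;
    T value;
};

// Associative combine for a segmented inclusive scan: if the right operand
// contains a segment head, the left partial result is discarded.
template <typename T, typename BinaryOp>
FlaggedValue<T> combine(const FlaggedValue<T>& lhs, const FlaggedValue<T>& rhs, BinaryOp op) {
    if (rhs.head)
        return {true, rhs.value};
    return {lhs.head, op(lhs.value, rhs.value)};
}

// Serial stand-in for the scan: reduces each run of equal consecutive keys
// and emits one (key, reduced value) pair per run.
template <typename K, typename T, typename BinaryOp = std::plus<T>>
std::pair<std::vector<K>, std::vector<T>>
reduce_by_segment_sketch(const std::vector<K>& keys, const std::vector<T>& vals, BinaryOp op = {}) {
    std::vector<K> out_keys;
    std::vector<T> out_vals;
    const std::size_t n = keys.size();
    if (n == 0)
        return {out_keys, out_vals};

    FlaggedValue<T> carry{true, vals[0]};  // the first element always starts a segment
    for (std::size_t i = 0; i < n; ++i) {
        if (i > 0) {
            // A key change marks the head of a new segment.
            FlaggedValue<T> cur{keys[i] != keys[i - 1], vals[i]};
            carry = combine(carry, cur, op);
        }
        // The scanned value at the last element of a segment is its reduction.
        const bool last_in_segment = (i + 1 == n) || (keys[i] != keys[i + 1]);
        if (last_in_segment) {
            out_keys.push_back(keys[i]);
            out_vals.push_back(carry.value);
        }
    }
    return {out_keys, out_vals};
}
```

With keys `{1, 1, 2, 2, 2, 3}` and values `{1, 2, 3, 4, 5, 6}`, the sketch yields keys `{1, 2, 3}` and values `{3, 12, 6}`. Because `combine` is associative, the same operator can in principle be handed to a parallel scan, which is the property the functors in this PR rely on.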

Performance improves for all input sizes: roughly 3-5x for small inputs and ~1.25x for very large inputs on the Intel Data Center GPU Max 1550. Please contact me if you would like to see the performance data.

Description of changes

  • The previously handwritten SYCL reduce_by_segment implementation is replaced by a higher-level call to our reduce-then-scan kernels. Several new callback functors for the reduce-then-scan kernel have been added to support this operation.
  • reduce_by_segment.pass was hitting linker crashes because the large number of compiled test cases grew past the maximum size of the binary's data region. To resolve this, SYCL testing of USM device and shared allocations has been trimmed: instead of running every test with both a device and a shared USM allocation, the USM type now alternates from one test to the next.
  • ONEDPL_WORKAROUND_FOR_IGPU_64BIT_REDUCTION has been removed: the SYCL implementation it guarded has been replaced, so we are no longer affected by that issue.
  • The legacy reduce_by_segment implementation is used as a fallback for when the sub-group size, device, and trivial copyability constraints cannot be satisfied.

Future work

Future efforts on reduce_by_segment may build on this implementation and the reduce-then-scan kernels to better handle the first- and last-element cases.

mmichel11, Oct 22 '24 19:10