oneDPL

Use ND-range kernel in parallel_for kernel implementation to improve performance

Open mmichel11 opened this issue 1 year ago • 3 comments

We have identified performance issues in our parallel_for kernel on Intel® Data Center GPU Max for input sizes that are not powers of two. The root cause is our implementation's use of a SYCL basic parallel kernel.

This PR switches the parallel_for pattern to an explicit ND-range kernel, for GPU devices only. The kernel uses a grid-strided memory access pattern, which resolves the identified performance issues. Performance improvements have also been observed for power-of-two input sizes.
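For readers unfamiliar with the term, here is a minimal host-only sketch of a grid-strided index mapping. It is not the kernel from this PR: it emulates work-items with an ordinary loop so it compiles without a SYCL toolchain, and the names `grid_stride_for`, `visit_counts`, and `global_size` are hypothetical. In a real ND-range kernel the outer loop would not exist; each work-item would take its starting index from its global id and stride by the total number of work-items launched.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only emulation of a grid-strided loop (illustrative, not oneDPL code).
// n           : number of elements to process
// global_size : number of work-items launched (hypothetical parameter; in an
//               ND-range kernel this is the global range, a multiple of the
//               work-group size that does not have to divide n)
// f(gid, i)   : called once for every (work-item, element index) pair
template <typename F>
void grid_stride_for(std::size_t n, std::size_t global_size, F f)
{
    assert(global_size > 0);
    // Outer loop stands in for the work-items that run concurrently on a GPU.
    for (std::size_t gid = 0; gid < global_size; ++gid)
    {
        // Grid-stride loop: work-item gid handles gid, gid + global_size, ...
        // Neighbouring work-items touch neighbouring elements on each pass.
        for (std::size_t i = gid; i < n; i += global_size)
            f(gid, i);
    }
}

// Helper for checking coverage: counts how often each index is visited.
std::vector<int> visit_counts(std::size_t n, std::size_t global_size)
{
    std::vector<int> counts(n, 0);
    grid_stride_for(n, global_size,
                    [&](std::size_t, std::size_t i) { ++counts[i]; });
    return counts;
}
```

Calling `visit_counts(1000, 256)` shows that every index is visited exactly once even though 1000 is not a multiple of 256: the loop bound `i < n` handles the remainder, so no size-dependent special case is needed.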

Algorithms that internally rely on this kernel (e.g., for_each, copy) should see improvements on GPUs.

mmichel11, Jun 20 '23 16:06