oneDPL
Use ND-range kernel in parallel_for kernel implementation to improve performance
Performance issues have been identified within our `parallel_for` kernel on Intel® Data Center GPU Max for input sizes that are not powers of two. The root cause is the use of a SYCL basic parallel kernel in our implementation.
This PR switches to an explicit nd-range kernel within the `parallel_for` pattern, for GPU devices only. The kernel uses a grid-strided memory access pattern, which resolves the identified performance issues. Performance improvements have also been observed for input sizes that are powers of two.
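To illustrate what a grid-strided access pattern means, here is a minimal host-side sketch in plain C++. It is not the oneDPL implementation: it only emulates the index mapping, and the names `grid_stride_for`, `visit_counts`, `num_groups`, and `local_size` are hypothetical. In a real SYCL nd-range kernel, the work-item index and the stride would typically come from `sycl::nd_item<1>::get_global_id(0)` and `get_global_range(0)`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a grid-strided loop (illustrative assumption, not
// oneDPL code). Each emulated work-item starts at its own global id and then
// advances by the total number of launched work-items, so a fixed-size launch
// can cover any input size n, including sizes that are not powers of two.
template <typename F>
void grid_stride_for(std::size_t n, std::size_t num_groups, std::size_t local_size, F f)
{
    assert(num_groups > 0 && local_size > 0);
    // Total number of work-items in the emulated nd-range.
    const std::size_t global_size = num_groups * local_size;

    // The outer loop stands in for the work-items the device would run in parallel.
    for (std::size_t global_id = 0; global_id < global_size; ++global_id)
    {
        // Grid-stride loop: the stride equals the global range.
        for (std::size_t i = global_id; i < n; i += global_size)
            f(i);
    }
}

// Returns how many times each index in [0, n) was visited.
std::vector<int> visit_counts(std::size_t n, std::size_t num_groups, std::size_t local_size)
{
    std::vector<int> counts(n, 0);
    grid_stride_for(n, num_groups, local_size, [&](std::size_t i) { ++counts[i]; });
    return counts;
}
```

Calling `visit_counts(1000, 4, 64)` emulates 256 work-items covering 1000 elements. Every index is visited exactly once, and at each step of the inner loop neighbouring work-items touch neighbouring elements.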
Algorithms that internally rely on this kernel should see improvements on GPUs (e.g. `for_each`, `copy`, etc.).