oneDPL
Enable vectorized global loads for the reduction algorithms
Vectorization is performance critical on SIMD architectures. This patch enables vectorization by unrolling vector-size-wide loop iterations for both coalesced loads (commutative algorithms) and consecutive loads (non-commutative algorithms). Coalesced loads then load vectors of consecutive elements. This change improves coalesced loads on Intel SIMD GPUs without decreasing throughput on SIMT GPUs, so coalesced loads are enabled on SPIR-V backends as well. min_element and max_element continue to use consecutive loads on SPIR-V backends due to the performance penalty of the index check required when using coalesced global loads.
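To make the two access patterns concrete, here is a minimal host-side sketch (plain C++, not oneDPL's actual SYCL kernel; all names are illustrative): in the consecutive pattern each work-item reads its own contiguous chunk, while in the coalesced pattern neighboring work-items read neighboring vectors of consecutive elements on each outer iteration, so their global loads coalesce. For a commutative reduction both orders yield the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Consecutive pattern: work-item i reads the contiguous chunk
// [i * chunk, (i + 1) * chunk). Assumes in.size() is divisible by n_items.
int reduce_consecutive(const std::vector<int>& in, std::size_t n_items)
{
    std::size_t chunk = in.size() / n_items;
    int total = 0;
    for (std::size_t item = 0; item < n_items; ++item)
        for (std::size_t j = 0; j < chunk; ++j)
            total += in[item * chunk + j];
    return total;
}

// Coalesced pattern: on outer iteration `iter`, work-item i loads the
// vector of `vec_size` consecutive elements starting at
// (iter * n_items + i) * vec_size, so adjacent items touch adjacent vectors.
// Assumes in.size() is divisible by vec_size.
int reduce_coalesced(const std::vector<int>& in, std::size_t n_items, std::size_t vec_size)
{
    std::size_t n_vecs = in.size() / vec_size;
    int total = 0;
    for (std::size_t iter = 0; iter * n_items < n_vecs; ++iter)
        for (std::size_t item = 0; item < n_items; ++item)
        {
            std::size_t vec = iter * n_items + item;
            if (vec >= n_vecs)
                break; // boundary check for a partial last iteration
            for (std::size_t j = 0; j < vec_size; ++j)
                total += in[vec * vec_size + j];
        }
    return total;
}
```

The coalesced variant visits elements in a different order, which is fine for commutative operators; for min_element/max_element the extra per-vector index check in this pattern is exactly the cost the description mentions.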
Secondly, the vectorization enables a dynamic number of elements to be processed per work-item, so launch-parameter tuning with compile-time constants is no longer needed. This reduces the number of template instantiations from 13 to 3, which significantly improves compile times (e.g., half the time for sycl_iterator_reduce.pass).
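The instantiation saving comes from moving a tuning knob from a template parameter to a runtime argument. A hypothetical sketch (names are illustrative, not oneDPL's internals): with a compile-time elements-per-item constant, every candidate value is a distinct instantiation, whereas a runtime parameter needs only one function per value type.

```cpp
#include <cassert>
#include <cstddef>

// Compile-time tuning: each _ElemsPerItem value instantiates a new function,
// so trying N candidate values costs N instantiations per kernel.
template <std::size_t _ElemsPerItem>
std::size_t static_iters(std::size_t n, std::size_t n_items)
{
    return (n + n_items * _ElemsPerItem - 1) / (n_items * _ElemsPerItem);
}

// Runtime tuning: one instantiation covers every elements-per-item choice.
std::size_t dynamic_iters(std::size_t n, std::size_t n_items, std::size_t elems_per_item)
{
    return (n + n_items * elems_per_item - 1) / (n_items * elems_per_item);
}
```

Both compute the same iteration count; the dynamic version simply trades a compile-time constant for a kernel argument, which is what collapses the instantiation count.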
Thirdly, branch divergence is minimized by adding a flag indicating whether the work-group can process full sequences of the input array. If so, branching within the inner kernel can be removed. If not, all work-items in a group follow the same boundary-checked implementation.
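The full-sequence flag can be sketched as follows (a minimal host-side illustration, not the actual kernel): the bounds check is hoisted out of the inner loop when the group is known to cover a full tile, and otherwise every work-item uniformly takes the boundary-checked path, so work-items within a group never diverge.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: `is_full` is computed once per work-group.
// true  -> fast path with no per-element bounds check;
// false -> boundary-checked path taken uniformly by the whole group.
int sum_tile(const std::vector<int>& in, std::size_t start, std::size_t tile, bool is_full)
{
    int acc = 0;
    if (is_full)
    {
        // Fast path: the compiler can vectorize freely, no branch in the loop.
        for (std::size_t j = 0; j < tile; ++j)
            acc += in[start + j];
    }
    else
    {
        // Slow path: every element is bounds-checked, but since all
        // work-items in the group run this same code, there is no divergence.
        for (std::size_t j = 0; j < tile; ++j)
            if (start + j < in.size())
                acc += in[start + j];
    }
    return acc;
}
```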
Given the complexity of the indexing changes and the limited user benefits, I think it's best to postpone this to the 2022.7 release.
@julianmi, what do you think: should we introduce some type for the union
union __storage
{
    _Tp __v;
    __storage() {}
};
?
I've added a union type to reduce the code duplication.
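For context on why a union helps here, a small sketch of the idea (illustrative names, not oneDPL's internals): wrapping _Tp in a union suppresses its default construction, so per-work-item accumulator storage can be declared before the first loaded value is available, even when _Tp has no default constructor. The value is then constructed in place once it is known.

```cpp
#include <cassert>
#include <new>

// A type with no default constructor, standing in for a user value type.
struct _NoDefault
{
    int __x;
    _NoDefault() = delete;
    explicit _NoDefault(int __v) : __x(__v) {}
};

// The union's empty constructor leaves __v unconstructed; the user
// placement-news the value when it is first available and destroys it
// explicitly if _Tp is non-trivially destructible.
template <typename _Tp>
union __storage
{
    _Tp __v;
    __storage() {}  // does not construct __v
    ~__storage() {} // member lifetime is managed manually
};

int demo()
{
    __storage<_NoDefault> __s;       // legal even though _NoDefault() is deleted
    ::new (&__s.__v) _NoDefault(42); // construct in place when the value is ready
    int __r = __s.__v.__x;
    __s.__v.~_NoDefault();           // explicit destruction
    return __r;
}
```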