oneDPL
Enable vectorized global loads for the reduction algorithms
Vectorization is performance critical on SIMD architectures. This patch enables vectorization by unrolling vector-size-wide loop iterations for both coalesced loads (commutative algorithms) and consecutive loads (non-commutative algorithms). Coalesced loads then load vectors of consecutive elements. This change improves coalesced loads on Intel SIMD GPUs without decreasing throughput on SIMT GPUs, so coalesced loads are enabled on SPIR-V backends as well. min_element and max_element continue to use consecutive loads on SPIR-V backends due to the performance penalty of the index check required when using coalesced global loads.
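To make the two access patterns concrete, here is a minimal host-side sketch (plain C++, not oneDPL's actual SYCL kernel; all names are illustrative): in the consecutive pattern each work-item reads its own contiguous chunk, while in the coalesced pattern neighboring work-items read neighboring vectors of consecutive elements on each outer iteration, so their global loads coalesce. For a commutative reduction both orders yield the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Consecutive pattern: work-item i reads the contiguous chunk
// [i * chunk, (i + 1) * chunk). Assumes in.size() is divisible by n_items.
int reduce_consecutive(const std::vector<int>& in, std::size_t n_items)
{
    std::size_t chunk = in.size() / n_items;
    int total = 0;
    for (std::size_t item = 0; item < n_items; ++item)
        for (std::size_t j = 0; j < chunk; ++j)
            total += in[item * chunk + j];
    return total;
}

// Coalesced pattern: on outer iteration `iter`, work-item i loads the
// vector of `vec_size` consecutive elements starting at
// (iter * n_items + i) * vec_size, so adjacent items touch adjacent vectors.
// Assumes in.size() is divisible by vec_size.
int reduce_coalesced(const std::vector<int>& in, std::size_t n_items, std::size_t vec_size)
{
    std::size_t n_vecs = in.size() / vec_size;
    int total = 0;
    for (std::size_t iter = 0; iter * n_items < n_vecs; ++iter)
        for (std::size_t item = 0; item < n_items; ++item)
        {
            std::size_t vec = iter * n_items + item;
            if (vec >= n_vecs)
                break; // boundary check for a partial last iteration
            for (std::size_t j = 0; j < vec_size; ++j)
                total += in[vec * vec_size + j];
        }
    return total;
}
```

The coalesced variant visits elements in a different order, which is fine for commutative operators; for min_element/max_element the extra per-vector index check in this pattern is exactly the cost the description mentions.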
Secondly, the vectorization enables a dynamic number of elements to be processed per work-item, so launch-parameter tuning with compile-time constants is no longer needed. This reduces the number of template instantiations from 13 to 3, which significantly improves compile times (e.g., half the time for sycl_iterator_reduce.pass).
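The instantiation saving comes from moving a tuning knob from a template parameter to a runtime argument. A hypothetical sketch (names are illustrative, not oneDPL's internals): with a compile-time elements-per-item constant, every candidate value is a distinct instantiation, whereas a runtime parameter needs only one function per value type.

```cpp
#include <cassert>
#include <cstddef>

// Compile-time tuning: each _ElemsPerItem value instantiates a new function,
// so trying N candidate values costs N instantiations per kernel.
template <std::size_t _ElemsPerItem>
std::size_t static_iters(std::size_t n, std::size_t n_items)
{
    return (n + n_items * _ElemsPerItem - 1) / (n_items * _ElemsPerItem);
}

// Runtime tuning: one instantiation covers every elements-per-item choice.
std::size_t dynamic_iters(std::size_t n, std::size_t n_items, std::size_t elems_per_item)
{
    return (n + n_items * elems_per_item - 1) / (n_items * elems_per_item);
}
```

Both compute the same iteration count; the dynamic version simply trades a compile-time constant for a kernel argument, which is what collapses the instantiation count.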
Thirdly, branch divergence is minimized by adding a flag indicating whether the work-group can process full sequences of the input array. If so, branching within the inner kernel can be removed. If not, all work-items in a group follow the same boundary-checked implementation.
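The full-sequence flag can be sketched as follows (a minimal host-side illustration, not the actual kernel): the bounds check is hoisted out of the inner loop when the group is known to cover a full tile, and otherwise every work-item uniformly takes the boundary-checked path, so work-items within a group never diverge.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: `is_full` is computed once per work-group.
// true  -> fast path with no per-element bounds check;
// false -> boundary-checked path taken uniformly by the whole group.
int sum_tile(const std::vector<int>& in, std::size_t start, std::size_t tile, bool is_full)
{
    int acc = 0;
    if (is_full)
    {
        // Fast path: the compiler can vectorize freely, no branch in the loop.
        for (std::size_t j = 0; j < tile; ++j)
            acc += in[start + j];
    }
    else
    {
        // Slow path: every element is bounds-checked, but since all
        // work-items in the group run this same code, there is no divergence.
        for (std::size_t j = 0; j < tile; ++j)
            if (start + j < in.size())
                acc += in[start + j];
    }
    return acc;
}
```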
Given the complexity of the indexing changes and the limited user benefits, I think it's best to postpone this to the 2022.7 release.
@julianmi, what do you think: should we introduce some type for the union
union __storage
{
    _Tp __v;
    __storage() {}
};
?
I've added a union type to reduce the code duplication.
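For context on why a union helps here, a small sketch of the idea (illustrative names, not oneDPL's internals): wrapping _Tp in a union suppresses its default construction, so per-work-item accumulator storage can be declared before the first loaded value is available, even when _Tp has no default constructor. The value is then constructed in place once it is known.

```cpp
#include <cassert>
#include <new>

// A type with no default constructor, standing in for a user value type.
struct _NoDefault
{
    int __x;
    _NoDefault() = delete;
    explicit _NoDefault(int __v) : __x(__v) {}
};

// The union's empty constructor leaves __v unconstructed; the user
// placement-news the value when it is first available and destroys it
// explicitly if _Tp is non-trivially destructible.
template <typename _Tp>
union __storage
{
    _Tp __v;
    __storage() {}  // does not construct __v
    ~__storage() {} // member lifetime is managed manually
};

int demo()
{
    __storage<_NoDefault> __s;       // legal even though _NoDefault() is deleted
    ::new (&__s.__v) _NoDefault(42); // construct in place when the value is ready
    int __r = __s.__v.__x;
    __s.__v.~_NoDefault();           // explicit destruction
    return __r;
}
```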