oneDPL icon indicating copy to clipboard operation
oneDPL copied to clipboard

Implement vectorized brick for in-place exclusive scan

Open danhoeflinger opened this issue 2 years ago • 0 comments

As described in the comment below, it should be possible to implement vectorized in-place exclusive scan, rather than turning off vectorization for those cases.

          I agree that in-place exclusive scan cannot be expressed in a spec compliant way. However, I believe it is vectorizable, with the algorithm being like this (for the vector size of 4):
  1. capture __first[i..i+3] into a vector v[0..3]
  2. build a vector sv = {__init, __init+v[0], __init+v[0]+v[1], __init+v[0]+v[1]+v[2]}
  3. write sv to __result[i..i+3]
  4. __init += v[0]+v[1]+v[2]+v[3]
  5. i+=4

and it sort of matches the code that Dan has suggested:

for( int i=0; i<N; ++i) {
    auto v = __first[i];  // input phase?
    __result[i] = __init; // scan phase
    __init += v;          // input phase
}

Also note that in the above vectorization algorithm and a one for inclusive scan would only differ in the step 2, by shifting values in sv to the "left", and appending by the same value as computed in step 4.

UPD: Maybe the spec can be fixed by e.g. allowing intra-iteration dependencies from the scan phase to the input phase. Obviously, that would only apply to the exclusive scan, as in the inclusive scan the order of phases is opposite and such dependencies cannot exist. But I have not thought how that would affect more complex code patterns as well as implementations for parallel loops.

Originally posted by @akukanov in https://github.com/oneapi-src/oneDPL/pull/1037#discussion_r1327130006

danhoeflinger avatar Sep 18 '23 14:09 danhoeflinger