Implement vectorized brick for in-place exclusive scan
As described in the comment below, it should be possible to implement vectorized in-place exclusive scan, rather than turning off vectorization for those cases.
I agree that in-place exclusive scan cannot be expressed in a spec compliant way. However, I believe it is vectorizable, with the algorithm being like this (for the vector size of 4):
- capture
__first[i..i+3]into a vectorv[0..3] - build a vector
sv = {__init, __init+v[0], __init+v[0]+v[1], __init+v[0]+v[1]+v[2]} - write
svto__result[i..i+3] -
__init += v[0]+v[1]+v[2]+v[3] -
i+=4
and it sort of matches the code that Dan has suggested:
for( int i=0; i<N; ++i) {
auto v = __first[i]; // input phase?
__result[i] = __init; // scan phase
__init += v; // input phase
}
Also note that in the above vectorization algorithm and a one for inclusive scan would only differ in the step 2, by shifting values in sv to the "left", and appending by the same value as computed in step 4.
UPD: Maybe the spec can be fixed by e.g. allowing intra-iteration dependencies from the scan phase to the input phase. Obviously, that would only apply to the exclusive scan, as in the inclusive scan the order of phases is opposite and such dependencies cannot exist. But I have not thought how that would affect more complex code patterns as well as implementations for parallel loops.
Originally posted by @akukanov in https://github.com/oneapi-src/oneDPL/pull/1037#discussion_r1327130006