Outer-loop vectorizer fails
This kernel here:
https://gist.github.com/inducer/8f7cd72829c85acc1d3fcb9c4a5dae05
gets vectorized into full-width 256-bit vectors without a problem by both Intel CL and ispc. (Note how the workgroup size already conveniently matches the expected vector width.) Yet pocl seems to generate scalar SSE code. How might I convince it to do more on the vectorization front?
cc @lcw
FWIW, I'm on f68ffcc on a Xeon E5-2620v3.
Interesting test case for outer loop vectorization, as vectorizing across the X dimension should lead to nice wide vector accesses. Hopefully I'll have time at some point to check why it doesn't vectorize.
Meanwhile, if you wish to take a look, all the intermediate results from the compiler can be kept for inspection, and you should be able to see in the LLVM IR what's wrong. My first suspect is that the outer loop vectorization doesn't apply, i.e., it cannot prove the uniformity of the loop variables for one reason or another.
ImplicitLoopBarrier.cc, at around line 154, tries to inject an implicit barrier to force a parallel loop inside the sequential loop to accomplish this. My guess is that VUA.isUniform() fails to prove that the loop counts are uniform across WIs. Enabling debug printouts in VUA should give some insight.
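To make the intent of that transformation concrete, here is a minimal C sketch of the loop interchange involved. It is not pocl's generated code: `LOCAL_SIZE`, the function names, and the strided sum are assumptions chosen only for illustration. The first function has the work-item loop outermost; the second has the work-item loop inside the kernel's sequential loop, which is only legal because the trip count `n` is the same for every work-item.

```c
#include <assert.h>

/* Assumed work-group size: 8 floats fill one 256-bit vector. */
#define LOCAL_SIZE 8

/* Work-item loop outermost: each work-item runs the whole kernel loop.
 * The innermost loop walks `in` with stride LOCAL_SIZE, so a loop
 * vectorizer looking at it sees no contiguous accesses. */
void wi_loop_outside(const float *in, float *out, int n)
{
    assert(n >= 0);
    for (int wi = 0; wi < LOCAL_SIZE; ++wi) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += in[k * LOCAL_SIZE + wi];
        out[wi] = acc;
    }
}

/* Work-item loop innermost: valid only because n is uniform, i.e.
 * every work-item executes the same number of iterations.
 * The innermost loop now reads LOCAL_SIZE consecutive floats. */
void wi_loop_inside(const float *in, float *out, int n)
{
    assert(n >= 0);
    float acc[LOCAL_SIZE] = {0};
    for (int k = 0; k < n; ++k)
        for (int wi = 0; wi < LOCAL_SIZE; ++wi)
            acc[wi] += in[k * LOCAL_SIZE + wi];
    for (int wi = 0; wi < LOCAL_SIZE; ++wi)
        out[wi] = acc[wi];
}
```

Both functions compute the same sums. If the uniformity of `n` cannot be proven, the second form cannot be assumed equivalent, and the strided first form is what the vectorizer is left with.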
See d622499. This case also revealed another issue, which we hopefully have time to fix sooner rather than later. The iteration variables are scalar-expanded, which in some cases might lead to unvectorizable outer loops. This seems to be one of those cases.
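As a rough illustration of why scalar expansion of an iteration variable can get in the way, consider the following C sketch. It is a hand-written analogy, not pocl's actual output; `LOCAL_SIZE` and the function names are hypothetical. In the second function the loop counter is replicated per work-item, so the load index depends on a value read from memory and no longer looks like a simple contiguous access.

```c
#include <assert.h>

/* Assumed work-group size, for illustration only. */
#define LOCAL_SIZE 8

/* One shared counter k: the inner loop indexes `in` with
 * k * LOCAL_SIZE + wi, which is visibly contiguous in wi. */
void uniform_counter(const float *in, float *out, int n)
{
    assert(n >= 0);
    float acc[LOCAL_SIZE] = {0};
    for (int k = 0; k < n; ++k)
        for (int wi = 0; wi < LOCAL_SIZE; ++wi)
            acc[wi] += in[k * LOCAL_SIZE + wi];
    for (int wi = 0; wi < LOCAL_SIZE; ++wi)
        out[wi] = acc[wi];
}

/* Scalar-expanded counter: every work-item carries its own copy k[wi].
 * All copies hold the same value, but the compiler has to prove that
 * before it can treat in[k[wi] * LOCAL_SIZE + wi] as a contiguous load
 * rather than a gather. */
void expanded_counter(const float *in, float *out, int n)
{
    assert(n >= 0);
    float acc[LOCAL_SIZE] = {0};
    int k[LOCAL_SIZE] = {0};
    while (k[0] < n) {
        for (int wi = 0; wi < LOCAL_SIZE; ++wi)
            acc[wi] += in[k[wi] * LOCAL_SIZE + wi];
        for (int wi = 0; wi < LOCAL_SIZE; ++wi)
            k[wi] += 1;
    }
    for (int wi = 0; wi < LOCAL_SIZE; ++wi)
        out[wi] = acc[wi];
}
```

The two functions return identical results; the difference is only in how much analysis is needed to see that the accesses are contiguous.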
Thanks for looking into this!