pocl Outerloop vectorizer fail

This kernel here:

https://gist.github.com/inducer/8f7cd72829c85acc1d3fcb9c4a5dae05

gets vectorized into full-width 256-bit vectors without a problem by both Intel CL and ispc. (Note how the workgroup size already conveniently matches the expected vector width.) Yet pocl seems to generate scalar SSE code. How might I convince it to do more on the vectorization front?

cc @lcw

May 12 '16 23:05 inducer

FWIW, I'm on f68ffcc on a Xeon E5-2620v3.

May 12 '16 23:05 inducer

Interesting test case for outer loop vectorization, since vectorizing across the X dimension should lead to nice wide vector accesses. Hopefully I will have time at some point to check why it doesn't vectorize.

Meanwhile, if you wish to take a look, all the intermediate results from the compiler can be kept around for inspection, and you should see in the LLVM IR what's wrong. My first suspicion is that the outer loop vectorization doesn't apply, i.e., it cannot analyze the uniformity of the loop variables for one reason or another.

ImplicitLoopBarrier.cc, at about line 154, tries to inject an implicit barrier to force a parallel loop inside the sequential loop to accomplish this. My guess is that VUA.isUniform() fails to prove that the loop counts are uniform across WIs. Enabling the debug printouts in VUA should give some insight.
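
To make the effect of that implicit barrier concrete, here is a rough C sketch of the loop interchange it is meant to enable. This is not pocl's actual generated code: LOCAL_SIZE, TRIP_COUNT, the function names and the accumulation kernel are all made up for illustration, and the sketch assumes the kernel loop's trip count really is uniform across work-items.

```c
/* Hypothetical sizes: 8 work-items (one 256-bit float vector) and a
 * kernel loop whose trip count is the same for every work-item. */
#define LOCAL_SIZE 8
#define TRIP_COUNT 4

/* Without the implicit barrier: the work-item loop is outermost, so each
 * work-item runs the whole sequential kernel loop by itself. The inner
 * loop strides through memory and gives the vectorizer little to use. */
void wi_loop_outer(const float *in, float *out)
{
    for (int lid = 0; lid < LOCAL_SIZE; ++lid) {
        float acc = 0.0f;
        for (int k = 0; k < TRIP_COUNT; ++k)
            acc += in[k * LOCAL_SIZE + lid];
        out[lid] = acc;
    }
}

/* With a barrier inside the kernel loop: the sequential loop becomes
 * outermost and the parallel work-item loop sits inside it. The private
 * variable acc needs one slot per work-item, and the inner loop now
 * touches LOCAL_SIZE consecutive floats per iteration. */
void wi_loop_inner(const float *in, float *out)
{
    float acc[LOCAL_SIZE];
    for (int lid = 0; lid < LOCAL_SIZE; ++lid)
        acc[lid] = 0.0f;
    for (int k = 0; k < TRIP_COUNT; ++k)
        for (int lid = 0; lid < LOCAL_SIZE; ++lid)
            acc[lid] += in[k * LOCAL_SIZE + lid];
    for (int lid = 0; lid < LOCAL_SIZE; ++lid)
        out[lid] = acc[lid];
}
```

Both functions compute the same result. The second form is only legal when every work-item executes the kernel loop the same number of times, which is why the uniformity check has to succeed before the barrier can be injected.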

May 13 '16 07:05 pjaaskel

See d622499. This case also revealed another issue, which we will hopefully have time to fix sooner rather than later: the iteration variables are scalar expanded, which in some cases can lead to unvectorizable outer loops. This seems to be one of those cases.

Sep 20 '16 07:09 pjaaskel

Thanks for looking into this!

Sep 22 '16 19:09 inducer