Pekka Jääskeläinen
Indeed, the parallel regions are pocl's way of producing implicit work-group level vectorization in a modular way. It creates parallel for loops out of the work-items and annotates them so...
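To make the "parallel for loops out of the work-items" idea concrete, here is a minimal hand-written sketch in plain C. It is not pocl's generated code: the kernel (`saxpy_work_item`), the work-group function (`saxpy_work_group`) and all parameter names are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical kernel body: what one work-item computes. */
static void saxpy_work_item(size_t local_id, float a, const float *x, float *y)
{
    y[local_id] = a * x[local_id] + y[local_id];
}

/* Hypothetical work-group function: the whole kernel body is a single
 * parallel region (no barriers), so it is wrapped in one loop over the
 * local ids. Iterations are independent, so a loop vectorizer is free
 * to treat this as a parallel loop. */
void saxpy_work_group(size_t local_size, float a, const float *x, float *y)
{
    for (size_t local_id = 0; local_id < local_size; ++local_id)
        saxpy_work_item(local_id, a, x, y);
}
```

Vectorizing this loop is what "work-group level vectorization" amounts to in the sketch: several work-items are executed per vector instruction.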
Loop interchange is supposed to take care of this decision automatically based on the memory access patterns. What if vectorizing over the "outer loop" (work items) is not the best...
Or further: Let's say the kernel has multiple parallel regions (parts isolated with barriers) with each having different memory access patterns, inner loops, and thus decisions whether to do outerloop...
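As a rough illustration of a kernel with more than one parallel region, the sketch below shows a barrier splitting the body into two work-item loops. The kernel, the fixed `LOCAL_SIZE` and the function name are assumptions made for the example, not output of pocl.

```c
#include <stddef.h>

enum { LOCAL_SIZE = 8 }; /* hypothetical fixed work-group size */

/* Hypothetical kernel, in OpenCL C terms:
 *   tmp[lid] = in[lid] * 2;
 *   barrier(CLK_LOCAL_MEM_FENCE);
 *   out[lid] = tmp[lid] + tmp[(lid + 1) % LOCAL_SIZE];
 */
void two_region_work_group(const int *in, int *out)
{
    int tmp[LOCAL_SIZE]; /* stands in for a __local scratch buffer */

    /* Parallel region 1: everything before the barrier. */
    for (size_t lid = 0; lid < LOCAL_SIZE; ++lid)
        tmp[lid] = in[lid] * 2;

    /* The barrier becomes the boundary between the two loops: region 1
     * has finished for every work-item before region 2 starts. */

    /* Parallel region 2: everything after the barrier. */
    for (size_t lid = 0; lid < LOCAL_SIZE; ++lid)
        out[lid] = tmp[lid] + tmp[(lid + 1) % LOCAL_SIZE];
}
```

Each region is a separate loop with its own access pattern, so in principle the vectorization decision could differ per region.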
You can disable the "outer loop vectorization by default" behavior by removing "implicit-loop-barriers" from the pass list in pocl_llvm_api.cc to see if it makes any difference in the cases of...
#340 refers to this same issue, I believe.
I think the issue is that the outer loop vectorization fails (albeit injecting the parallel loop inside the inner loop properly) due to the mysterious value that is being used...
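For readers unfamiliar with the two loop orders being discussed, here is a small sketch in plain C. Both functions are hypothetical and hand-written; they only show what "the parallel loop inside the inner loop" means, assuming the kernel's inner loop has the same trip count for every work-item.

```c
#include <stddef.h>

/* Work-item loop outermost: each work-item runs its own k loop.
 * Data layout is an assumption: element (k, lid) is m[k * local_size + lid]. */
void col_sums_wi_outer(size_t local_size, size_t n, const float *m, float *sums)
{
    for (size_t lid = 0; lid < local_size; ++lid) {
        float acc = 0.0f;
        for (size_t k = 0; k < n; ++k)
            acc += m[k * local_size + lid];
        sums[lid] = acc;
    }
}

/* Work-item loop moved inside the kernel's k loop: the innermost loop
 * now runs over work-items and reads consecutive addresses, which is
 * the shape an inner-loop vectorizer handles most easily. */
void col_sums_wi_inner(size_t local_size, size_t n, const float *m, float *sums)
{
    for (size_t lid = 0; lid < local_size; ++lid)
        sums[lid] = 0.0f;
    for (size_t k = 0; k < n; ++k)
        for (size_t lid = 0; lid < local_size; ++lid)
            sums[lid] += m[k * local_size + lid];
}
```

Both versions add the terms for a given work-item in the same order, so they produce the same result; only the memory access order and the loop the vectorizer sees differ.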
Perhaps a good basic heuristic for when to attempt outer loop vectorization is to attempt it by default (as it does now, but fails with LLVM 3.8 due to the...
I studied this case a bit, here's what I found: - The benefit of applying the outer loop vectorization depends on how long the inner loop body is. Because...
Oh. The reason the double8 version doesn't autovectorize further is that LLVM's LV (loop vectorizer) freaks out over the double8 that is being passed to the loop (via a PHI node). It has...
See d622499. That env should be used in this case for the time being.