Pekka Jääskeläinen comments

Results: 381 comments of Pekka Jääskeläinen

auto vectoriser generates S̶S̶E̶ scalar code on AVX2 and AVX512 targets

Indeed, the parallel regions are pocl's way of producing implicit work-group level vectorization in a modular way. It creates parallel for loops out of the work-items and annotates them so...
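To make the "parallel for loops out of the work-items" idea concrete, here is a minimal sketch in plain C. It is not pocl's generated code: `kernel_body` and `run_work_group` are hypothetical names, and the kernel itself (an element-wise add) is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel body: out[i] = a[i] + b[i].
   In OpenCL C, local_id would come from get_local_id(0). */
static void kernel_body(size_t local_id, const float *a, const float *b,
                        float *out) {
    out[local_id] = a[local_id] + b[local_id];
}

/* Conceptual work-group function: the work-items of one work-group are
   executed as iterations of an ordinary loop. The iterations do not depend
   on each other, which is what lets a loop vectorizer treat the loop as
   parallel once it is annotated as such. */
void run_work_group(size_t local_size, const float *a, const float *b,
                    float *out) {
    assert(a && b && out);
    for (size_t wi = 0; wi < local_size; ++wi)
        kernel_body(wi, a, b, out);
}
```

Calling `run_work_group(4, a, b, out)` runs four "work-items" back to back; vectorizing the `wi` loop is the implicit work-group level vectorization the comment refers to.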

auto vectoriser generates S̶S̶E̶ scalar code on AVX2 and AVX512 targets

Loop interchange is supposed to take care of this decision automatically based on the memory access patterns. What if vectorizing over the "outer loop" (work items) is not the best...
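As a rough illustration of what interchanging the work-item loop with a kernel's inner loop means, the sketch below computes the same per-work-item sums with both loop orders. The function names and the memory layout (`in[k * n_wi + wi]`) are assumptions chosen for the example, not taken from the issue.

```c
#include <assert.h>
#include <stddef.h>

/* Work-item loop outermost: each work-item runs the kernel's inner k loop.
   With this layout the inner loop strides through memory by n_wi. */
void sum_wi_outer(size_t n_wi, size_t n_k, const float *in, float *out) {
    assert(in && out);
    for (size_t wi = 0; wi < n_wi; ++wi) {
        float acc = 0.0f;
        for (size_t k = 0; k < n_k; ++k)
            acc += in[k * n_wi + wi];
        out[wi] = acc;
    }
}

/* Interchanged: work-item loop innermost. Consecutive work-items now touch
   consecutive addresses, so vectorizing over wi gives unit-stride accesses. */
void sum_wi_inner(size_t n_wi, size_t n_k, const float *in, float *out) {
    assert(in && out);
    for (size_t wi = 0; wi < n_wi; ++wi)
        out[wi] = 0.0f;
    for (size_t k = 0; k < n_k; ++k)
        for (size_t wi = 0; wi < n_wi; ++wi)
            out[wi] += in[k * n_wi + wi];
}
```

Both functions add the same values in the same order for each work-item, so the results match; which loop order vectorizes better depends on the access pattern, which is the decision being discussed here.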

auto vectoriser generates S̶S̶E̶ scalar code on AVX2 and AVX512 targets

Or further: Let's say the kernel has multiple parallel regions (parts isolated with barriers) with each having different memory access patterns, inner loops, and thus decisions whether to do outer loop...
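A kernel with one barrier can be pictured as two parallel regions, each with its own work-item loop. The sketch below is a hand-written approximation under that assumption; `run_two_regions` and the kernel it models are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Models a kernel of the form:
     tmp[lid] = 2 * in[lid];
     barrier(CLK_LOCAL_MEM_FENCE);
     out[lid] = tmp[lid] + tmp[(lid + 1) % local_size];
   The barrier splits the body into two regions, each looped separately. */
void run_two_regions(size_t local_size, const float *in, float *tmp,
                     float *out) {
    assert(in && tmp && out && local_size > 0);

    /* Region 1: everything before the barrier, for all work-items. */
    for (size_t wi = 0; wi < local_size; ++wi)
        tmp[wi] = 2.0f * in[wi];

    /* The barrier becomes the boundary between the two loops. */

    /* Region 2: everything after the barrier. Its access pattern differs
       from region 1, so the best vectorization choice may differ too. */
    for (size_t wi = 0; wi < local_size; ++wi)
        out[wi] = tmp[wi] + tmp[(wi + 1) % local_size];
}
```

Each loop can be analyzed on its own, which is why a single kernel-wide choice between inner and outer loop vectorization may not fit every region.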

auto vectoriser generates S̶S̶E̶ scalar code on AVX2 and AVX512 targets

You can disable the "outer loop vectorization by default" behavior by removing "implicit-loop-barriers" from the pass list in pocl_llvm_api.cc to see if it makes any difference in the cases of...

auto vectoriser generates S̶S̶E̶ scalar code on AVX2 and AVX512 targets

#340 refers to this same issue, I believe.

auto vectoriser generates S̶S̶E̶ scalar code on AVX2 and AVX512 targets

I think the issue is that the outer loop vectorization fails (even though the parallel loop is injected inside the inner loop properly) due to the mysterious value that is being used...

auto vectoriser generates S̶S̶E̶ scalar code on AVX2 and AVX512 targets

Perhaps a good basic heuristic for when to attempt outer loop vectorization is to attempt it by default (like it now does but fails with LLVM 3.8 due to the...

auto vectoriser generates S̶S̶E̶ scalar code on AVX2 and AVX512 targets

I studied this case a bit, here's what I found: - The benefit of applying the outer loop vectorization depends on how long the inner loop body is. Because...

auto vectoriser generates S̶S̶E̶ scalar code on AVX2 and AVX512 targets

Oh. The reason the double8 version doesn't autovectorize further is that LLVM's LV chokes on the double8 that is being passed to the loop (via a PHI node). It has...

auto vectoriser generates S̶S̶E̶ scalar code on AVX2 and AVX512 targets

See d622499. That env should be used in this case for the time being.