noma comments

Results: 16 comments by noma

auto vectoriser generates S̶S̶E̶ scalar code on AVX2 and AVX512 targets

Thanks, that seems to be the issue here. Within some bounds, the compiler does vectorise the fully unrolled loop within the kernel. Performance is still poor, though. Also `-cl-fast-relaxed-math` is...

auto vectoriser generates S̶S̶E̶ scalar code on AVX2 and AVX512 targets

@pjaaskel Thanks for the insights. What I, in the role of an application developer using OpenCL on a SIMD machine, would like to have is plain and simple outer-loop vectorisation...

auto vectoriser generates S̶S̶E̶ scalar code on AVX2 and AVX512 targets

@eschnett Thanks, I wasn't aware of the second issue. With OpenMP SIMD directives, I noticed the Intel compiler replacing the division variant of the loop with a precise reciprocal from...

auto vectoriser generates S̶S̶E̶ scalar code on AVX2 and AVX512 targets

@franz Thanks for pointing out the OpenCL compiler options. The actual goal here is to achieve outer-"loop" vectorization across the work-items, i.e. the kernel being compiled into something like...
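
For readers unfamiliar with the term, here is a minimal sketch of what outer-loop vectorization across work-items could look like when written by hand in C with an OpenMP SIMD directive. It is an illustration under assumptions, not the code PoCL generates and not the continuation of the truncated comment above: `kernel_body`, `run_work_group`, and the toy arithmetic are hypothetical names and logic chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel body for a single work-item (assumption for this
 * sketch): accumulates inner_n scaled copies of the work-item's input. */
static inline float kernel_body(const float *in, size_t gid, int inner_n)
{
    float acc = 0.0f;
    /* The inner loop stays scalar within each work-item (each SIMD lane). */
    for (int k = 0; k < inner_n; ++k)
        acc += in[gid] * (float)(k + 1);
    return acc;
}

/* Hypothetical work-group loop: one iteration per work-item. The pragma
 * asks the compiler to map consecutive work-items to SIMD lanes; it is
 * ignored if OpenMP SIMD support is not enabled at compile time. */
void run_work_group(const float *in, float *out, size_t n_items, int inner_n)
{
    assert(in != NULL && out != NULL);
    #pragma omp simd
    for (size_t gid = 0; gid < n_items; ++gid)
        out[gid] = kernel_body(in, gid, inner_n);
}
```

With GCC or Clang, the directive is honoured when compiling with `-fopenmp-simd` (or `-fopenmp`); whether the loop is actually vectorised still depends on the target flags and the compiler's cost model.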

auto vectoriser generates S̶S̶E̶ scalar code on AVX2 and AVX512 targets

@Kazhuu Thanks, that's very interesting. Which loop exactly did you annotate? The inner loop in the kernel or the implicit outer loop, i.e. the code somewhere in PoCL that processes...

Use templated type for real_t

I thought about this when I added the double support, but did not want to break the API. I think a template solution only makes sense if: - users are...