noma
Thanks, that seems to be the issue here. Within some bounds, the compiler does vectorise the fully unrolled loop within the kernel. Performance is still poor, though. Also `-cl-fast-relaxed-math` is...
@pjaaskel Thanks for the insights. As an application developer using OpenCL on a SIMD machine, what I would like to have is plain and simple outer-loop vectorisation...
@eschnett Thanks, I wasn't aware of the second issue. With OpenMP SIMD directives, I noticed the Intel compiler replacing the division variant of the loop with a precise reciprocal from...
@franz Thanks for pointing out the OpenCL compiler options. The actual goal here is to achieve outer-"loop" vectorization across the work-items, i.e. the kernel being compiled into something like...
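For readers unfamiliar with the term, here is a minimal sketch in plain C of what outer-"loop" vectorization across work-items can mean. It is an illustration under stated assumptions, not PoCL's generated code and not a reconstruction of the truncated comment above: `work_item`, `run_work_group`, the fixed inner trip count of 4, and the use of `#pragma omp simd` on the work-item loop are all hypothetical choices made for this example.

```c
#include <stddef.h>

/* Hypothetical kernel body: one "work-item" computes one output element.
   The short inner loop over k stays scalar per work-item. */
static inline float work_item(const float *in, size_t gid)
{
    float acc = 0.0f;
    for (int k = 0; k < 4; ++k)
        acc += in[gid] * (float)(k + 1);
    return acc;
}

/* Hypothetical work-group loop: the implicit loop over work-items is made
   explicit, and the pragma asks the compiler to map its iterations to SIMD
   lanes (assumption: built with OpenMP SIMD support, e.g. -fopenmp-simd;
   without it the pragma is ignored and the loop runs as ordinary scalar code). */
void run_work_group(const float *in, float *out, size_t local_size)
{
    #pragma omp simd
    for (size_t lid = 0; lid < local_size; ++lid)
        out[lid] = work_item(in, lid);
}
```

The point of the sketch is only the placement of the vectorization request: on the loop over work-items rather than on the loop inside the kernel body.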
@Kazhuu Thanks, that's very interesting. Which loop exactly did you annotate? The inner loop in the kernel or the implicit outer loop, i.e. the code somewhere in PoCL that processes...
I thought about this when I added the double support, but did not want to break the API. I think a template solution only makes sense if: - users are...