auto vectoriser generates ~~SSE~~ scalar code on AVX2 and AVX512 targets
I looked at the generated assembler code for auto vectorisation across work-items, and for manual vectorisation using 4 and 8 width vector types.
File: kernel_auto.cl
__kernel void benchmark_op(
    __global double const* restrict in_a,
    __global double const* restrict in_b,
    __global double* restrict out)
{
    int id = get_global_id(0);
    double tmp_a = in_a[id];
    double tmp_b = in_b[id];
#pragma unroll(100)
    for (int i = 0; i < 10000; ++i)
        tmp_a = tmp_b + tmp_a;
    out[id] = tmp_a;
}
export POCL_DEBUG=1
poclcc kernel_auto.cl
# relevant part of the output
[2016-07-18 18:48:53.843772262] POCL: in fn llvm_codegen at line 120:
*** INFO *** executing [$HOME/Software/llvm_3.8.1/bin/clang++ --target=x86_64-unknown-linux-gnu -D_CL_DISABLE_HALF -shared -lm -o $HOME/.cache/pocl/kcache/EE/AGEBDCFOKAELMKFALHFAEEGBKFEBNANBDBBAL/benchmark_op/0-0-0/benchmark_op.so $HOME/.cache/pocl/kcache/EE/AGEBDCFOKAELMKFALHFAEEGBKFEBNANBDBBAL/benchmark_op/0-0-0/benchmark_op.so.o]
# disassemble
objdump -d $HOME/.cache/pocl/kcache/EE/AGEBDCFOKAELMKFALHFAEEGBKFEBNANBDBBAL/benchmark_op/0-0-0/benchmark_op.so
# ...
00000000000007d0 <_pocl_launcher_benchmark_op>:
# unrolled loop
860: 62 f1 ff 08 58 c9 vaddsd %xmm1,%xmm0,%xmm1
866: 62 f1 ff 08 58 c9 vaddsd %xmm1,%xmm0,%xmm1
86c: 62 f1 ff 08 58 c9 vaddsd %xmm1,%xmm0,%xmm1
872: 62 f1 ff 08 58 c9 vaddsd %xmm1,%xmm0,%xmm1
878: 62 f1 ff 08 58 c9 vaddsd %xmm1,%xmm0,%xmm1
#...
The registers should be ymm or zmm for AVX2 and AVX512, respectively.
Same kernel with manual vectorisation, to be launched with an eighth of the work-items, giving exactly the same results as above. File: kernel_manual_8.cl
__kernel void benchmark_op(
    __global double8 const* restrict in_a,
    __global double8 const* restrict in_b,
    __global double8* restrict out)
{
    int id = get_global_id(0);
    double8 tmp_a = in_a[id];
    double8 tmp_b = in_b[id];
#pragma unroll(100)
    for (int i = 0; i < 10000; ++i)
        tmp_a = tmp_b + tmp_a;
    out[id] = tmp_a;
}
Same procedure as for auto; the assembler looks like:
# unrolled loop
860: 62 f1 fd 48 58 c9 vaddpd %zmm1,%zmm0,%zmm1
866: 62 f1 fd 48 58 c9 vaddpd %zmm1,%zmm0,%zmm1
86c: 62 f1 fd 48 58 c9 vaddpd %zmm1,%zmm0,%zmm1
872: 62 f1 fd 48 58 c9 vaddpd %zmm1,%zmm0,%zmm1
878: 62 f1 fd 48 58 c9 vaddpd %zmm1,%zmm0,%zmm1
# ...
To my understanding, the auto-vectoriser should directly leverage whatever LLVM is capable of, and some vectorisation does take place. Maybe there is just some target specification missing and it falls back to SSE on that path, but I am not familiar enough with the code to figure that out quickly.
Note: this is not really SSE code; it is actually scalar code (vaddsd is a scalar double add). I don't think architecture detection is the problem here; rather, the code is not vectorized at all.
Any ideas on how to make it vectorize?
I can't see why OpenCL with auto vectorisation across work-items would not vectorise that, or maybe I have a wrong idea of PoCL's strategy here.
export POCL_VECTORIZER_REMARKS=1
poclcc kernel_auto.cl
remark: <unknown>:0:0: loop not vectorized: cannot prove it is safe to reorder floating-point operations
remark: <unknown>:0:0: loop not vectorized: use -Rpass-analysis=loop-vectorize for more info
But I have no idea how to pass -Rpass-analysis=loop-vectorize through; it does not work as an OpenCL compile option.
My guess is: you have a loop in the code that throws off the vectorizer. The vectorizer will probably target the innermost loop, which is the one you wrote, and which contains a reduction operation. The vectorizer's remark sounds as if, given a -ffast-math flag (I don't recall what this flag is called for OpenCL), it might be willing to reorder the operations.
Alternatively, you can unroll the loop completely, maybe also reducing its size. If the vectorizer doesn't see your loop, it might be more willing to vectorize across work items.
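For illustration, a sketch of that variant (the shorter trip count is only an example, so the fully unrolled body stays within reason):

// fully unroll a shorter loop so the vectorizer sees straight-line code
#pragma unroll
for (int i = 0; i < 100; ++i)
    tmp_a = tmp_b + tmp_a;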
Thanks, that seems to be the issue here. Within some bounds, the compiler does vectorise the fully unrolled loop within the kernel. Performance is still poor, though. Also, -cl-fast-relaxed-math is similar to -ffast-math and to some extent allows vectorisation without complete unrolling.
I had some other pattern in mind, which is applied by the Intel OpenCL SDK, where vectorisation happens across work-items within one group. Basically, the compiler generates a vector function from the OpenCL kernel (like what you get from #pragma omp declare simd).
Isn't this mode of vectorisation supported by PoCL, too? If not, it would be a great addition, since work-items have no implicit sequential order like the iterations of a for loop, which would otherwise force the compiler to make pessimistic assumptions.
After doing some more detailed reading, I think the parallel regions (between possible barriers) are very similar. So I am not sure whether the LLVM vectoriser message applies to the inner loop, or to some outer loop across work-items introduced by PoCL. Although, if I annotate the inner loop with #pragma clang loop vectorize(disable) or #pragma clang loop vectorize_width(8), nothing changes, so it's probably some outer work-item loop.
I did some more testing and also got messages like the following. It seems tough to convince the auto-vectorizer to treat the work-items as more independent than for-loop iterations.
remark: <unknown>:0:0: loop not vectorized: value that could not be identified as reduction is used outside the loop
....
remark: <unknown>:0:0: loop not vectorized: loop control flow is not understood by vectorizer
...
remark: <unknown>:0:0: the cost-model indicates that vectorization is not beneficial
Indeed, the parallel regions are pocl's way of producing implicit work-group-level vectorization in a modular way. It creates parallel for loops out of the work-items and annotates them so they do not need loop dependency analysis in the loop vectorizer. The longer story can be read in the pocl paper listed on the publications page. Unfortunately, as I mentioned on the ML, our group hasn't had the time to pay enough attention to this during the past LLVM releases, so there is some bit rot.
The particular issue you see is one I also noted recently. It seems some variable is used outside the parallel loop and it confuses the vectorizer. I think there was some effort in upstream LLVM to tackle this case, so it might be improved once we have pocl working with LLVM 3.9. If not, then we need to track down the value used outside the loop and try to eliminate it. It usually helps to find the parallel loop in parallel.bc and inspect it there.
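For example, assuming parallel.bc is written next to the .so in the kernel cache directory shown earlier (the path is reused from the log above):

# disassemble the bitcode into readable IR, then look for the
# work-item loop and its parallel-loop metadata
llvm-dis $HOME/.cache/pocl/kcache/EE/AGEBDCFOKAELMKFALHFAEEGBKFEBNANBDBBAL/benchmark_op/0-0-0/parallel.bc -o parallel.ll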
Further, it currently always tries to outer-loop vectorize when it thinks it's safe. This means it tries to push the WI loop inside any kernel sequential loops, assuming this usually gives more freedom to later passes. This could be improved via adoption of the loop interchange pass that has now entered LLVM: it could decide more intelligently which loop should be the outer one, based on vectorization efficiency.
I notice the following:
- If the loop is completely unrolled, the generated code will exceed the size of the L1 instruction cache, slowing things down considerably.
- With -ffast-math or similar, the compiler will likely fold the loop into the single statement tmp_a += 10000 * tmp_b.
- In this case, the code will perform 2 floating-point operations while accessing 24 bytes of memory, for an arithmetic intensity of 1/12 flop/byte. To run efficiently, most systems require an intensity of at least 10 flop/byte. The code's performance will thus be limited by memory accesses, and vectorization will not lead to a significant (or even measurable) speed improvement.
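To make the last point concrete, my sketch of what the folded kernel body effectively does per work-item (assuming the fast-math folding above):

double tmp_a = in_a[id];            // 8-byte load
double tmp_b = in_b[id];            // 8-byte load
out[id] = tmp_a + 10000.0 * tmp_b;  // 1 mul + 1 add = 2 flops, 8-byte store
// => 2 flops / 24 bytes = 1/12 flop/byte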
@pjaaskel Thanks for the insights. What I, in the role of an application developer using OpenCL on a SIMD machine, would like to have is plain and simple outer loop vectorisation across the work-items. The compiler unpredictably moving the vectorisation loop around messes up memory access patterns. I do a lot of code optimisation/modernisation, and even though compiler vectorisation has gotten better and better over the years, there is no automatic solution for optimising memory layouts for contiguous SIMD load/store to avoid costly gather/scatter operations. As a developer, that's something you have to figure out for your specific problem, which means you have to know exactly how the code will be vectorised, even though the compiler does it for you. Compilers with different strategies, or strategies that change over time, make it really annoying to write performance-portable SIMD code.
Loop interchange is supposed to take care of this decision automatically, based on the memory access patterns. What if vectorizing over the "outer loop" (work-items) is not the best choice for the kernel at hand, due to a kernel inner loop that is more vectorizable without gather/scatter? Then it's again something some other OpenCL application developer would get annoyed with.
Or further: let's say the kernel has multiple parallel regions (parts isolated by barriers), each with different memory access patterns and inner loops, and thus different decisions on whether to do outer-loop or inner-loop vectorization. How would you do this without compiler automatism? Just trying to understand what the general best way is here.
You can disable the "outer loop vectorization by default" behavior by removing "implicit-loop-barriers" from the pass list in pocl_llvm_api.cc, to see if it makes any difference in the cases of your interest.
@eschnett Thanks, I wasn't aware of the second issue. With OpenMP SIMD directives, I noticed the Intel compiler exchanging the division variant of the loop for a precise AVX-512 reciprocal outside the loop and a multiplication within it, which makes a big difference compared with AVX2. The original goal of that micro-benchmark was to collect performance results for all arithmetic operations and math built-ins. In theory it should generate lots of operations working on the same registers over and over again, to cancel out memory bandwidth. The question for me was how much faster AVX-512 is in comparison with AVX2, as there are differences in frequency and available instructions; of course, you can currently only test that on KNL. The Intel compilers seem to play along with that idea better. ;-)
You can compare the PoCL result with what Intel OpenCL does, if you get their SDK:
/opt/intel/opencl-1.2-sdk-5.0.0.62/bin/ioc64 -input=kernel_auto.cl -asm=kernel_auto.asm -simd=avx2
In theory, an AVX-512-vectorized PoCL kernel should be around twice as fast as an AVX2-vectorized Intel OpenCL kernel. I haven't managed to find a scenario for these simple benchmark kernels that yields that result for simple arithmetic. Maybe the barrier calls mentioned by @pjaaskel on the mailing list are one reason. At least for the manual vectorisation I currently have no other explanation.
#340 refers to this same issue, I believe.
I think the issue is that the outer loop vectorization fails (albeit injecting the parallel loop inside the inner loop properly) due to the mysterious value used outside the loop that LLVM 3.8's vectorizer chokes on. We are getting close to having LLVM 3.9 ported, so we shall see if this just disappears with it.
Perhaps a good basic heuristic for when to attempt outer loop vectorization is to attempt it by default (as it does now, but fails with LLVM 3.8 due to the variable-use issue) for multi-WI work-groups, and to disable it for single-WI WGs, where it's more likely the kernel itself has been optimized for vectorization, either using explicit data types or vectorizable inner loops.
I studied this case a bit, here's what I found:
- Whether applying the outer loop vectorization is beneficial depends on how long the inner loop body is. Because the implicit WG vectorization adds new loads and stores, due to the need to save some of the variables in the context, it can become more costly than vectorizing over the original loop per work-item. In this case at least tmp_a needs to be context-saved for each WI. When this is beneficial is not trivial to say in general. Currently the outer loop vectorization is disabled when there are no barriers in the original kernel; it needs a better heuristic, based on the loop body length or similar.
- get_global_id() returns size_t. If it's assigned to an int, LLVM inserts additional masking operations into the IR, which again make it more costly to vectorize the inner loop. It would usually be worthwhile for the compiler to promote the type of the variable the ID is stored in to size_t. When I changed it to size_t manually in your kernel (see the sketch after this list), the loop vectorizer's cost analysis started to say vectorization is beneficial for the main loop, producing a prettily vectorized loop in the end:
vector.body:                                      ; preds = %vector.body, %pregion_for_entry.entry.i
  %index = phi i32 [ 0, %pregion_for_entry.entry.i ], [ %index.next, %vector.body ]
  %vec.phi = phi <4 x double> [ %17, %pregion_for_entry.entry.i ], [ %18, %vector.body ]
  %vec.phi5 = phi <4 x double> [ zeroinitializer, %pregion_for_entry.entry.i ], [ %19, %vector.body ]
  %vec.phi6 = phi <4 x double> [ zeroinitializer, %pregion_for_entry.entry.i ], [ %20, %vector.body ]
  %vec.phi7 = phi <4 x double> [ zeroinitializer, %pregion_for_entry.entry.i ], [ %21, %vector.body ]
  %18 = fadd fast <4 x double> %vec.phi, %broadcast.splat12
  %19 = fadd fast <4 x double> %vec.phi5, %broadcast.splat12
  %20 = fadd fast <4 x double> %vec.phi6, %broadcast.splat12
  %21 = fadd fast <4 x double> %vec.phi7, %broadcast.splat12
  %index.next = add i32 %index, 16
  %22 = icmp eq i32 %index.next, 10000
  br i1 %22, label %middle.block, label %vector.body, !llvm.loop !19
- When reordering/parallelizing float ops, you need to relax the float requirements with the OpenCL switches. I used: bin/poclcc -b '-cl-unsafe-math-optimizations -cl-fast-relaxed-math' simple.cl
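For reference, the size_t variant of the kernel from the second point above; the only change is the type of id:

__kernel void benchmark_op(
    __global double const* restrict in_a,
    __global double const* restrict in_b,
    __global double* restrict out)
{
    size_t id = get_global_id(0); // was int; avoids the masking ops in the IR
    double tmp_a = in_a[id];
    double tmp_b = in_b[id];
#pragma unroll(100)
    for (int i = 0; i < 10000; ++i)
        tmp_a = tmp_b + tmp_a;
    out[id] = tmp_a;
}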
I will commit some small updates soon, but I consider this not a regression, as new features are needed. Our group is not using Intel CPUs for our experiments at the moment, so optimizing the autovec for them is not a high priority; ARM NEON actually is, a bit more. This is to say: feel free to take a stab at it if you want. My offer to help as much as I can is still on the table.
Oh. The reason the double8 version doesn't autovectorize further is that LLVM's LV freaks out over the double8 being passed into the loop (via a PHI node). It has some scalarization code that could help here, but it currently doesn't apply to vectors. Thus this case is likely to improve when the LLVM LV is improved.
See d622499. That env var should be used in this case for the time being.
Hi, I'm working with @noma on this issue. I tried to use the variable added in https://github.com/pocl/pocl/commit/d6224995936129a2eaef84f2401dd0ef38444500. Unfortunately, I can't see any difference between setting POCL_FORCE_PARALLEL_OUTER_LOOP and not: there is no runtime difference, and the binary files generated by poclcc are the same. I used the following script:
KERNEL_FILE=kernel_auto.cl
POCLCC=/home/new2f7/pocl/bin/poclcc
CACHE_FILE=/home/new2f7/.cache/pocl/kcache/ED/ANOOKMOJGMOCDOIFKIAAOFMJOAOJLOOLLEKHJ/benchmark_op/0-0-0/benchmark_op.so
export POCL_DEBUG=1
unset POCL_FORCE_PARALLEL_OUTER_LOOP
$POCLCC $KERNEL_FILE > default.log 2>&1
mv $KERNEL_FILE.pocl default.cl.pocl
mv $CACHE_FILE default.bin
objdump -d default.bin > default.objdump
export POCL_FORCE_PARALLEL_OUTER_LOOP=1
$POCLCC $KERNEL_FILE > outer.log 2>&1
mv $KERNEL_FILE.pocl outer.cl.pocl
mv $CACHE_FILE outer.bin
objdump -d outer.bin > outer.objdump
diff default.log outer.log
diff default.cl.pocl outer.cl.pocl
diff default.bin outer.bin
diff default.objdump outer.objdump
This script generates the following output:
new2f7@host:~/pocl_kernel_test> ./script.sh
1c1
< [2018-03-16 11:11:16.004905577] POCL: in fn pocl_init_devices at line 398:
---
> [2018-03-16 11:11:17.523188782] POCL: in fn pocl_init_devices at line 398:
3c3
< [2018-03-16 11:11:16.746156487] POCL: in fn compile_and_link_program at line 506:
---
> [2018-03-16 11:11:18.212818086] POCL: in fn compile_and_link_program at line 506:
5c5
< [2018-03-16 11:11:16.746918864] POCL: in fn compile_and_link_program at line 561:
---
> [2018-03-16 11:11:18.213594763] POCL: in fn compile_and_link_program at line 561:
7c7
< [2018-03-16 11:11:17.273258882] POCL: in fn llvm_codegen at line 180:
---
> [2018-03-16 11:11:18.555749062] POCL: in fn llvm_codegen at line 180:
9c9
< [2018-03-16 11:11:17.285204490] POCL: in fn llvm_codegen at line 189:
---
> [2018-03-16 11:11:18.559093280] POCL: in fn llvm_codegen at line 189:
11c11
< [2018-03-16 11:11:17.285465032] POCL: in fn pocl_run_command at line 1050:
---
> [2018-03-16 11:11:18.559602934] POCL: in fn pocl_run_command at line 1050:
13c13
< [2018-03-16 11:11:17.318534555] POCL: in fn pocl_check_dlhandle_cache at line 863:
---
> [2018-03-16 11:11:18.641303898] POCL: in fn pocl_check_dlhandle_cache at line 863:
15c15
< [2018-03-16 11:11:17.320121018] POCL: in fn pocl_binary_serialize at line 640:
---
> [2018-03-16 11:11:18.643171873] POCL: in fn pocl_binary_serialize at line 640:
17c17
< [2018-03-16 11:11:17.321226167] POCL: in fn serialize_kernel_cachedir at line 393:
---
> [2018-03-16 11:11:18.644257512] POCL: in fn serialize_kernel_cachedir at line 393:
2c2
< default.bin: file format elf64-x86-64
---
> outer.bin: file format elf64-x86-64
new2f7@host:~/pocl_kernel_test>
@new2f7 some time ago we added the "standard" (-O3) optimization passes early into the kernel passes, and they are now run before the implicit-loop-barriers pass, which seems to interact with POCL_FORCE_PARALLEL_OUTER_LOOP.
But the main problem with this bug report is this loop:
for (int i = 0; i < 10000; ++i)
    tmp_a = tmp_b + tmp_a;
LLVM may unroll this, but it will not autovectorize it unless you build with unsafe/fast-math options (as @pjaaskel pointed out earlier). The reason is that floating-point addition is not associative, so a vectorized loop could produce different results. LLVM refusing to vectorize that loop is therefore the correct behaviour; if you try with -cl-unsafe-math-optimizations -cl-fast-relaxed-math, it should get vectorized.
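For completeness, the same options can also be passed from the host API; a minimal sketch, assuming the program and device objects already exist:

#include <CL/cl.h>

/* assumes: cl_program program and cl_device_id device were created earlier */
const char *opts = "-cl-unsafe-math-optimizations -cl-fast-relaxed-math";
cl_int err = clBuildProgram(program, 1, &device, opts, NULL, NULL);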
@franz thanks for pointing out the OpenCL compiler options.
The actual goal here is to achieve outer-"loop" vectorization across the work-items, i.e. the kernel being compiled into something like a SIMD function. For this vectorisation strategy, the non-associativity shouldn't matter.
I've also looked into this and decided to write down my findings before I forget them. The problem at the moment, the reason this doesn't get vectorized, is that LLVM's loop vectorizer only supports inner loop vectorization, even though all outer work-item loops are marked parallel by pocl using LLVM metadata. If inner loop vectorization fails, no other loops are tried in that function.
In this case the inner loop contains a floating-point reduction, so the vectorizer refuses to vectorize it because of the floating-point reordering, as already mentioned before, and as can be seen from this remark emitted when using pocl's POCL_VECTORIZER_REMARKS option: remark: <unknown>:0:0: loop not vectorized: cannot prove it is safe to reorder floating-point operations
More details: the LLVM loop vectorizer is implemented as a function pass, so it's executed once per function. This pass always selects the innermost loop to process when the vectorization plan (VPlan) is not enabled and the loop is not explicitly marked for vectorization with a pragma. The innermost loop is selected recursively by the function collectSupportedLoops. Each selected loop is processed by the processLoop function, which exits when its call to Requirements.doesNotMeet returns true; the remark above is also emitted during this call. processLoop then returns false and that loop's processing is done. Since the innermost loop was the only loop selected, the vectorization pass is done for this function.
I tried to enable the vectorization plan using the hidden option --enable-vplan-native-path for the opt tool (hidden options can be seen with opt --help-hidden), but so far it didn't change the result of the vectorization. pocl can be made to use this option by modifying the file pocl_llvm_utils.cc, similarly to how other options are added.
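A sketch of one way to do that from C++ (my assumption about the mechanism: LLVM's cl::opt flags can be set by parsing a synthetic command line; the helper name is made up):

#include "llvm/Support/CommandLine.h"

// Turn on the hidden loop-vectorizer path from host code; the option
// string is the same one that 'opt' accepts.
static void enableVPlanNativePath() {
  const char *argv[] = {"pocl", "-enable-vplan-native-path"};
  llvm::cl::ParseCommandLineOptions(2, argv);
}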
So it seems LLVM alone cannot vectorize this particular case easily at the moment. VPlan based vectorization is still in quite early stage in LLVM so it doesn't seem to help here.
...and the plan was to adapt the loop interchange pass to understand the parloop MD, to perhaps get the outer loops switched to inner loops (selectively, when it's beneficial), as well as to utilize the parloop MD to communicate that it's safe to reorder FP ops across the WI loop's iterations.
Documenting again. I actually found out that this kernel can be vectorized with LLVM VPlan (Vectorization Plan) by specifically marking the loop with the vectorization pragma, enabling the hidden VPlan option (-enable-vplan-native-path, see opt --help-hidden for more info), and disabling loop unrolling. Disabling loop unrolling was important because currently VPlan is only able to vectorize simple loops with an iteration count from 0 to n. I posted this as an LLVM issue here.
VPlan, in short, is LLVM's future framework for loop vectorization, and the existing inner loop vectorization is supposed to be refactored to use it. The problem is that progress is quite slow and VPlan is very unstable: it's missing legality and cost analysis, and the current work is to make VPlan execution itself more stable. So until then, other means of vectorizing kernels with loops have to be used, as @pjaaskel mentioned above.
@Kazhuu Thanks, that's very interesting. Which loop exactly did you annotate? The inner loop in the kernel, or the implicit outer loop, i.e. the code somewhere in PoCL that processes the work-items?
I annotated the implicit work-item loop that pocl adds around the kernel body. Of course, I tested this by hand without pocl, by adding the loop myself and annotating it with #pragma. You can find the corresponding C code here. Compile this with clang and disable unrolling with this command; you should get IR with vectorize-enable metadata.
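Roughly, the hand-written version looks like this (my reconstruction, not the exact linked file; the function name and the -fno-unroll-loops flag are my choices):

// implicit work-item loop written out by hand around the kernel body
void benchmark_op_wg(const double *restrict in_a,
                     const double *restrict in_b,
                     double *restrict out, int wg_size)
{
#pragma clang loop vectorize(enable)
    for (int id = 0; id < wg_size; ++id) { // the "work-item" loop
        double tmp_a = in_a[id];
        double tmp_b = in_b[id];
        for (int i = 0; i < 10000; ++i)
            tmp_a = tmp_b + tmp_a;
        out[id] = tmp_a;
    }
}

// e.g.: clang -O1 -fno-unroll-loops -emit-llvm -S function_vectorize.c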
Now, running this IR through the LLVM opt tool with this command:
opt function_vectorize.ll -o function_vectorize_vpath.ll -S --loop-vectorize --pass-remarks=loop-vectorize --pass-remarks-missed=loop-vectorize --pass-remarks-analysis=loop-vectorize --enable-vplan-native-path
you should get vectorized IR, which is here.
Feel free to ask more if needed!
@Kazhuu Thank you for the detailed description. I had a look at pocl's custom LLVM pass that adds the implicit work-item loops: WorkitemLoops.cc adds one loop for each dimension (x, y, z).
In the past, I had a detailed look at the performance of the Intel OpenCL implementation. We observed that Intel's performance with automatic vectorisation is near that of manual vectorisation if a two-dimensional NDRange is used, shaped such that one dimension equals the desired SIMD width and can be easily vectorised by the compiler:
// 1d-NDRange
{problem_size, 1, 1}
// 2d-NDRange, as with Intel
{simd_width, problem_size / simd_width, 1}
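On the host side, the shaping amounts to something like this sketch (simd_width = 8 here; the variable names are illustrative):

size_t global[2] = {8, problem_size / 8}; /* {simd_width, problem_size / simd_width} */
size_t local[2]  = {8, 1};                /* keep one full SIMD row inside a work-group */
err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);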
I would like to try the same with pocl and tell the compiler to vectorise one of the loops added in WorkitemLoops.cc:
// work-item loop nest over 3 dims (x, y, z)
for (x; x < simd_width; ...) // force this loop to be vectorised, i.e. outer loop vectorisation
    for (y...)
        for (z...)
            kernel_function()
Since the loops are added while the kernel is in IR, I cannot use #pragma clang loop vectorize(enable). Do you know how to force the compiler to vectorise one of these loops?
Thanks for your insight about the Intel OpenCL implementation, @new2f7. At the moment I'm working on improving PoCL's outer loop vectorization myself, trying, as you mention above, to force vectorization of one of the generated work-item loops.
You can force outer loop vectorization in LLVM by providing the loop with metadata. This is the equivalent of having a #pragma clang loop vectorize(enable) line in the source code. With the metadata in place in the IR, use the -enable-vplan-native-path flag when compiling the code. This causes LLVM to use VPlan vectorization for the outer loop.
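In the IR, the standard loop metadata looks like this (the !IDs are arbitrary):

br i1 %exitcond, label %exit, label %for.body, !llvm.loop !0
...
!0 = distinct !{!0, !1}
!1 = !{!"llvm.loop.vectorize.enable", i1 true}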
However, the problem is that VPlan can be quite unstable and can only vectorize very specific or simple loop nests. For instance, I opened this LLVM issue regarding a VPlan vectorization crash. So LLVM's outer loop vectorization is quite a work in progress, and these are some of the issues I'm currently working on to improve PoCL's outer loop vectorization in general.
In the past I've looked into how Intel's OpenCL CPU implementation handles code generation, but it seems the CPU implementation is closed source and it's no longer possible to dump LLVM IR like it used to be. I've seen someone mention this here, on some PoCL issue.