Pointless loop unroll / vectorization
Bugzilla Link | 38280
Version | 6.0
OS | Windows NT
Depends On | llvm/llvm-project#40224
Reporter | LLVM Bugzilla Contributor
CC | @davidbolvansky, @DMG862, @fhahn, @hfinkel, @LebedevRI, @RKSimon, @rotateright
Extended Description
Example C++ code for x86, simplified from a more complex use case:
// ---- begin
#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>

// neg_offs <= -8 required
void apply_delta(uint8_t *dst, const uint8_t *src, ptrdiff_t neg_offs, size_t count)
{
    // Just provided for context
    while (count >= 8)
    {
        __m128i src_bytes = _mm_loadl_epi64((const __m128i *) src);
        __m128i pred_bytes = _mm_loadl_epi64((const __m128i *) (dst + neg_offs));
        __m128i sum = _mm_add_epi8(src_bytes, pred_bytes);
        _mm_storel_epi64((__m128i *) dst, sum);
        dst += 8;
        src += 8;
        count -= 8;
    }

    // This is the loop in question
    while (count--)
    {
        *dst = *src + dst[neg_offs];
        dst++;
        src++;
    }
}
// ---- end
The bottom (tail) loop gets expanded into a giant monstrosity that attempts to process 64 bytes at once, with various special-case paths for tail processing, to handle cases where neg_offs > -64 (which means the obvious 64-elements-at-a-time loop would not work), etc.
The full code can be viewed at https://godbolt.org/g/yRThcs; I won't post it here. :)
All of which is completely pointless because the tail loop will (as is easy to see) only ever see count <= 7.
This is an extreme example, but I'm seeing this general pattern (a scalar tail loop for a manually vectorized loop getting pointlessly auto-vectorized) a lot.
mentioned in issue llvm/llvm-project#40224
Maybe instcombine could fold it to just AND?
On a perhaps related note, the very simple example (https://godbolt.org/g/MToVLm):
unsigned func(unsigned count)
{
    while (count >= 32)
        count -= 32;
    return count;
}
Currently produces something like (thanks to indvarsimplify):
define i32 @test(i32 %size) {
entry:
  %0 = xor i32 %size, -1
  %1 = icmp ugt i32 %0, -32
  %umax = select i1 %1, i32 %0, i32 -32
  %2 = add i32 %umax, %size
  %3 = add i32 %2, 32
  %4 = and i32 %3, -32
  %5 = sub i32 %size, %4
  ret i32 %5
}
Which is really just: https://rise4fun.com/Alive/lKL
define i32 @test(i32 %size) {
  %0 = and i32 %size, 31
  ret i32 %0
}
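As a quick sanity check outside the compiler (my addition, not part of the original report), the loop can be compared against the single AND by brute force over a handful of inputs, including wrap-around boundaries:

#include <cassert>
#include <cstdint>
#include <cstdio>

// The function from the report: repeatedly subtract 32 until count < 32,
// i.e. count modulo 32, which for a power-of-two divisor is count & 31.
static unsigned func(unsigned count) {
    while (count >= 32)
        count -= 32;
    return count;
}

int main() {
    const unsigned samples[] = {0u, 1u, 31u, 32u, 33u, 1000u,
                                0x7fffffffu, 0xffffffffu};
    for (unsigned c : samples)
        assert(func(c) == (c & 31u));
    std::puts("loop == (count & 31) on all samples");
}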
Current Codegen: https://godbolt.org/z/qsMqxM9rn
define i32 @func(i32 %count) {
entry:
  %0 = add i32 %count, 31
  %umin = call i32 @llvm.umin.i32(i32 %count, i32 31)
  %1 = sub i32 %0, %umin
  %2 = and i32 %1, -32
  %3 = sub i32 %count, %2
  ret i32 %3
}
declare i32 @llvm.umin.i32(i32, i32)
https://alive2.llvm.org/ce/z/L4ni72
----------------------------------------
define i32 @src(i32 %count) {
%entry:
  %0 = add i32 %count, 31
  %umin = umin i32 %count, 31
  %1 = sub i32 %0, %umin
  %2 = and i32 %1, 4294967264
  %3 = sub i32 %count, %2
  ret i32 %3
}
=>
define i32 @tgt(i32 %size) {
%0:
  %m = and i32 %size, 31
  ret i32 %m
}
Transformation seems to be correct!
I think we're just missing the folds:
----------------------------------------
define i4 @src(i4 %x, i4 %y) {
%0:
  %a = add i4 %x, %y
  %m = umin i4 %x, %y
  %s = sub i4 %a, %m
  ret i4 %s
}
=>
define i4 @tgt(i4 %x, i4 %y) {
%0:
  %m = umax i4 %x, %y
  ret i4 %m
}
Transformation seems to be correct!
and
----------------------------------------
define i4 @src(i4 %x, i4 %y) {
%0:
  %a = add i4 %x, %y
  %m = umax i4 %x, %y
  %s = sub i4 %a, %m
  ret i4 %s
}
=>
define i4 @tgt(i4 %x, i4 %y) {
%0:
  %m = umin i4 %x, %y
  ret i4 %m
}
Transformation seems to be correct!
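Since i4 has only 16*16 input pairs, both folds can also be verified exhaustively with a small standalone program (illustrative only, not LLVM code), using 4-bit wrap-around arithmetic to mirror the Alive2 proofs:

#include <algorithm>
#include <cassert>
#include <cstdio>

int main() {
    const unsigned mask = 0xF; // i4
    for (unsigned x = 0; x <= mask; ++x) {
        for (unsigned y = 0; y <= mask; ++y) {
            unsigned add = (x + y) & mask;
            // (x + y) - umin(x, y) == umax(x, y)
            assert(((add - std::min(x, y)) & mask) == std::max(x, y));
            // (x + y) - umax(x, y) == umin(x, y)
            assert(((add - std::max(x, y)) & mask) == std::min(x, y));
        }
    }
    std::puts("both folds hold for all i4 inputs");
}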
@nikic Any thoughts?
@RKSimon Sounds reasonable. I guess a generalization here could be that (z + y) - umin(x, y) is z + usub.sat(y, x), which happens to be umax(x, y) for the case where z == y. Of course, that transform chain could introduce more one-use restrictions.
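A purely illustrative exhaustive check of that generalization over i8 (my addition, not from the review thread), with usub.sat modeled as saturating unsigned subtraction and everything wrapping at 8 bits:

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdio>

// usub.sat(a, b): unsigned subtraction clamped at zero.
static uint8_t usub_sat(uint8_t a, uint8_t b) { return a > b ? uint8_t(a - b) : 0; }

int main() {
    for (unsigned x = 0; x < 256; ++x)
        for (unsigned y = 0; y < 256; ++y)
            for (unsigned z = 0; z < 256; ++z) {
                // (z + y) - umin(x, y)
                uint8_t lhs = uint8_t(uint8_t(z + y) - std::min<unsigned>(x, y));
                // z + usub.sat(y, x); with z == y this is umax(x, y)
                uint8_t rhs = uint8_t(z + usub_sat(uint8_t(y), uint8_t(x)));
                assert(lhs == rhs);
            }
    std::puts("(z + y) - umin(x, y) == z + usub.sat(y, x) for all i8 inputs");
}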
cheers, I'll create a patch
https://reviews.llvm.org/D123399
It looks like ValueTracking is missing the ability to recognise when a phi's incoming values are constrained to a particular value range by the icmp/branch that guards them:
entry:
  %cmp23 = icmp ugt i64 %count, 7
  br i1 %cmp23, label %while.body, label %while.cond5.preheader
while.body:
  ....
  %cmp = icmp ugt i64 %sub, 7
  br i1 %cmp, label %while.body, label %while.cond5.preheader
while.cond5.preheader:
  %count.addr.0.lcssa = phi i64 [ %count, %entry ], [ %sub, %while.body ] ; both %count and %sub should be !(ugt 7)
  ....
  %min.iters.check = icmp ult i64 %count.addr.0.lcssa, 16 ; should be always true
@nikic @rotateright Are you familiar with phi handling in ValueTracking KnownBits at all?
computeKnownBits takes the common known bits from incoming values, but it doesn't use dominating conditions AFAICT. This might be a job for CVP?
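To illustrate the merge behaviour being described, here is a standalone sketch (not LLVM's actual KnownBits implementation): only bits known in every incoming value survive the merge, and a dominating condition such as "icmp ult %x, 8" is never consulted, which is why the range information is lost.

#include <cstdint>
#include <cstdio>

struct Known {
    uint64_t zero; // bits known to be 0
    uint64_t one;  // bits known to be 1
};

// Keep only the knowledge common to both incoming values of the phi.
static Known intersect(Known a, Known b) {
    return {a.zero & b.zero, a.one & b.one};
}

int main() {
    // %count on the entry edge: nothing known about the value itself.
    Known fromEntry{0, 0};
    // %sub on the loop edge: the guarding branch proves %sub u< 8, but the
    // value itself carries no known bits, so nothing is recorded either.
    Known fromLoop{0, 0};
    Known phi = intersect(fromEntry, fromLoop);
    std::printf("known-zero mask of the phi: 0x%llx\n",
                (unsigned long long)phi.zero); // 0: high bits not proven zero
}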
Candidate Patch: https://reviews.llvm.org/D131838
This fixed the vectorization issue - but not the unrolling.
I should add that the ValueTracking phi handling is post-fixing the problem, not preventing it: LoopVectorize still runs and adds all the vector code, but later passes then realise it's all dead code and remove it again....
The last remaining issue is the scalar loop unrolling: https://godbolt.org/z/f373Tc7oM
ValueTracking has failed to realise that the %count.addr.0.lcssa value is known to be 0 < v < 8:
while.body6.preheader: ; preds = %while.cond5.preheader
  %7 = add i64 %count.addr.0.lcssa, -1
  .....
while.body6.prol.loopexit: ; preds = %while.body6.prol, %while.body6.preheader
  %dst.addr.132.unr = phi ptr [ %dst.addr.0.lcssa, %while.body6.preheader ], [ %incdec.ptr.prol, %while.body6.prol ]
  %src.addr.131.unr = phi ptr [ %src.addr.0.lcssa, %while.body6.preheader ], [ %incdec.ptr9.prol, %while.body6.prol ]
  %count.addr.130.unr = phi i64 [ %count.addr.0.lcssa, %while.body6.preheader ], [ %dec.prol, %while.body6.prol ]
  %10 = icmp ult i64 %7, 7
  br i1 %10, label %while.end10, label %while.body6
#57635 discusses folding the %10 pattern to:
  %10 = icmp ule i64 %count.addr.0.lcssa, 7
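For reference (my addition, not from the issue), a brute-force check that the two compares agree whenever the operand is non-zero; at zero the add wraps and they differ, so the fold presumably relies on the count being known non-zero at this point:

#include <cassert>
#include <cstdint>
#include <cstdio>

int main() {
    // Check over all non-zero 16-bit values; the same reasoning carries over to i64.
    for (uint32_t x = 1; x <= 0xFFFF; ++x) {
        bool lhs = uint16_t(x - 1) < 7; // icmp ult (add x, -1), 7
        bool rhs = uint16_t(x) <= 7;    // icmp ule x, 7
        assert(lhs == rhs);
    }
    // The one disagreement: x == 0 (the subtraction wraps to the maximum value).
    assert((uint16_t(0 - 1) < 7) == false && (uint16_t(0) <= 7) == true);
    std::puts("compares agree for all non-zero 16-bit values");
}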