Pointless loop unroll / vectorization
Bugzilla Link | 38280
Version | 6.0
OS | Windows NT
Depends On | llvm/llvm-project#40224
Reporter | LLVM Bugzilla Contributor
CC | @davidbolvansky, @DMG862, @fhahn, @hfinkel, @LebedevRI, @RKSimon, @rotateright
Extended Description
Example C++ code for x86, simplified from a more complex use case:
// ---- begin
#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>

// neg_offs <= -8 required
void apply_delta(uint8_t *dst, const uint8_t *src, ptrdiff_t neg_offs, size_t count)
{
    // Just provided for context
    while (count >= 8)
    {
        __m128i src_bytes = _mm_loadl_epi64((const __m128i *) src);
        __m128i pred_bytes = _mm_loadl_epi64((const __m128i *) (dst + neg_offs));
        __m128i sum = _mm_add_epi8(src_bytes, pred_bytes);
        _mm_storel_epi64((__m128i *) dst, sum);
        dst += 8;
        src += 8;
        count -= 8;
    }

    // This is the loop in question
    while (count--)
    {
        *dst = *src + dst[neg_offs];
        dst++;
        src++;
    }
}
// ---- end
The bottom (tail) loop gets expanded into a giant monstrosity that attempts to process 64 bytes at once, with various special-case paths for tail processing, to handle cases where neg_offs > -64 (which means the obvious 64-elements-at-a-time loop would not work), etc.
The full code can be viewed at https://godbolt.org/g/yRThcs; I won't post it here. :)
All of which is completely pointless because the tail loop will (as is easy to see) only ever see count <= 7.
This is an extreme example, but I'm seeing this general pattern (a scalar tail loop for a manually vectorized loop getting pointlessly auto-vectorized) a lot.
mentioned in issue llvm/llvm-project#40224
Maybe instcombine could fold it to just AND?
On a perhaps related note, the very simple example (https://godbolt.org/g/MToVLm):
unsigned func(unsigned count)
{
    while (count >= 32)
        count -= 32;
    return count;
}
Currently produces something like (thanks to indvarsimplify):
define i32 @test(i32 %size) {
entry:
  %0 = xor i32 %size, -1
  %1 = icmp ugt i32 %0, -32
  %umax = select i1 %1, i32 %0, i32 -32
  %2 = add i32 %umax, %size
  %3 = add i32 %2, 32
  %4 = and i32 %3, -32
  %5 = sub i32 %size, %4
  ret i32 %5
}
Which is really just: https://rise4fun.com/Alive/lKL
define i32 @test(i32 %size) {
  %0 = and i32 %size, 31
  ret i32 %0
}
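As a quick sanity check outside the compiler (my addition, not part of the original report), the loop can be compared against the single AND by brute force over a handful of inputs, including wrap-around boundaries:

#include <cassert>
#include <cstdint>
#include <cstdio>

// The function from the report: repeatedly subtract 32 until count < 32,
// i.e. count modulo 32, which for a power-of-two divisor is count & 31.
static unsigned func(unsigned count) {
    while (count >= 32)
        count -= 32;
    return count;
}

int main() {
    const unsigned samples[] = {0u, 1u, 31u, 32u, 33u, 1000u,
                                0x7fffffffu, 0xffffffffu};
    for (unsigned c : samples)
        assert(func(c) == (c & 31u));
    std::puts("loop == (count & 31) on all samples");
}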
Current Codegen: https://godbolt.org/z/qsMqxM9rn
define i32 @func(i32 %count) {
entry:
  %0 = add i32 %count, 31
  %umin = call i32 @llvm.umin.i32(i32 %count, i32 31)
  %1 = sub i32 %0, %umin
  %2 = and i32 %1, -32
  %3 = sub i32 %count, %2
  ret i32 %3
}
declare i32 @llvm.umin.i32(i32, i32)
https://alive2.llvm.org/ce/z/L4ni72
----------------------------------------
define i32 @src(i32 %count) {
%entry:
  %0 = add i32 %count, 31
  %umin = umin i32 %count, 31
  %1 = sub i32 %0, %umin
  %2 = and i32 %1, 4294967264
  %3 = sub i32 %count, %2
  ret i32 %3
}
=>
define i32 @tgt(i32 %size) {
%0:
  %m = and i32 %size, 31
  ret i32 %m
}
Transformation seems to be correct!
I think we're just missing the folds:
----------------------------------------
define i4 @src(i4 %x, i4 %y) {
%0:
  %a = add i4 %x, %y
  %m = umin i4 %x, %y
  %s = sub i4 %a, %m
  ret i4 %s
}
=>
define i4 @tgt(i4 %x, i4 %y) {
%0:
  %m = umax i4 %x, %y
  ret i4 %m
}
Transformation seems to be correct!
and
----------------------------------------
define i4 @src(i4 %x, i4 %y) {
%0:
  %a = add i4 %x, %y
  %m = umax i4 %x, %y
  %s = sub i4 %a, %m
  ret i4 %s
}
=>
define i4 @tgt(i4 %x, i4 %y) {
%0:
  %m = umin i4 %x, %y
  ret i4 %m
}
Transformation seems to be correct!
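Since i4 has only 16*16 input pairs, both folds can also be verified exhaustively with a small standalone program (illustrative only, not LLVM code), using 4-bit wrap-around arithmetic to mirror the Alive2 proofs:

#include <algorithm>
#include <cassert>
#include <cstdio>

int main() {
    const unsigned mask = 0xF; // i4
    for (unsigned x = 0; x <= mask; ++x) {
        for (unsigned y = 0; y <= mask; ++y) {
            unsigned add = (x + y) & mask;
            // (x + y) - umin(x, y) == umax(x, y)
            assert(((add - std::min(x, y)) & mask) == std::max(x, y));
            // (x + y) - umax(x, y) == umin(x, y)
            assert(((add - std::max(x, y)) & mask) == std::min(x, y));
        }
    }
    std::puts("both folds hold for all i4 inputs");
}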
@nikic Any thoughts?
@RKSimon Sounds reasonable. I guess a generalization here could be that (z + y) - umin(x, y) is z + usub.sat(y, x), which happens to be umax(x, y) for the case where z == y. Of course, that transform chain could introduce more one-use restrictions.
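A purely illustrative exhaustive check of that generalization over i8 (my addition, not from the review thread), with usub.sat modeled as saturating unsigned subtraction and everything wrapping at 8 bits:

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdio>

// usub.sat(a, b): unsigned subtraction clamped at zero.
static uint8_t usub_sat(uint8_t a, uint8_t b) { return a > b ? uint8_t(a - b) : 0; }

int main() {
    for (unsigned x = 0; x < 256; ++x)
        for (unsigned y = 0; y < 256; ++y)
            for (unsigned z = 0; z < 256; ++z) {
                // (z + y) - umin(x, y)
                uint8_t lhs = uint8_t(uint8_t(z + y) - std::min<unsigned>(x, y));
                // z + usub.sat(y, x); with z == y this is umax(x, y)
                uint8_t rhs = uint8_t(z + usub_sat(uint8_t(y), uint8_t(x)));
                assert(lhs == rhs);
            }
    std::puts("(z + y) - umin(x, y) == z + usub.sat(y, x) for all i8 inputs");
}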
cheers, I'll create a patch
https://reviews.llvm.org/D123399
It looks like ValueTracking is missing the ability to recognise when a phi's incoming values are constrained to a particular value range by the icmp/branch that guards them:
entry:
  %cmp23 = icmp ugt i64 %count, 7
  br i1 %cmp23, label %while.body, label %while.cond5.preheader
while.body:
  ....
  %cmp = icmp ugt i64 %sub, 7
  br i1 %cmp, label %while.body, label %while.cond5.preheader
while.cond5.preheader:
  %count.addr.0.lcssa = phi i64 [ %count, %entry ], [ %sub, %while.body ] ; both %count and %sub should be !(ugt 7)
  ....
  %min.iters.check = icmp ult i64 %count.addr.0.lcssa, 16 ; should be always true
@nikic @rotateright Are you familiar with phi handling in ValueTracking KnownBits at all?
computeKnownBits takes the common known bits from incoming values, but it doesn't use dominating conditions AFAICT. This might be a job for CVP?
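To illustrate the merge behaviour being described, here is a standalone sketch (not LLVM's actual KnownBits implementation): only bits known in every incoming value survive the merge, and a dominating condition such as "icmp ult %x, 8" is never consulted, which is why the range information is lost.

#include <cstdint>
#include <cstdio>

struct Known {
    uint64_t zero; // bits known to be 0
    uint64_t one;  // bits known to be 1
};

// Keep only the knowledge common to both incoming values of the phi.
static Known intersect(Known a, Known b) {
    return {a.zero & b.zero, a.one & b.one};
}

int main() {
    // %count on the entry edge: nothing known about the value itself.
    Known fromEntry{0, 0};
    // %sub on the loop edge: the guarding branch proves %sub u< 8, but the
    // value itself carries no known bits, so nothing is recorded either.
    Known fromLoop{0, 0};
    Known phi = intersect(fromEntry, fromLoop);
    std::printf("known-zero mask of the phi: 0x%llx\n",
                (unsigned long long)phi.zero); // 0: high bits not proven zero
}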
Candidate Patch: https://reviews.llvm.org/D131838
This fixed the vectorization issue - but not the unrolling.
I should add that the ValueTracking phi handling is post-fixing the problem, not preventing it: LoopVectorize still runs and adds all the vector code, but later passes then realise it's all dead code and remove it again....
The last remaining issue is the scalar loop unrolling: https://godbolt.org/z/f373Tc7oM
ValueTracking has failed to realise that the %count.addr.0.lcssa value is known to be 0 < v < 8:
while.body6.preheader: ; preds = %while.cond5.preheader
  %7 = add i64 %count.addr.0.lcssa, -1
  .....
while.body6.prol.loopexit: ; preds = %while.body6.prol, %while.body6.preheader
  %dst.addr.132.unr = phi ptr [ %dst.addr.0.lcssa, %while.body6.preheader ], [ %incdec.ptr.prol, %while.body6.prol ]
  %src.addr.131.unr = phi ptr [ %src.addr.0.lcssa, %while.body6.preheader ], [ %incdec.ptr9.prol, %while.body6.prol ]
  %count.addr.130.unr = phi i64 [ %count.addr.0.lcssa, %while.body6.preheader ], [ %dec.prol, %while.body6.prol ]
  %10 = icmp ult i64 %7, 7
  br i1 %10, label %while.end10, label %while.body6
#57635 discusses folding the %10 pattern to:
  %10 = icmp ule i64 %count.addr.0.lcssa, 7
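For reference (my addition, not from the issue), a brute-force check that the two compares agree whenever the operand is non-zero; at zero the add wraps and they differ, so the fold presumably relies on the count being known non-zero at this point:

#include <cassert>
#include <cstdint>
#include <cstdio>

int main() {
    // Check over all non-zero 16-bit values; the same reasoning carries over to i64.
    for (uint32_t x = 1; x <= 0xFFFF; ++x) {
        bool lhs = uint16_t(x - 1) < 7; // icmp ult (add x, -1), 7
        bool rhs = uint16_t(x) <= 7;    // icmp ule x, 7
        assert(lhs == rhs);
    }
    // The one disagreement: x == 0 (the subtraction wraps to the maximum value).
    assert((uint16_t(0 - 1) < 7) == false && (uint16_t(0) <= 7) == true);
    std::puts("compares agree for all non-zero 16-bit values");
}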