android-riscv64

Investigate the current state of Auto-vectorization for RISC-V targets

Open appujee opened this issue 2 years ago • 18 comments

  • [x] Compiling TSVC benchmark would be a good way to find out if commonly found loop structures are getting vectorized.
  • [ ] Instruction scheduling of vectorized loops. If there is no instruction scheduling for RISC-V vectors, then we might have to create a separate task for this.
  • [x] Eliminate redundant vsetvl instructions. If we are not doing it already, a simple reaching definition analysis should accomplish this.

appujee avatar Feb 01 '23 17:02 appujee

There is no instruction scheduling for RISC-V vectors.

vsetvls that aren't explicitly from vsetvl/vsetvlmax intrinsics are inserted on demand. All instructions are created with extra operands holding their lmul, sew, and policy bits. This information is used to insert vsetvlis if they are different from what is available based on instructions or basic blocks preceding them. The code is in llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp.
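
As a rough illustration of the on-demand insertion described above, here is a minimal Python sketch for a single basic block. It is not the LLVM implementation: the `VType` tuple, the sample instruction list, and `insert_vsetvlis` are hypothetical names made up for this example, and the real pass tracks more state (such as the vector length) and works across basic blocks.

```python
from typing import List, NamedTuple, Optional, Tuple


class VType(NamedTuple):
    """Hypothetical stand-in for the extra operands each vector instruction carries."""
    sew: int             # element width in bits
    lmul: str            # register grouping, e.g. "m1" or "m2"
    tail_agnostic: bool  # tail policy bit
    mask_agnostic: bool  # mask policy bit


def insert_vsetvlis(block: List[Tuple[str, VType]],
                    incoming: Optional[VType] = None) -> List[str]:
    """Emit a vsetvli only where an instruction's demand differs from the current state."""
    out = []
    current = incoming  # None models "nothing known on entry to the block"
    for name, demanded in block:
        if demanded != current:
            out.append(f"vsetvli (sew={demanded.sew}, lmul={demanded.lmul})")
            current = demanded
        out.append(name)
    return out


e32m1 = VType(32, "m1", True, True)
e64m2 = VType(64, "m2", True, True)
block = [("vle32.v", e32m1), ("vadd.vv", e32m1), ("vse32.v", e32m1), ("vle64.v", e64m2)]
result = insert_vsetvlis(block)
for line in result:
    print(line)
```

In this toy model the first three instructions share one configuration, so only the first of them and the final `vle64.v` trigger an insertion.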

topperc avatar Feb 14 '23 20:02 topperc

For the

Compiling TSVC benchmark would be a good way to find out if commonly found loop structures are getting vectorized.

What are the success criteria? Is it about comparing the number of vectorizable loops (i.e. those that will be vectorized if heuristics are disabled) against AArch64 and X86?

nikolaypanchenko avatar Feb 14 '23 20:02 nikolaypanchenko

It would be great to get examples from Android where auto-vectorization for RISC-V Vectors (RVV) is not performing as expected, e.g. compared with X86 and AArch64 targets.

idbaev avatar Feb 14 '23 21:02 idbaev

What are the success criteria? Is it about comparing the number of vectorizable loops (i.e. those that will be vectorized if heuristics are disabled) against AArch64 and X86?

It is usually a good exercise to compare the number of vectorizable loops; it helps tune the vectorizer and find more opportunities for vectorization. We should do this comparative analysis at least w.r.t. AArch64.

Once this is done, we can do a comparative analysis on a larger code base like Android.

appujee avatar Feb 14 '23 22:02 appujee

There is no instruction scheduling for RISC-V vectors.

vsetvls that aren't explicitly from vsetvl/vsetvlmax intrinsics are inserted on demand. All instructions are created with extra operands holding their lmul, sew, and policy bits. This information is used to insert vsetvlis if they are different from what is available based on instructions or basic blocks preceding them. The code is in llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp.

I was thinking if inlining may bring up redundant vsetvlis. In functions with multiple loops this can also happen.

appujee avatar Feb 14 '23 22:02 appujee

There is no instruction scheduling for RISC-V vectors. vsetvls that aren't explicitly from vsetvl/vsetvlmax intrinsics are inserted on demand. All instructions are created with extra operands holding their lmul, sew, and policy bits. This information is used to insert vsetvlis if they are different from what is available based on instructions or basic blocks preceding them. The code is in llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp.

I was thinking if inlining may bring up redundant vsetvlis.

vsetvli intrinsics are allowed to CSE as of last week.

In functions with multiple loops this can also happen.

This depends on what style of vector loop you're writing. If you're using vsetvli inside the loop to avoid tail iterations, then you'll need vsetvlis inside each loop, so none are redundant. If you're using vsetvlmax and operating on whole registers in the loop, then yes, there could be a redundant one for each loop.

The current loop vectorizer operates on whole registers but doesn't use vsetvli intrinsics. The vsetvlis are all inserted by the insertion pass which runs just before machine IR leaves SSA form.
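
To make the two loop styles concrete, here is a small Python simulation of an element-wise add. It is only a sketch under stated assumptions: `add_strip_mined`, `add_whole_register`, and the `vlmax` parameter are hypothetical names, and `min(n - i, vlmax)` models just one permitted result of a vsetvli request.

```python
from typing import List, Tuple


def add_strip_mined(a: List[int], b: List[int], vlmax: int) -> Tuple[List[int], int]:
    """Style 1: request a new vector length every iteration, so no tail loop is needed."""
    out, i, n = [], 0, len(a)
    vsetvli_execs = 0
    while i < n:
        vl = min(n - i, vlmax)  # models vsetvli with AVL = remaining elements
        vsetvli_execs += 1
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out, vsetvli_execs


def add_whole_register(a: List[int], b: List[int], vlmax: int) -> Tuple[List[int], int]:
    """Style 2: set VL to the maximum once, process full vectors, finish with a scalar tail."""
    out, i, n = [], 0, len(a)
    vl = vlmax              # models a single vsetvlmax before the loop
    vsetvli_execs = 1
    while i + vl <= n:
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    for j in range(i, n):   # scalar tail iterations
        out.append(a[j] + b[j])
    return out, vsetvli_execs


a = list(range(10))
b = list(range(10, 20))
strip_result, strip_execs = add_strip_mined(a, b, vlmax=4)
whole_result, whole_execs = add_whole_register(a, b, vlmax=4)
print(strip_execs, whole_execs)
```

In the first style the vsetvli depends on the remaining element count, so it cannot be shared between loops. In the second style the setup is loop-invariant, which is why a function with several such loops could end up with one redundant setup per loop.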

topperc avatar Feb 14 '23 22:02 topperc

vsetvli intrinsics are allowed to CSE as of last week.

ah ok. this should be sufficient. thanks for clarifying.

appujee avatar Feb 15 '23 01:02 appujee

@appujee could you please provide the options to use for ARM to compile the TSVC benchmark? Do you have specific options for RISC-V?

nikolaypanchenko avatar Feb 16 '23 19:02 nikolaypanchenko

Try -mcpu=cortex-a55 for ARM.

For RISC-V, use rv64gcv; please share if you have a CPU flag that gives better vectorization.

appujee avatar Feb 16 '23 19:02 appujee

The number of loops vectorized as-is using upstream LLVM 5c1b8de77d1c:

| Arch | Number of vectorized loops |
|---|---|
| -march=rv64gcv | 1299 |
| -mcpu=cortex-a55 | 904 |

Obviously, the performance of vectorized loops is a different aspect, but it won't be easy to answer for RISC-V in general

nikolaypanchenko avatar Feb 17 '23 00:02 nikolaypanchenko

Nice! Is it possible to know how many loops we start with in both cases, e.g. something like a 'number of loops analyzed' count? It could be that inlining etc. resulted in a different number of loops to begin with.

appujee avatar Feb 17 '23 00:02 appujee

Updated: my original numbers didn't include loops from tsvc.c

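| Arch | #LoopsAnalyzed (default fp-model, strict for Clang) | #LoopsVectorized (default fp-model, strict for Clang) | #LoopsAnalyzed (fp-model=strict, same as default) | #LoopsVectorized (fp-model=strict, same as default) | #LoopsAnalyzed (fp-model=fast) | #LoopsVectorized (fp-model=fast) |
|---|---|---|---|---|---|---|
| -march=rv64gcv | 735 | 460 | 735 | 460 | 735 | 667 |
| -mcpu=cortex-a55 | 736 | 176 | 736 | 176 | 735 | 635 |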

Details:

  • with fp-model=strict

common.c

| Arch | #LoopsAnalyzed | #LoopsVectorized |
|---|---|---|
| -march=rv64gcv | 555 | 373 |
| -mcpu=cortex-a55 | 555 | 176 |

tsvc.c

| Arch | #LoopsAnalyzed | #LoopsVectorized |
|---|---|---|
| -march=rv64gcv | 180 | 87 |
| -mcpu=cortex-a55 | 181 | 0 |

  • with fp-model=fast

common.c

| Arch | #LoopsAnalyzed | #LoopsVectorized |
|---|---|---|
| -march=rv64gcv | 555 | 553 |
| -mcpu=cortex-a55 | 555 | 552 |

tsvc.c

| Arch | #LoopsAnalyzed | #LoopsVectorized |
|---|---|---|
| -march=rv64gcv | 180 | 114 |
| -mcpu=cortex-a55 | 180 | 83 |
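
For a quick comparison, the counts above can be turned into a vectorized share per target. The snippet below is only a convenience sketch: the dictionary layout and the helper name `vectorized_share` are made up here, and the numbers are copied from the tables in this comment.

```python
# (analyzed, vectorized) per file, copied from the tables above.
per_file = {
    ("rv64gcv", "strict"): {"common.c": (555, 373), "tsvc.c": (180, 87)},
    ("cortex-a55", "strict"): {"common.c": (555, 176), "tsvc.c": (181, 0)},
    ("rv64gcv", "fast"): {"common.c": (555, 553), "tsvc.c": (180, 114)},
    ("cortex-a55", "fast"): {"common.c": (555, 552), "tsvc.c": (180, 83)},
}


def vectorized_share(analyzed: int, vectorized: int) -> float:
    """Fraction of analyzed loops that were vectorized."""
    return vectorized / analyzed


totals = {}
for key, files in per_file.items():
    analyzed = sum(a for a, _ in files.values())
    vectorized = sum(v for _, v in files.values())
    totals[key] = (analyzed, vectorized)

for (arch, fp_model), (analyzed, vectorized) in totals.items():
    share = vectorized_share(analyzed, vectorized)
    print(f"{arch:12s} fp-model={fp_model:6s} {vectorized}/{analyzed} = {share:.1%}")
```

Summing the per-file rows also serves as a consistency check against the combined table.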

nikolaypanchenko avatar Feb 17 '23 16:02 nikolaypanchenko

That's very promising, as RISC-V is ahead. I've marked the first item as done. Thanks for helping with this.

appujee avatar Feb 17 '23 20:02 appujee

There is no instruction scheduling for RISC-V vectors.

vsetvli intrinsics are allowed to CSE as of last week.

As per: https://github.com/llvm/llvm-project/issues/58834 there is still room for removing redundant vsetvlis? cc: @topperc @nikolaypanchenko

appujee avatar May 07 '23 18:05 appujee

There is no instruction scheduling for RISC-V vectors.

vsetvli intrinsics are allowed to CSE as of last week.

As per: llvm/llvm-project#58834 there is still room for removing redundant vsetvlis? cc: @topperc @nikolaypanchenko

The code in that ticket does not look like how the current upstream vectorizer or the proposed VP intrinsic vectorizer from our downstream generate code so I don't think it is directly relevant to autovectorization.

topperc avatar May 07 '23 19:05 topperc

  • Eliminate redundant vsetvl instructions. If we are not doing it already, a simple reaching definition analysis should accomplish this.

I believe @appujee refers to the 3rd task:

Eliminate redundant vsetvl instructions. If we are not doing it already, a simple reaching definition analysis should accomplish this.

which may or may not be read as covering any vset*vli generated by codegen within a vectorized loop. @topperc do you know if anyone has started to look at that reported issue?

nikolaypanchenko avatar May 08 '23 19:05 nikolaypanchenko

I believe @appujee refers to the 3rd task:

Eliminate redundant vsetvl instructions. If we are not doing it already, a simple reaching definition analysis should accomplish this.

which may or may not be treated as any vset*vli generated by codegen within vectorized loop.

What we currently have doesn't remove any vsetvlis generated by explicit vsetvli intrinsics. We do a reaching-def-like analysis to insert additional vsetvlis wherever we think they are needed to satisfy the SEW, LMUL, tail policy, and mask policy needed for the vector load/store/arithmetic instructions.

@topperc do you know if anyone started to look at that reported issue ?

I don't think anyone has looked at the issue. We have a reaching definition analysis. We detect a mismatch because the preheader edge sees the vsetvli in the preheader and the backedge sees the vsetvli from the previous iteration.
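
The mismatch at the loop header can be pictured with a toy dataflow merge. This is a hedged sketch, not the logic in the LLVM pass: the state labels, `merge`, and `header_needs_vsetvli` are hypothetical, and each state is simply named after the vsetvli that last defined it.

```python
from typing import Optional


def merge(a: Optional[str], b: Optional[str]) -> Optional[str]:
    """Combine the exit states of two predecessors; None means unknown or conflicting."""
    return a if a == b else None


def header_needs_vsetvli(preheader_exit: Optional[str],
                         backedge_exit: Optional[str],
                         demanded: str) -> bool:
    """A vsetvli is needed at the loop header unless both edges agree with the demand."""
    available = merge(preheader_exit, backedge_exit)
    return available != demanded


# The preheader edge carries the state set by the preheader's vsetvli, while the
# backedge carries the state set by the vsetvli of the previous iteration.
mismatch = header_needs_vsetvli("vsetvli in preheader", "vsetvli in loop body",
                                demanded="vsetvli in preheader")
agree = header_needs_vsetvli("vsetvli in preheader", "vsetvli in preheader",
                             demanded="vsetvli in preheader")
print(mismatch, agree)
```

Because the two incoming edges are defined by different vsetvlis, the merged state is treated as unknown here, even if both would configure the same SEW and LMUL.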

topperc avatar May 08 '23 19:05 topperc

Pipeline cost model for vector instructions: D149495 posted by michaelmaitland. We can use -mcpu=sifive-x280 to try it out.

appujee avatar May 10 '23 16:05 appujee