llvm-project [LV, VP]VP intrinsics support for the Loop Vectorizer + adding new tail-folding mode using EVL.

This patch introduces generating VP intrinsics in the Loop Vectorizer.

Currently the Loop Vectorizer supports vector predication in a very limited capacity via tail-folding and masked load/store/gather/scatter intrinsics. However, this does not let architectures with active vector length predication support take advantage of their capabilities. Architectures with general masked predication support also can only take advantage of predication on memory operations. By having a way for the Loop Vectorizer to generate Vector Predication intrinsics, which (will) provide a target-independent way to model predicated vector instructions. These architectures can make better use of their predication capabilities.

Our first approach (implemented in this patch) builds on top of the existing tail-folding mechanism in the LV (just adds a new tail-folding mode using EVL), but instead of generating masked intrinsics for memory operations it generates VP intrinsics for loads/stores instructions. The patch adds a new VPlanTransforms to replace the wide header predicate compare with EVL and updates codegen for load/stores to use VP store/load with EVL.

Other important part of this approach is how the Explicit Vector Length is computed. (VP intrinsics define this vector length parameter as Explicit Vector Length (EVL)). We use an experimental intrinsic get_vector_length, that can be lowered to architecture specific instruction(s) to compute EVL.

Also, added a new recipe to emit instructions for computing EVL. Using VPlan in this way will eventually help build and compare VPlans corresponding to different strategies and alternatives.

Differential Revision: https://reviews.llvm.org/D99750

Dec 21 '23 18:12 alexey-bataev

@llvm/pr-subscribers-backend-powerpc @llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-llvm-analysis

Author: Alexey Bataev (alexey-bataev) Co-Authored-By: Vineet Kumar (vntkmr) Co-Authored-By: Roger Ferrer Ibáñez (rofirrim) Co-Authored-By: Simon Moll (simoll)

Changes

This patch introduces generating VP intrinsics in the Loop Vectorizer.

Currently the Loop Vectorizer supports vector predication in a very limited capacity via tail-folding and masked load/store/gather/scatter intrinsics. However, this does not let architectures with active vector length predication support take advantage of their capabilities. Architectures with general masked predication support also can only take advantage of predication on memory operations. By having a way for the Loop Vectorizer to generate Vector Predication intrinsics, which (will) provide a target-independent way to model predicated vector instructions, These architectures can make better use of their predication capabilities.

Our first approach (implemented in this patch) builds on top of the existing tail-folding mechanism in the LV, but instead of generating masked intrinsics for memory operations it generates VP intrinsics for loads/stores instructions.

Other important part of this approach is how the Explicit Vector Length is computed. (We use active vector length and explicit vector length interchangeably; VP intrinsics define this vector length parameter as Explicit Vector Length (EVL)). We consider the following three ways to compute the EVL parameter for the VP Intrinsics.

The simplest way is to use the VF as EVL and rely solely on the mask parameter to control predication. The mask parameter is the same as computed for current tail-folding implementation.
The second way is to insert instructions to compute min(VF, trip_count - index) for each vector iteration.
For architectures like RISC-V, which have special instruction to compute/set an explicit vector length, we also introduce an experimental intrinsic get_vector_length, that can be lowered to architecture specific instruction(s) to compute EVL.

Also, added a new recipe to emit instructions for computing EVL. Using VPlan in this way will eventually help build and compare VPlans corresponding to different strategies and alternatives.

===Tentative Development Roadmap===

Use vp-intrinsics for all possible vector operations. That work has 2 possible implementations:
1. Introduce a new pass which transforms emitted vector instructions to vp intrinsics if the the loop was transformed to use predication for loads/stores. The advantage of this approach is that it does not require many changes in the loop vectorizer itself. The disadvantage is that it may require to copy some existing functionality from the loop vectorizer in a separate patch, have similar code in the different passes and perform the same analysis 2 times, at least.
2. Extend Loop Vectorizer using VectorBuildor and make it emit vp intrinsics automatically in presence of EVL value. The advantage is that it does not require a separate pass, thus it may reduce compile time. Plus, we can avoid code duplication. It requires some extra work in the LoopVectorizer to add VectorBuilder support and smart vector instructions/vp intrinsics emission. Also, to fully support Loop Vectorizer it will require adding a new PHI recipe to handle EVL on the previous iteration + extending several existing recipes with the new operands (depends on the design).
Switch to vp-intrinsics for memory operations for VLS and VLA vectorizations.

Differential Revision: https://reviews.llvm.org/D99750

Patch is 101.79 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/76172.diff

24 Files Affected:

(modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+4-1)
(modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp (+4)
(modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h (+16)
(modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+151-8)
(modified) llvm/lib/Transforms/Vectorize/VPlan.h (+43)
(modified) llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp (+8-8)
(modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+66)
(modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+98-13)
(modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+7)
(modified) llvm/lib/Transforms/Vectorize/VPlanValue.h (+1)
(modified) llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp (+51)
(modified) llvm/test/Transforms/LoopVectorize/RISCV/inloop-reduction.ll (+65-1)
(added) llvm/test/Transforms/LoopVectorize/RISCV/vectorize-vp-intrinsics.ll (+142)
(added) llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics.ll (+125)
(added) llvm/test/Transforms/LoopVectorize/X86/vectorize-vp-intrinsics.ll (+127)
(added) llvm/test/Transforms/LoopVectorize/X86/vplan-vp-intrinsics.ll (+83)
(added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics-gather-scatter.ll (+64)
(added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics-interleave.ll (+169)
(added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics-iv32.ll (+84)
(added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics-masked-loadstore.ll (+81)
(added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics-no-masking.ll (+46)
(added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics-reverse-load-store.ll (+64)
(added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics.ll (+97)
(added) llvm/test/Transforms/LoopVectorize/vplan-vp-intrinsics.ll (+36)

diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 735be3680aea0d..e2a127ff35be26 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -190,7 +190,10 @@ enum class TailFoldingStyle {
   /// Use predicate to control both data and control flow, but modify
   /// the trip count so that a runtime overflow check can be avoided
   /// and such that the scalar epilogue loop can always be removed.
-  DataAndControlFlowWithoutRuntimeCheck
+  DataAndControlFlowWithoutRuntimeCheck,
+  /// Use predicated EVL instructions for tail-folding.
+  /// Indicates that VP intrinsics should be used if tail-folding is enabled.
+  DataWithEVL,
 };
 
 struct TailFoldingInfo {
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index 4614446b2150b7..1a9abaea811159 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -169,6 +169,10 @@ RISCVTTIImpl::getIntImmCostIntrin(Intrinsic::ID IID, unsigned Idx,
   return TTI::TCC_Free;
 }
 
+bool RISCVTTIImpl::hasActiveVectorLength(unsigned, Type *DataTy, Align) const {
+  return ST->hasVInstructions();
+}
+
 TargetTransformInfo::PopcntSupportKind
 RISCVTTIImpl::getPopcntSupport(unsigned TyWidth) {
   assert(isPowerOf2_32(TyWidth) && "Ty width must be power of 2");
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
index 96ecc771863e56..d2592be75000de 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -72,6 +72,22 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
                                       const APInt &Imm, Type *Ty,
                                       TTI::TargetCostKind CostKind);
 
+  /// \name Vector Predication Information
+  /// Whether the target supports the %evl parameter of VP intrinsic efficiently
+  /// in hardware, for the given opcode and type/alignment. (see LLVM Language
+  /// Reference - "Vector Predication Intrinsics",
+  /// https://llvm.org/docs/LangRef.html#vector-predication-intrinsics and
+  /// "IR-level VP intrinsics",
+  /// https://llvm.org/docs/Proposals/VectorPredication.html#ir-level-vp-intrinsics).
+  /// \param Opcode the opcode of the instruction checked for predicated version
+  /// support.
+  /// \param DataType the type of the instruction with the \p Opcode checked for
+  /// prediction support.
+  /// \param Alignment the alignment for memory access operation checked for
+  /// predicated version support.
+  bool hasActiveVectorLength(unsigned Opcode, Type *DataType,
+                             Align Alignment) const;
+
   TargetTransformInfo::PopcntSupportKind getPopcntSupport(unsigned TyWidth);
 
   bool shouldExpandReduction(const IntrinsicInst *II) const;
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index f82e161fb846d1..7b0e268877ded3 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -123,6 +123,7 @@
 #include "llvm/IR/User.h"
 #include "llvm/IR/Value.h"
 #include "llvm/IR/ValueHandle.h"
+#include "llvm/IR/VectorBuilder.h"
 #include "llvm/IR/Verifier.h"
 #include "llvm/Support/Casting.h"
 #include "llvm/Support/CommandLine.h"
@@ -247,10 +248,12 @@ static cl::opt<TailFoldingStyle> ForceTailFoldingStyle(
         clEnumValN(TailFoldingStyle::DataAndControlFlow, "data-and-control",
                    "Create lane mask using active.lane.mask intrinsic, and use "
                    "it for both data and control flow"),
-        clEnumValN(
-            TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck,
-            "data-and-control-without-rt-check",
-            "Similar to data-and-control, but remove the runtime check")));
+        clEnumValN(TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck,
+                   "data-and-control-without-rt-check",
+                   "Similar to data-and-control, but remove the runtime check"),
+        clEnumValN(TailFoldingStyle::DataWithEVL, "data-with-evl",
+                   "Use predicated EVL instructions for tail folding if the "
+                   "target supports vector length predication")));
 
 static cl::opt<bool> MaximizeBandwidth(
     "vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,
@@ -1106,8 +1109,7 @@ void InnerLoopVectorizer::collectPoisonGeneratingRecipes(
       if (isa<VPWidenMemoryInstructionRecipe>(CurRec) ||
           isa<VPInterleaveRecipe>(CurRec) ||
           isa<VPScalarIVStepsRecipe>(CurRec) ||
-          isa<VPCanonicalIVPHIRecipe>(CurRec) ||
-          isa<VPActiveLaneMaskPHIRecipe>(CurRec))
+          isa<VPHeaderPHIRecipe>(CurRec))
         continue;
 
       // This recipe contributes to the address computation of a widen
@@ -1655,6 +1657,23 @@ class LoopVectorizationCostModel {
     return foldTailByMasking() || Legal->blockNeedsPredication(BB);
   }
 
+  /// Returns true if VP intrinsics with explicit vector length support should
+  /// be generated in the tail folded loop.
+  bool useVPIWithVPEVLVectorization() const {
+    return PreferEVL && !EnableVPlanNativePath &&
+           getTailFoldingStyle() == TailFoldingStyle::DataWithEVL &&
+           // FIXME: implement support for max safe dependency distance.
+           Legal->isSafeForAnyVectorWidth() &&
+           // FIXME: remove this once reductions are supported.
+           Legal->getReductionVars().empty() &&
+           // FIXME: remove this once vp_reverse is supported.
+           none_of(
+               WideningDecisions,
+               [](const std::pair<std::pair<Instruction *, ElementCount>,
+                                  std::pair<InstWidening, InstructionCost>>
+                      &Data) { return Data.second.first == CM_Widen_Reverse; });
+  }
+
   /// Returns true if the Phi is part of an inloop reduction.
   bool isInLoopReduction(PHINode *Phi) const {
     return InLoopReductions.contains(Phi);
@@ -1800,6 +1819,10 @@ class LoopVectorizationCostModel {
   /// All blocks of loop are to be masked to fold tail of scalar iterations.
   bool CanFoldTailByMasking = false;
 
+  /// Control whether to generate VP intrinsics with explicit-vector-length
+  /// support in vectorized code.
+  bool PreferEVL = false;
+
   /// A map holding scalar costs for different vectorization factors. The
   /// presence of a cost for an instruction in the mapping indicates that the
   /// instruction will be scalarized when vectorizing with the associated
@@ -4883,6 +4906,39 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
   // FIXME: look for a smaller MaxVF that does divide TC rather than masking.
   if (Legal->prepareToFoldTailByMasking()) {
     CanFoldTailByMasking = true;
+    if (getTailFoldingStyle() == TailFoldingStyle::None)
+      return MaxFactors;
+
+    if (UserIC > 1) {
+      LLVM_DEBUG(dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+                           "not generate VP intrinsics since interleave count "
+                           "specified is greater than 1.\n");
+      return MaxFactors;
+    }
+
+    if (MaxFactors.ScalableVF.isVector()) {
+      assert(MaxFactors.ScalableVF.isScalable() &&
+             "Expected scalable vector factor.");
+      // FIXME: use actual opcode/data type for analysis here.
+      PreferEVL = getTailFoldingStyle() == TailFoldingStyle::DataWithEVL &&
+                  TTI.hasActiveVectorLength(0, nullptr, Align());
+#if !NDEBUG
+      if (getTailFoldingStyle() == TailFoldingStyle::DataWithEVL) {
+        if (PreferEVL)
+          dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+                    "try to generate VP Intrinsics.\n";
+        else
+          dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+                    "not try to generate VP Intrinsics since the target "
+                    "does not support vector length predication.\n";
+      }
+#endif // !NDEBUG
+
+      // Tail folded loop using VP intrinsics restricts the VF to be scalable.
+      if (PreferEVL)
+        MaxFactors.FixedVF = ElementCount::getFixed(1);
+    }
+
     return MaxFactors;
   }
 
@@ -5493,6 +5549,10 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
   if (!isScalarEpilogueAllowed())
     return 1;
 
+  // Do not interleave if EVL is preferred and no User IC is specified.
+  if (useVPIWithVPEVLVectorization())
+    return 1;
+
   // We used the distance for the interleave count.
   if (!Legal->isSafeForAnyVectorWidth())
     return 1;
@@ -8622,6 +8682,8 @@ void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
         VPlanTransforms::truncateToMinimalBitwidths(
             *Plan, CM.getMinimalBitwidths(), PSE.getSE()->getContext());
       VPlanTransforms::optimize(*Plan, *PSE.getSE());
+      if (CM.useVPIWithVPEVLVectorization())
+        VPlanTransforms::addExplicitVectorLength(*Plan);
       assert(VPlanVerifier::verifyPlanIsValid(*Plan) && "VPlan is invalid");
       VPlans.push_back(std::move(Plan));
     }
@@ -9454,6 +9516,52 @@ void VPReplicateRecipe::execute(VPTransformState &State) {
       State.ILV->scalarizeInstruction(UI, this, VPIteration(Part, Lane), State);
 }
 
+/// Creates either vp_store or vp_scatter intrinsics calls to represent
+/// predicated store/scatter.
+static Instruction *
+lowerStoreUsingVectorIntrinsics(IRBuilderBase &Builder, Value *Addr,
+                                Value *StoredVal, bool IsScatter, Value *Mask,
+                                Value *EVLPart, const Align &Alignment) {
+  CallInst *Call;
+  if (IsScatter) {
+    Call = Builder.CreateIntrinsic(Type::getVoidTy(EVLPart->getContext()),
+                                   Intrinsic::vp_scatter,
+                                   {StoredVal, Addr, Mask, EVLPart});
+  } else {
+    VectorBuilder VBuilder(Builder);
+    VBuilder.setEVL(EVLPart).setMask(Mask);
+    Call = cast<CallInst>(VBuilder.createVectorInstruction(
+        Instruction::Store, Type::getVoidTy(EVLPart->getContext()),
+        {StoredVal, Addr}));
+  }
+  Call->addParamAttr(
+      1, Attribute::getWithAlignment(Call->getContext(), Alignment));
+  return Call;
+}
+
+/// Creates either vp_load or vp_gather intrinsics calls to represent
+/// predicated load/gather.
+static Instruction *lowerLoadUsingVectorIntrinsics(IRBuilderBase &Builder,
+                                                   VectorType *DataTy,
+                                                   Value *Addr, bool IsGather,
+                                                   Value *Mask, Value *EVLPart,
+                                                   const Align &Alignment) {
+  CallInst *Call;
+  if (IsGather) {
+    Call = Builder.CreateIntrinsic(DataTy, Intrinsic::vp_gather,
+                                   {Addr, Mask, EVLPart}, nullptr,
+                                   "wide.masked.gather");
+  } else {
+    VectorBuilder VBuilder(Builder);
+    VBuilder.setEVL(EVLPart).setMask(Mask);
+    Call = cast<CallInst>(VBuilder.createVectorInstruction(
+        Instruction::Load, DataTy, Addr, "vp.op.load"));
+  }
+  Call->addParamAttr(
+      0, Attribute::getWithAlignment(Call->getContext(), Alignment));
+  return Call;
+}
+
 void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
   VPValue *StoredValue = isStore() ? getStoredValue() : nullptr;
 
@@ -9523,6 +9631,12 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
     return PartPtr;
   };
 
+  auto MaskValue = [&](unsigned Part) -> Value * {
+    if (isMaskRequired)
+      return BlockInMaskParts[Part];
+    return nullptr;
+  };
+
   // Handle Stores:
   if (SI) {
     State.setDebugLocFrom(SI->getDebugLoc());
@@ -9530,7 +9644,22 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
     for (unsigned Part = 0; Part < State.UF; ++Part) {
       Instruction *NewSI = nullptr;
       Value *StoredVal = State.get(StoredValue, Part);
-      if (CreateGatherScatter) {
+      if (State.EVL) {
+        Value *EVLPart = State.get(State.EVL, Part);
+        // If EVL is not nullptr, then EVL must be a valid value set during plan
+        // creation, possibly default value = whole vector register length. EVL
+        // is created only if TTI prefers predicated vectorization, thus if EVL
+        // is not nullptr it also implies preference for predicated
+        // vectorization.
+        // FIXME: Support reverse store after vp_reverse is added.
+        NewSI = lowerStoreUsingVectorIntrinsics(
+            Builder,
+            CreateGatherScatter
+                ? State.get(getAddr(), Part)
+                : CreateVecPtr(Part, State.get(getAddr(), VPIteration(0, 0))),
+            StoredVal, CreateGatherScatter, MaskValue(Part), EVLPart,
+            Alignment);
+      } else if (CreateGatherScatter) {
         Value *MaskPart = isMaskRequired ? BlockInMaskParts[Part] : nullptr;
         Value *VectorGep = State.get(getAddr(), Part);
         NewSI = Builder.CreateMaskedScatter(StoredVal, VectorGep, Alignment,
@@ -9561,7 +9690,21 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
   State.setDebugLocFrom(LI->getDebugLoc());
   for (unsigned Part = 0; Part < State.UF; ++Part) {
     Value *NewLI;
-    if (CreateGatherScatter) {
+    if (State.EVL) {
+      Value *EVLPart = State.get(State.EVL, Part);
+      // If EVL is not nullptr, then EVL must be a valid value set during plan
+      // creation, possibly default value = whole vector register length. EVL
+      // is created only if TTI prefers predicated vectorization, thus if EVL
+      // is not nullptr it also implies preference for predicated
+      // vectorization.
+      // FIXME: Support reverse loading after vp_reverse is added.
+      NewLI = lowerLoadUsingVectorIntrinsics(
+          Builder, DataTy,
+          CreateGatherScatter
+              ? State.get(getAddr(), Part)
+              : CreateVecPtr(Part, State.get(getAddr(), VPIteration(0, 0))),
+          CreateGatherScatter, MaskValue(Part), EVLPart, Alignment);
+    } else if (CreateGatherScatter) {
       Value *MaskPart = isMaskRequired ? BlockInMaskParts[Part] : nullptr;
       Value *VectorGep = State.get(getAddr(), Part);
       NewLI = Builder.CreateMaskedGather(DataTy, VectorGep, Alignment, MaskPart,
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 94cb7688981361..0ca668abbe60c7 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -242,6 +242,12 @@ struct VPTransformState {
   ElementCount VF;
   unsigned UF;
 
+  /// If EVL is not nullptr, then EVL must be a valid value set during plan
+  /// creation, possibly a default value = whole vector register length. EVL is
+  /// created only if TTI prefers predicated vectorization, thus if EVL is
+  /// not nullptr it also implies preference for predicated vectorization.
+  VPValue *EVL = nullptr;
+
   /// Hold the indices to generate specific scalar instructions. Null indicates
   /// that all instances are to be generated, using either scalar or vector
   /// instructions.
@@ -1057,6 +1063,8 @@ class VPInstruction : public VPRecipeWithIRFlags, public VPValue {
     SLPLoad,
     SLPStore,
     ActiveLaneMask,
+    ExplicitVectorLength,
+    ExplicitVectorLengthIVIncrement,
     CalculateTripCountMinusVF,
     // Increment the canonical IV separately for each unrolled part.
     CanonicalIVIncrementForPart,
@@ -1165,6 +1173,8 @@ class VPInstruction : public VPRecipeWithIRFlags, public VPValue {
     default:
       return false;
     case VPInstruction::ActiveLaneMask:
+    case VPInstruction::ExplicitVectorLength:
+    case VPInstruction::ExplicitVectorLengthIVIncrement:
     case VPInstruction::CalculateTripCountMinusVF:
     case VPInstruction::CanonicalIVIncrementForPart:
     case VPInstruction::BranchOnCount:
@@ -2180,6 +2190,39 @@ class VPActiveLaneMaskPHIRecipe : public VPHeaderPHIRecipe {
 #endif
 };
 
+/// A recipe for generating the phi node for the current index of elements,
+/// adjusted in accordance with EVL value. It starts at StartIV value and gets
+/// incremented by EVL in each iteration of the vector loop.
+class VPEVLBasedIVPHIRecipe : public VPHeaderPHIRecipe {
+public:
+  VPEVLBasedIVPHIRecipe(VPValue *StartMask, DebugLoc DL)
+      : VPHeaderPHIRecipe(VPDef::VPEVLBasedIVPHISC, nullptr, StartMask, DL) {}
+
+  ~VPEVLBasedIVPHIRecipe() override = default;
+
+  VP_CLASSOF_IMPL(VPDef::VPEVLBasedIVPHISC)
+
+  static inline bool classof(const VPHeaderPHIRecipe *D) {
+    return D->getVPDefID() == VPDef::VPEVLBasedIVPHISC;
+  }
+
+  /// Generate phi for handling IV based on EVL over iterations correctly.
+  void execute(VPTransformState &State) override;
+
+  /// Returns true if the recipe only uses the first lane of operand \p Op.
+  bool onlyFirstLaneUsed(const VPValue *Op) const override {
+    assert(is_contained(operands(), Op) &&
+           "Op must be an operand of the recipe");
+    return true;
+  }
+
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+  /// Print the recipe.
+  void print(raw_ostream &O, const Twine &Indent,
+             VPSlotTracker &SlotTracker) const override;
+#endif
+};
+
 /// A Recipe for widening the canonical induction variable of the vector loop.
 class VPWidenCanonicalIVRecipe : public VPRecipeBase, public VPValue {
 public:
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index 97a8a1803bbf5a..b8ed256d236a4b 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -207,14 +207,14 @@ Type *VPTypeAnalysis::inferScalarType(const VPValue *V) {
   Type *ResultTy =
       TypeSwitch<const VPRecipeBase *, Type *>(V->getDefiningRecipe())
           .Case<VPCanonicalIVPHIRecipe, VPFirstOrderRecurrencePHIRecipe,
-                VPReductionPHIRecipe, VPWidenPointerInductionRecipe>(
-              [this](const auto *R) {
-                // Handle header phi recipes, except VPWienIntOrFpInduction
-                // which needs special handling due it being possibly truncated.
-                // TODO: consider inferring/caching type of siblings, e.g.,
-                // backedge value, here and in cases below.
-                return inferScalarType(R->getStartValue());
-              })
+                VPReductionPHIRecipe, VPWidenPointerInductionRecipe,
+                VPEVLBasedIVPHIRecipe>([this](const auto *R) {
+            // Handle header phi recipes, except VPWienIntOrFpInduction
+            // which needs special handling due it being possibly truncated.
+            // TODO: consider inferring/caching type of siblings, e.g.,
+            // backedge value, here and in cases below.
+            return inferScalarType(R->getStartValue());
+          })
           .Case<VPWidenIntOrFpInductionRecipe, VPDerivedIVRecipe>(
               [](const auto *R) { return R->getScalarType(); })
           .Case<VPPredInstPHIRecipe, VPWidenPHIRecipe, VPScalarIVStepsRecipe,
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 02e400d590bed4..5e0344a14df5da 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -345,6 +345,44 @@ Value *VPInstruction::generateInstruction(VPTransformState &State,
     Value *Zero = ConstantInt::get(ScalarTC->getType(), 0);
     return Builder.CreateSelect(Cmp, Sub, Zero);
   }
+  case VPInstruction::ExplicitVectorLength: {
+    // Compute EVL
+    auto GetSetVL = [=](VPTransformState &State, Value *EVL) {
+      assert(EVL->getType()->isIntegerTy() &&
+             "Requested vector length should be an integer.");
+
+      // TODO: Add support for MaxSafeDist for correct loop emission.
+      Value *VFArg = State.Builder.getInt32(State.VF.getKnownMinValue());
+
+      Value *GVL = State.Builder.CreateIntrinsic(
+          State.Builder.getInt32Ty(), Intrinsic::experimental_get_vector_length,
+          {EVL, VFArg, State.Builder.getTrue()});
+      return GVL;
+    };
+    // TODO: Restructur...
[truncated]

Dec 21 '23 18:12 llvmbot

@llvm/pr-subscribers-backend-risc-v

Author: Alexey Bataev (alexey-bataev) Co-Authored-By: Vineet Kumar (vntkmr) Co-Authored-By: Roger Ferrer Ibáñez (rofirrim) Co-Authored-By: Simon Moll (simoll)

Changes

This patch introduces generating VP intrinsics in the Loop Vectorizer.

Currently the Loop Vectorizer supports vector predication in a very limited capacity via tail-folding and masked load/store/gather/scatter intrinsics. However, this does not let architectures with active vector length predication support take advantage of their capabilities. Architectures with general masked predication support also can only take advantage of predication on memory operations. By having a way for the Loop Vectorizer to generate Vector Predication intrinsics, which (will) provide a target-independent way to model predicated vector instructions, These architectures can make better use of their predication capabilities.

Our first approach (implemented in this patch) builds on top of the existing tail-folding mechanism in the LV, but instead of generating masked intrinsics for memory operations it generates VP intrinsics for loads/stores instructions.

Other important part of this approach is how the Explicit Vector Length is computed. (We use active vector length and explicit vector length interchangeably; VP intrinsics define this vector length parameter as Explicit Vector Length (EVL)). We consider the following three ways to compute the EVL parameter for the VP Intrinsics.

The simplest way is to use the VF as EVL and rely solely on the mask parameter to control predication. The mask parameter is the same as computed for current tail-folding implementation.
The second way is to insert instructions to compute min(VF, trip_count - index) for each vector iteration.
For architectures like RISC-V, which have special instruction to compute/set an explicit vector length, we also introduce an experimental intrinsic get_vector_length, that can be lowered to architecture specific instruction(s) to compute EVL.

Also, added a new recipe to emit instructions for computing EVL. Using VPlan in this way will eventually help build and compare VPlans corresponding to different strategies and alternatives.

===Tentative Development Roadmap===

Use vp-intrinsics for all possible vector operations. That work has 2 possible implementations:
1. Introduce a new pass which transforms emitted vector instructions to vp intrinsics if the the loop was transformed to use predication for loads/stores. The advantage of this approach is that it does not require many changes in the loop vectorizer itself. The disadvantage is that it may require to copy some existing functionality from the loop vectorizer in a separate patch, have similar code in the different passes and perform the same analysis 2 times, at least.
2. Extend Loop Vectorizer using VectorBuildor and make it emit vp intrinsics automatically in presence of EVL value. The advantage is that it does not require a separate pass, thus it may reduce compile time. Plus, we can avoid code duplication. It requires some extra work in the LoopVectorizer to add VectorBuilder support and smart vector instructions/vp intrinsics emission. Also, to fully support Loop Vectorizer it will require adding a new PHI recipe to handle EVL on the previous iteration + extending several existing recipes with the new operands (depends on the design).
Switch to vp-intrinsics for memory operations for VLS and VLA vectorizations.

Differential Revision: https://reviews.llvm.org/D99750

Patch is 101.79 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/76172.diff

24 Files Affected:

(modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+4-1)
(modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp (+4)
(modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h (+16)
(modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+151-8)
(modified) llvm/lib/Transforms/Vectorize/VPlan.h (+43)
(modified) llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp (+8-8)
(modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+66)
(modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+98-13)
(modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+7)
(modified) llvm/lib/Transforms/Vectorize/VPlanValue.h (+1)
(modified) llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp (+51)
(modified) llvm/test/Transforms/LoopVectorize/RISCV/inloop-reduction.ll (+65-1)
(added) llvm/test/Transforms/LoopVectorize/RISCV/vectorize-vp-intrinsics.ll (+142)
(added) llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics.ll (+125)
(added) llvm/test/Transforms/LoopVectorize/X86/vectorize-vp-intrinsics.ll (+127)
(added) llvm/test/Transforms/LoopVectorize/X86/vplan-vp-intrinsics.ll (+83)
(added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics-gather-scatter.ll (+64)
(added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics-interleave.ll (+169)
(added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics-iv32.ll (+84)
(added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics-masked-loadstore.ll (+81)
(added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics-no-masking.ll (+46)
(added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics-reverse-load-store.ll (+64)
(added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics.ll (+97)
(added) llvm/test/Transforms/LoopVectorize/vplan-vp-intrinsics.ll (+36)

diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 735be3680aea0d..e2a127ff35be26 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -190,7 +190,10 @@ enum class TailFoldingStyle {
   /// Use predicate to control both data and control flow, but modify
   /// the trip count so that a runtime overflow check can be avoided
   /// and such that the scalar epilogue loop can always be removed.
-  DataAndControlFlowWithoutRuntimeCheck
+  DataAndControlFlowWithoutRuntimeCheck,
+  /// Use predicated EVL instructions for tail-folding.
+  /// Indicates that VP intrinsics should be used if tail-folding is enabled.
+  DataWithEVL,
 };
 
 struct TailFoldingInfo {
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index 4614446b2150b7..1a9abaea811159 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -169,6 +169,10 @@ RISCVTTIImpl::getIntImmCostIntrin(Intrinsic::ID IID, unsigned Idx,
   return TTI::TCC_Free;
 }
 
+bool RISCVTTIImpl::hasActiveVectorLength(unsigned, Type *DataTy, Align) const {
+  return ST->hasVInstructions();
+}
+
 TargetTransformInfo::PopcntSupportKind
 RISCVTTIImpl::getPopcntSupport(unsigned TyWidth) {
   assert(isPowerOf2_32(TyWidth) && "Ty width must be power of 2");
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
index 96ecc771863e56..d2592be75000de 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -72,6 +72,22 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
                                       const APInt &Imm, Type *Ty,
                                       TTI::TargetCostKind CostKind);
 
+  /// \name Vector Predication Information
+  /// Whether the target supports the %evl parameter of VP intrinsic efficiently
+  /// in hardware, for the given opcode and type/alignment. (see LLVM Language
+  /// Reference - "Vector Predication Intrinsics",
+  /// https://llvm.org/docs/LangRef.html#vector-predication-intrinsics and
+  /// "IR-level VP intrinsics",
+  /// https://llvm.org/docs/Proposals/VectorPredication.html#ir-level-vp-intrinsics).
+  /// \param Opcode the opcode of the instruction checked for predicated version
+  /// support.
+  /// \param DataType the type of the instruction with the \p Opcode checked for
+  /// prediction support.
+  /// \param Alignment the alignment for memory access operation checked for
+  /// predicated version support.
+  bool hasActiveVectorLength(unsigned Opcode, Type *DataType,
+                             Align Alignment) const;
+
   TargetTransformInfo::PopcntSupportKind getPopcntSupport(unsigned TyWidth);
 
   bool shouldExpandReduction(const IntrinsicInst *II) const;
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index f82e161fb846d1..7b0e268877ded3 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -123,6 +123,7 @@
 #include "llvm/IR/User.h"
 #include "llvm/IR/Value.h"
 #include "llvm/IR/ValueHandle.h"
+#include "llvm/IR/VectorBuilder.h"
 #include "llvm/IR/Verifier.h"
 #include "llvm/Support/Casting.h"
 #include "llvm/Support/CommandLine.h"
@@ -247,10 +248,12 @@ static cl::opt<TailFoldingStyle> ForceTailFoldingStyle(
         clEnumValN(TailFoldingStyle::DataAndControlFlow, "data-and-control",
                    "Create lane mask using active.lane.mask intrinsic, and use "
                    "it for both data and control flow"),
-        clEnumValN(
-            TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck,
-            "data-and-control-without-rt-check",
-            "Similar to data-and-control, but remove the runtime check")));
+        clEnumValN(TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck,
+                   "data-and-control-without-rt-check",
+                   "Similar to data-and-control, but remove the runtime check"),
+        clEnumValN(TailFoldingStyle::DataWithEVL, "data-with-evl",
+                   "Use predicated EVL instructions for tail folding if the "
+                   "target supports vector length predication")));
 
 static cl::opt<bool> MaximizeBandwidth(
     "vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,
@@ -1106,8 +1109,7 @@ void InnerLoopVectorizer::collectPoisonGeneratingRecipes(
       if (isa<VPWidenMemoryInstructionRecipe>(CurRec) ||
           isa<VPInterleaveRecipe>(CurRec) ||
           isa<VPScalarIVStepsRecipe>(CurRec) ||
-          isa<VPCanonicalIVPHIRecipe>(CurRec) ||
-          isa<VPActiveLaneMaskPHIRecipe>(CurRec))
+          isa<VPHeaderPHIRecipe>(CurRec))
         continue;
 
       // This recipe contributes to the address computation of a widen
@@ -1655,6 +1657,23 @@ class LoopVectorizationCostModel {
     return foldTailByMasking() || Legal->blockNeedsPredication(BB);
   }
 
+  /// Returns true if VP intrinsics with explicit vector length support should
+  /// be generated in the tail folded loop.
+  bool useVPIWithVPEVLVectorization() const {
+    return PreferEVL && !EnableVPlanNativePath &&
+           getTailFoldingStyle() == TailFoldingStyle::DataWithEVL &&
+           // FIXME: implement support for max safe dependency distance.
+           Legal->isSafeForAnyVectorWidth() &&
+           // FIXME: remove this once reductions are supported.
+           Legal->getReductionVars().empty() &&
+           // FIXME: remove this once vp_reverse is supported.
+           none_of(
+               WideningDecisions,
+               [](const std::pair<std::pair<Instruction *, ElementCount>,
+                                  std::pair<InstWidening, InstructionCost>>
+                      &Data) { return Data.second.first == CM_Widen_Reverse; });
+  }
+
   /// Returns true if the Phi is part of an inloop reduction.
   bool isInLoopReduction(PHINode *Phi) const {
     return InLoopReductions.contains(Phi);
@@ -1800,6 +1819,10 @@ class LoopVectorizationCostModel {
   /// All blocks of loop are to be masked to fold tail of scalar iterations.
   bool CanFoldTailByMasking = false;
 
+  /// Control whether to generate VP intrinsics with explicit-vector-length
+  /// support in vectorized code.
+  bool PreferEVL = false;
+
   /// A map holding scalar costs for different vectorization factors. The
   /// presence of a cost for an instruction in the mapping indicates that the
   /// instruction will be scalarized when vectorizing with the associated
@@ -4883,6 +4906,39 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
   // FIXME: look for a smaller MaxVF that does divide TC rather than masking.
   if (Legal->prepareToFoldTailByMasking()) {
     CanFoldTailByMasking = true;
+    if (getTailFoldingStyle() == TailFoldingStyle::None)
+      return MaxFactors;
+
+    if (UserIC > 1) {
+      LLVM_DEBUG(dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+                           "not generate VP intrinsics since interleave count "
+                           "specified is greater than 1.\n");
+      return MaxFactors;
+    }
+
+    if (MaxFactors.ScalableVF.isVector()) {
+      assert(MaxFactors.ScalableVF.isScalable() &&
+             "Expected scalable vector factor.");
+      // FIXME: use actual opcode/data type for analysis here.
+      PreferEVL = getTailFoldingStyle() == TailFoldingStyle::DataWithEVL &&
+                  TTI.hasActiveVectorLength(0, nullptr, Align());
+#if !NDEBUG
+      if (getTailFoldingStyle() == TailFoldingStyle::DataWithEVL) {
+        if (PreferEVL)
+          dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+                    "try to generate VP Intrinsics.\n";
+        else
+          dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+                    "not try to generate VP Intrinsics since the target "
+                    "does not support vector length predication.\n";
+      }
+#endif // !NDEBUG
+
+      // Tail folded loop using VP intrinsics restricts the VF to be scalable.
+      if (PreferEVL)
+        MaxFactors.FixedVF = ElementCount::getFixed(1);
+    }
+
     return MaxFactors;
   }
 
@@ -5493,6 +5549,10 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
   if (!isScalarEpilogueAllowed())
     return 1;
 
+  // Do not interleave if EVL is preferred and no User IC is specified.
+  if (useVPIWithVPEVLVectorization())
+    return 1;
+
   // We used the distance for the interleave count.
   if (!Legal->isSafeForAnyVectorWidth())
     return 1;
@@ -8622,6 +8682,8 @@ void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
         VPlanTransforms::truncateToMinimalBitwidths(
             *Plan, CM.getMinimalBitwidths(), PSE.getSE()->getContext());
       VPlanTransforms::optimize(*Plan, *PSE.getSE());
+      if (CM.useVPIWithVPEVLVectorization())
+        VPlanTransforms::addExplicitVectorLength(*Plan);
       assert(VPlanVerifier::verifyPlanIsValid(*Plan) && "VPlan is invalid");
       VPlans.push_back(std::move(Plan));
     }
@@ -9454,6 +9516,52 @@ void VPReplicateRecipe::execute(VPTransformState &State) {
       State.ILV->scalarizeInstruction(UI, this, VPIteration(Part, Lane), State);
 }
 
+/// Creates either vp_store or vp_scatter intrinsics calls to represent
+/// predicated store/scatter.
+static Instruction *
+lowerStoreUsingVectorIntrinsics(IRBuilderBase &Builder, Value *Addr,
+                                Value *StoredVal, bool IsScatter, Value *Mask,
+                                Value *EVLPart, const Align &Alignment) {
+  CallInst *Call;
+  if (IsScatter) {
+    Call = Builder.CreateIntrinsic(Type::getVoidTy(EVLPart->getContext()),
+                                   Intrinsic::vp_scatter,
+                                   {StoredVal, Addr, Mask, EVLPart});
+  } else {
+    VectorBuilder VBuilder(Builder);
+    VBuilder.setEVL(EVLPart).setMask(Mask);
+    Call = cast<CallInst>(VBuilder.createVectorInstruction(
+        Instruction::Store, Type::getVoidTy(EVLPart->getContext()),
+        {StoredVal, Addr}));
+  }
+  Call->addParamAttr(
+      1, Attribute::getWithAlignment(Call->getContext(), Alignment));
+  return Call;
+}
+
+/// Creates either vp_load or vp_gather intrinsics calls to represent
+/// predicated load/gather.
+static Instruction *lowerLoadUsingVectorIntrinsics(IRBuilderBase &Builder,
+                                                   VectorType *DataTy,
+                                                   Value *Addr, bool IsGather,
+                                                   Value *Mask, Value *EVLPart,
+                                                   const Align &Alignment) {
+  CallInst *Call;
+  if (IsGather) {
+    Call = Builder.CreateIntrinsic(DataTy, Intrinsic::vp_gather,
+                                   {Addr, Mask, EVLPart}, nullptr,
+                                   "wide.masked.gather");
+  } else {
+    VectorBuilder VBuilder(Builder);
+    VBuilder.setEVL(EVLPart).setMask(Mask);
+    Call = cast<CallInst>(VBuilder.createVectorInstruction(
+        Instruction::Load, DataTy, Addr, "vp.op.load"));
+  }
+  Call->addParamAttr(
+      0, Attribute::getWithAlignment(Call->getContext(), Alignment));
+  return Call;
+}
+
 void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
   VPValue *StoredValue = isStore() ? getStoredValue() : nullptr;
 
@@ -9523,6 +9631,12 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
     return PartPtr;
   };
 
+  auto MaskValue = [&](unsigned Part) -> Value * {
+    if (isMaskRequired)
+      return BlockInMaskParts[Part];
+    return nullptr;
+  };
+
   // Handle Stores:
   if (SI) {
     State.setDebugLocFrom(SI->getDebugLoc());
@@ -9530,7 +9644,22 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
     for (unsigned Part = 0; Part < State.UF; ++Part) {
       Instruction *NewSI = nullptr;
       Value *StoredVal = State.get(StoredValue, Part);
-      if (CreateGatherScatter) {
+      if (State.EVL) {
+        Value *EVLPart = State.get(State.EVL, Part);
+        // If EVL is not nullptr, then EVL must be a valid value set during plan
+        // creation, possibly default value = whole vector register length. EVL
+        // is created only if TTI prefers predicated vectorization, thus if EVL
+        // is not nullptr it also implies preference for predicated
+        // vectorization.
+        // FIXME: Support reverse store after vp_reverse is added.
+        NewSI = lowerStoreUsingVectorIntrinsics(
+            Builder,
+            CreateGatherScatter
+                ? State.get(getAddr(), Part)
+                : CreateVecPtr(Part, State.get(getAddr(), VPIteration(0, 0))),
+            StoredVal, CreateGatherScatter, MaskValue(Part), EVLPart,
+            Alignment);
+      } else if (CreateGatherScatter) {
         Value *MaskPart = isMaskRequired ? BlockInMaskParts[Part] : nullptr;
         Value *VectorGep = State.get(getAddr(), Part);
         NewSI = Builder.CreateMaskedScatter(StoredVal, VectorGep, Alignment,
@@ -9561,7 +9690,21 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
   State.setDebugLocFrom(LI->getDebugLoc());
   for (unsigned Part = 0; Part < State.UF; ++Part) {
     Value *NewLI;
-    if (CreateGatherScatter) {
+    if (State.EVL) {
+      Value *EVLPart = State.get(State.EVL, Part);
+      // If EVL is not nullptr, then EVL must be a valid value set during plan
+      // creation, possibly default value = whole vector register length. EVL
+      // is created only if TTI prefers predicated vectorization, thus if EVL
+      // is not nullptr it also implies preference for predicated
+      // vectorization.
+      // FIXME: Support reverse loading after vp_reverse is added.
+      NewLI = lowerLoadUsingVectorIntrinsics(
+          Builder, DataTy,
+          CreateGatherScatter
+              ? State.get(getAddr(), Part)
+              : CreateVecPtr(Part, State.get(getAddr(), VPIteration(0, 0))),
+          CreateGatherScatter, MaskValue(Part), EVLPart, Alignment);
+    } else if (CreateGatherScatter) {
       Value *MaskPart = isMaskRequired ? BlockInMaskParts[Part] : nullptr;
       Value *VectorGep = State.get(getAddr(), Part);
       NewLI = Builder.CreateMaskedGather(DataTy, VectorGep, Alignment, MaskPart,
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 94cb7688981361..0ca668abbe60c7 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -242,6 +242,12 @@ struct VPTransformState {
   ElementCount VF;
   unsigned UF;
 
+  /// If EVL is not nullptr, then EVL must be a valid value set during plan
+  /// creation, possibly a default value = whole vector register length. EVL is
+  /// created only if TTI prefers predicated vectorization, thus if EVL is
+  /// not nullptr it also implies preference for predicated vectorization.
+  VPValue *EVL = nullptr;
+
   /// Hold the indices to generate specific scalar instructions. Null indicates
   /// that all instances are to be generated, using either scalar or vector
   /// instructions.
@@ -1057,6 +1063,8 @@ class VPInstruction : public VPRecipeWithIRFlags, public VPValue {
     SLPLoad,
     SLPStore,
     ActiveLaneMask,
+    ExplicitVectorLength,
+    ExplicitVectorLengthIVIncrement,
     CalculateTripCountMinusVF,
     // Increment the canonical IV separately for each unrolled part.
     CanonicalIVIncrementForPart,
@@ -1165,6 +1173,8 @@ class VPInstruction : public VPRecipeWithIRFlags, public VPValue {
     default:
       return false;
     case VPInstruction::ActiveLaneMask:
+    case VPInstruction::ExplicitVectorLength:
+    case VPInstruction::ExplicitVectorLengthIVIncrement:
     case VPInstruction::CalculateTripCountMinusVF:
     case VPInstruction::CanonicalIVIncrementForPart:
     case VPInstruction::BranchOnCount:
@@ -2180,6 +2190,39 @@ class VPActiveLaneMaskPHIRecipe : public VPHeaderPHIRecipe {
 #endif
 };
 
+/// A recipe for generating the phi node for the current index of elements,
+/// adjusted in accordance with EVL value. It starts at StartIV value and gets
+/// incremented by EVL in each iteration of the vector loop.
+class VPEVLBasedIVPHIRecipe : public VPHeaderPHIRecipe {
+public:
+  VPEVLBasedIVPHIRecipe(VPValue *StartMask, DebugLoc DL)
+      : VPHeaderPHIRecipe(VPDef::VPEVLBasedIVPHISC, nullptr, StartMask, DL) {}
+
+  ~VPEVLBasedIVPHIRecipe() override = default;
+
+  VP_CLASSOF_IMPL(VPDef::VPEVLBasedIVPHISC)
+
+  static inline bool classof(const VPHeaderPHIRecipe *D) {
+    return D->getVPDefID() == VPDef::VPEVLBasedIVPHISC;
+  }
+
+  /// Generate phi for handling IV based on EVL over iterations correctly.
+  void execute(VPTransformState &State) override;
+
+  /// Returns true if the recipe only uses the first lane of operand \p Op.
+  bool onlyFirstLaneUsed(const VPValue *Op) const override {
+    assert(is_contained(operands(), Op) &&
+           "Op must be an operand of the recipe");
+    return true;
+  }
+
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+  /// Print the recipe.
+  void print(raw_ostream &O, const Twine &Indent,
+             VPSlotTracker &SlotTracker) const override;
+#endif
+};
+
 /// A Recipe for widening the canonical induction variable of the vector loop.
 class VPWidenCanonicalIVRecipe : public VPRecipeBase, public VPValue {
 public:
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index 97a8a1803bbf5a..b8ed256d236a4b 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -207,14 +207,14 @@ Type *VPTypeAnalysis::inferScalarType(const VPValue *V) {
   Type *ResultTy =
       TypeSwitch<const VPRecipeBase *, Type *>(V->getDefiningRecipe())
           .Case<VPCanonicalIVPHIRecipe, VPFirstOrderRecurrencePHIRecipe,
-                VPReductionPHIRecipe, VPWidenPointerInductionRecipe>(
-              [this](const auto *R) {
-                // Handle header phi recipes, except VPWienIntOrFpInduction
-                // which needs special handling due it being possibly truncated.
-                // TODO: consider inferring/caching type of siblings, e.g.,
-                // backedge value, here and in cases below.
-                return inferScalarType(R->getStartValue());
-              })
+                VPReductionPHIRecipe, VPWidenPointerInductionRecipe,
+                VPEVLBasedIVPHIRecipe>([this](const auto *R) {
+            // Handle header phi recipes, except VPWienIntOrFpInduction
+            // which needs special handling due it being possibly truncated.
+            // TODO: consider inferring/caching type of siblings, e.g.,
+            // backedge value, here and in cases below.
+            return inferScalarType(R->getStartValue());
+          })
           .Case<VPWidenIntOrFpInductionRecipe, VPDerivedIVRecipe>(
               [](const auto *R) { return R->getScalarType(); })
           .Case<VPPredInstPHIRecipe, VPWidenPHIRecipe, VPScalarIVStepsRecipe,
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 02e400d590bed4..5e0344a14df5da 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -345,6 +345,44 @@ Value *VPInstruction::generateInstruction(VPTransformState &State,
     Value *Zero = ConstantInt::get(ScalarTC->getType(), 0);
     return Builder.CreateSelect(Cmp, Sub, Zero);
   }
+  case VPInstruction::ExplicitVectorLength: {
+    // Compute EVL
+    auto GetSetVL = [=](VPTransformState &State, Value *EVL) {
+      assert(EVL->getType()->isIntegerTy() &&
+             "Requested vector length should be an integer.");
+
+      // TODO: Add support for MaxSafeDist for correct loop emission.
+      Value *VFArg = State.Builder.getInt32(State.VF.getKnownMinValue());
+
+      Value *GVL = State.Builder.CreateIntrinsic(
+          State.Builder.getInt32Ty(), Intrinsic::experimental_get_vector_length,
+          {EVL, VFArg, State.Builder.getTrue()});
+      return GVL;
+    };
+    // TODO: Restructur...
[truncated]

Dec 21 '23 18:12 llvmbot

:white_check_mark: With the latest revision this PR passed the C/C++ code formatter.

Dec 21 '23 18:12 github-actions[bot]

Ping!

Dec 29 '23 14:12 alexey-bataev

Thanks for moving to Github now that Phabricator has been taken down!

I think @ayalz added some comments shortly before Phabricator was deactivated; unfortunately https://reviews.llvm.org/D99750 isn't accessible at the moment it seems (and it also doesn't seem to be available at http://108.170.204.19/D99750 which is supposed to have a static mirror). I am not sure what's the best way to pick up the recent comments here, perhaps it would be best to share the latest responses here on GH now?

Dec 29 '23 21:12 fhahn

I addressed most of the @ayalz comments in this version

Dec 29 '23 22:12 alexey-bataev

I addressed most of the @ayalz comments in this version

Ok thanks!

It would be helpful to import the recent conversations here and including what has been addressed how in the current iteration and if anything is still left open. Unfortunately it looks like for some reason D99750 isn't included in the static archive of reviews.llvm.org, I posted https://discourse.llvm.org/t/some-reviews-on-reviews-llvm-org-seem-to-be-missing-from-the-static-archive/76001 to hopefully get back access to the context in Phabricator.

Jan 02 '24 13:01 fhahn

Rebase

Jan 05 '24 01:01 alexey-bataev

The update regarding AVL/EVL. I missed one point here, when we discussed it before.

AVL can be referred as the input parameter for llvm.experimental.get.vector.length intrinsic.
EVL is the result, returned by this intrinsic.

So, this 2 subject are separate. For this reason the corresponding parameter in LLVM IR Reference manual (https://llvm.org/docs/LangRef.html) for VP-based intrinsics is named as %evl and we use it here as EVL.

Jan 22 '24 18:01 alexey-bataev

Rebase + merged the check lines in the tests

Jan 22 '24 18:01 alexey-bataev

The review on Phabricator is now available on the static archive: https://reviews.llvm.org/D99750

Went through @ayalz 's latest comments and shared the one that still seems pending/open. There was quite a lot to go over, so I might have missed some comments.

Also added some more comments inline. In terms of further refactoring for this patch, would be good to remove the dedicated EVLIncrement opcode now that #78113 landed, if possible.

The other larger pending suggestion is related to EVL handling in the recipes; I suggest to add a TODO and address this as follow-up, unless @ayalz prefers doing the refactoring first. I am planning to split up/refactor memory recipe soon now that address computation is already moved out.

Instruction::Add still does not work, crashes the compiler because this VPInstruction returns that it "does not use only first lane".

Jan 26 '24 17:01 alexey-bataev

Instruction::Add still does not work, crashes the compiler because this VPInstruction returns that it "does not use only first lane".

Can you push that commit somewhere so I can have a look? Looks like only-first-lane-used analysis might need some additional info.

Just replace VPInstruction::ExplicitVectorLengthIVIncrement with Instruction::Add in lib/Transforms/Vectorize/VPlanTransforms.cpp, line 1273

Jan 26 '24 21:01 alexey-bataev

Oh right, surprised it is already used by PPC. EVL LV won't work on PPC due to them not using scalable vectors? At least I cannot find a test that uses vscale. But would it make sense to support EVL on PPC, if it supports active-vector-length?

Currently it won't work for PPC, since it has some specific checks in TTI. We (or PPC developers) can enable it later.

Jan 29 '24 19:01 alexey-bataev

Yep, I meant it looks like the comment needs updating, as it says it is set during plan creation.

Changed it to "during plan transformation"

Jan 29 '24 19:01 alexey-bataev

When introducing the EVL, the transform should do something like Plan->setUF(1)

It does. It adds UF=1 to Ufs, but VPlan print does not print the list of Ufs, it just checks the size and if UFs is not empty, prints "UF>=1". "UF=1" is printed only if UFs is empty.

Jan 29 '24 19:01 alexey-bataev

Instruction::Add still does not work, crashes the compiler because this VPInstruction returns that it "does not use only first lane".

Ah yes thanks for checking, had a look at this and there's also another issue, that is how some VPInstructions access and store scalar values as vector values in VPTransformState, which doesn't interact well with other recipes in general. I'll soon share patches to address those 2 shortcomings, but I'd expect that they will take some time to land. Don't think we need to block this patch until that is done, replacing the custom opcode with ::Add can be done when they land.

The latest version of the patch has diverged quite a bit from the description I think, could you update the title & description to reflect the current scope of the patch?

I checked the description, did not find big difference between the description and the code. Could you point explicitly, which part must be updated?

Jan 31 '24 16:01 alexey-bataev

I checked the description, did not find big difference between the description and the code. Could you point explicitly, which part must be updated?

For the title, I'd suggest explicitly stating that this adds a new tail-folding mode using EVL
Our first approach (implemented in this patch) builds on top of the existing tail-folding mechanism in the LV, but instead of generating masked intrinsics for memory operations it generates VP intrinsics for loads/stores instructions.
Would be good to mention this patch 1) adds a new VPlanTransforms to replace the wide header predicate compare with EVL and 2) updates codegen for load/stores to use VP store/load with EVL.
Other important part of this approach is how the Explicit Vector Length is computed. (We use active vector length and explicit vector length interchangeably; VP intrinsics define this vector length parameter as Explicit Vector Length (EVL)). We consider the following three ways to compute the EVL parameter for the VP Intrinsics.

The simplest way is to use the VF as EVL and rely solely on the mask parameter to control predication. The mask parameter is the same as computed for current tail-folding implementation.
The second way is to insert instructions to compute min(VF, trip_count - index) for each vector iteration.
For architectures like RISC-V, which have special instruction to compute/set an explicit vector length, we also introduce an experimental intrinsic get_vector_length, that can be lowered to architecture specific instruction(s) to compute EVL.
Better to just state what option is chosen for the implementation (or at least make it explicit that the patch only adds the 3rd one?)

I'd also drop the tentative roadmap and mention the current limitation (UF = 1, no reductions, dep-distance restrictions, no reverse access)

Done

Jan 31 '24 21:01 alexey-bataev

Ping!

Feb 02 '24 18:02 alexey-bataev

In the process of going over this again, posting a first set of comments to pipeline the work.

The second paragraph of the commit message appears broken at the end.

Fixed, no idea what's happened

Feb 05 '24 19:02 alexey-bataev

Ping!

Feb 12 '24 20:02 alexey-bataev

Thanks for the recent update! Only a further suggestion regarding tail-folding style handling and moving that to a separate NFC PR in preparation for this one

Please +1 if remaining changes can be done in a separate PR.

@ayalz please update if you have further comments.

Feb 13 '24 22:02 appujee

Thanks for the recent update! Only a further suggestion regarding tail-folding style handling and moving that to a separate NFC PR in preparation for this one

Please +1 if remaining changes can be done in a separate PR.

In case I wasn't clear above, the suggestion is to do the refactoring of tail-folding style selection as separate PR and have this PR depend on the refactoring PR.

Feb 13 '24 22:02 fhahn

Thanks for the recent update! Only a further suggestion regarding tail-folding style handling and moving that to a separate NFC PR in preparation for this one

Please +1 if remaining changes can be done in a separate PR.

In case I wasn't clear above, the suggestion is to do the refactoring of tail-folding style selection as separate PR and have this PR depend on the refactoring PR.

https://github.com/llvm/llvm-project/pull/81885

Feb 15 '24 23:02 alexey-bataev

Ping!

Mar 06 '24 18:03 alexey-bataev

Ping!

Mar 07 '24 15:03 alexey-bataev

@alexey-bataev @fhahn

As Ayal has not responded to the patch in weeks (https://github.com/llvm/llvm-project/pull/76172#issuecomment-1942735278) even after pinging, should we follow up in a post commit review if and when Ayal has feedback?

Mar 07 '24 15:03 appujee

Ping!

Mar 12 '24 18:03 alexey-bataev

Ping!

Mar 14 '24 12:03 alexey-bataev

Ping!

Mar 18 '24 12:03 alexey-bataev

Ping!

Mar 19 '24 20:03 alexey-bataev

llvm-project llvm-project copied to clipboard

[LV, VP]VP intrinsics support for the Loop Vectorizer + adding new tail-folding mode using EVL.

llvm-project
llvm-project copied to clipboard