llvm-project [RISCV] Allow folding vmerge with implicit passthru when true has tied dest

We currently don't fold a vmerge if it has an implicit passthru and its true operand also has a passthru (i.e. tied dest).

This restriction was added in https://reviews.llvm.org/D151596, back whenever we had separate TU/TA pseudos. It looks like it was added because the policy might not have been handled correctly.

However the policy should be set correctly if we relax this restriction today, since we compute the policy differently now that we have removed the TU/TA distinction in our pseudos.

We use a TUMU policy, and relax it to TAMU iff the vmerge's passthru is implicit.

The reasoning behind this being that the tail elements always come from the vmerge's passthru[^1], so if vmerge's passthru is implicit then the tail is also implicit, and hence tail agnostic.

[^1]: unless the VL was shrunk, but in this case which case we conservatively use TUMU.

[!NOTE] The tests were included in a separate commit this PR so reviewers can see the diff

Jan 18 '24 11:01 lukel97

@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

We currently don't fold a vmerge if it has an implicit merge operand and its true operand has a tied dest (i.e. has a passthru operand).

This restriction was added in https://reviews.llvm.org/D151596, back whenever we had separate TU/TA pseudos. It looks like it was added because the policy might not have been handled correctly.

However the policy should be set correctly if we relax this restriction today, since we compute the policy differently now that we have removed the TU/TA distinction in our pseudos.

We use a TUMU policy, and relax it to TAMU iff the vmerge's merge operand is implicit.

The reasoning behind this being that the tail elements always come from the vmerge's merge operand[^1], so if the merge operand is implicit-def then the tail is implicit-def, and hence tail agnostic.

[^1]: unless the VL was shrunk, in which case we conservatively use TUMU.

> [!NOTE] > The tests were included in a separate commit this PR so reviewers can see the diff

Full diff: https://github.com/llvm/llvm-project/pull/78565.diff

3 Files Affected:

(modified) llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp (-8)
(modified) llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-vops.ll (+38)
(modified) llvm/test/CodeGen/RISCV/rvv/vmadd-vp.ll (+44-66)

diff --git a/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp b/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
index 0d8688ba2eaeaff..5b20a2014cc98ca 100644
--- a/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
@@ -3531,11 +3531,6 @@ bool RISCVDAGToDAGISel::performCombineVMergeAndVOps(SDNode *N) {
     return false;
 
   if (HasTiedDest && !isImplicitDef(True->getOperand(0))) {
-    // The vmerge instruction must be TU.
-    // FIXME: This could be relaxed, but we need to handle the policy for the
-    // resulting op correctly.
-    if (isImplicitDef(Merge))
-      return false;
     SDValue MergeOpTrue = True->getOperand(0);
     // Both the vmerge instruction and the True instruction must have the same
     // merge operand.
@@ -3545,9 +3540,6 @@ bool RISCVDAGToDAGISel::performCombineVMergeAndVOps(SDNode *N) {
 
   if (IsMasked) {
     assert(HasTiedDest && "Expected tied dest");
-    // The vmerge instruction must be TU.
-    if (isImplicitDef(Merge))
-      return false;
     // The vmerge instruction must have an all 1s mask since we're going to keep
     // the mask from the True instruction.
     // FIXME: Support mask agnostic True instruction which would have an
diff --git a/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-vops.ll b/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-vops.ll
index c639f092444fc43..b3166bf5b01071d 100644
--- a/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-vops.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-vops.ll
@@ -1187,3 +1187,41 @@ define <vscale x 2 x i32> @vmerge_larger_vl_false_becomes_tail(<vscale x 2 x i32
   %b = call <vscale x 2 x i32> @llvm.riscv.vmerge.nxv2i32.nxv2i32(<vscale x 2 x i32> poison, <vscale x 2 x i32> %false, <vscale x 2 x i32> %a, <vscale x 2 x i1> %m, i64 3)
   ret <vscale x 2 x i32> %b
 }
+
+declare <vscale x 2 x i32> @llvm.riscv.vmacc.nxv2i32.nxv2i32(<vscale x 2 x i32>, <vscale x 2 x i32>, <vscale x 2 x i32>, i64, i64)
+
+define <vscale x 2 x i32> @true_tied_dest_vmerge_implicit_passthru(<vscale x 2 x i32> %passthru, <vscale x 2 x i32> %x, <vscale x 2 x i32> %y, <vscale x 2 x i1> %m, i64 %avl) {
+; CHECK-LABEL: true_tied_dest_vmerge_implicit_passthru:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetvli zero, a0, e32, m1, ta, mu
+; CHECK-NEXT:    vmacc.vv v8, v9, v10, v0.t
+; CHECK-NEXT:    ret
+  %a = call <vscale x 2 x i32> @llvm.riscv.vmacc.nxv2i32.nxv2i32(<vscale x 2 x i32> %passthru, <vscale x 2 x i32> %x, <vscale x 2 x i32> %y, i64 %avl, i64 0)
+  %b = call <vscale x 2 x i32> @llvm.riscv.vmerge.nxv2i32.nxv2i32(
+    <vscale x 2 x i32> poison,
+    <vscale x 2 x i32> %passthru,
+    <vscale x 2 x i32> %a,
+    <vscale x 2 x i1> %m,
+    i64 %avl
+  )
+  ret <vscale x 2 x i32> %b
+}
+
+declare <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32.i32(<vscale x 2 x i32>, <vscale x 2 x i32>, i32, <vscale x 2 x i1>, i64, i64)
+
+define <vscale x 2 x i32> @true_mask_vmerge_implicit_passthru(<vscale x 2 x i32> %passthru, <vscale x 2 x i32> %x, <vscale x 2 x i32> %y, <vscale x 2 x i1> %m, i64 %avl) {
+; CHECK-LABEL: true_mask_vmerge_implicit_passthru:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetvli zero, a0, e32, m1, ta, mu
+; CHECK-NEXT:    vadd.vv v8, v9, v10, v0.t
+; CHECK-NEXT:    ret
+  %a = call <vscale x 2 x i32> @llvm.riscv.vadd.mask.nxv2i32.nxv2i32(<vscale x 2 x i32> %passthru, <vscale x 2 x i32> %x, <vscale x 2 x i32> %y, <vscale x 2 x i1> %m, i64 %avl, i64 0)
+  %b = call <vscale x 2 x i32> @llvm.riscv.vmerge.nxv2i32.nxv2i32(
+    <vscale x 2 x i32> poison,
+    <vscale x 2 x i32> %passthru,
+    <vscale x 2 x i32> %a,
+    <vscale x 2 x i1> shufflevector(<vscale x 2 x i1> insertelement(<vscale x 2 x i1> poison, i1 true, i32 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer),
+    i64 %avl
+  )
+  ret <vscale x 2 x i32> %b
+}
diff --git a/llvm/test/CodeGen/RISCV/rvv/vmadd-vp.ll b/llvm/test/CodeGen/RISCV/rvv/vmadd-vp.ll
index 6a6d7d2a41424c5..d699ee3f1dc29fc 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vmadd-vp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vmadd-vp.ll
@@ -91,9 +91,8 @@ define <vscale x 1 x i8> @vmadd_vv_nxv1i8_ta(<vscale x 1 x i8> %a, <vscale x 1 x
 define <vscale x 1 x i8> @vmadd_vx_nxv1i8_ta(<vscale x 1 x i8> %a, i8 %b, <vscale x 1 x i8> %c,  <vscale x 1 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vmadd_vx_nxv1i8_ta:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetvli zero, a1, e8, mf8, ta, ma
-; CHECK-NEXT:    vmacc.vx v9, a0, v8
-; CHECK-NEXT:    vmerge.vvm v8, v8, v9, v0
+; CHECK-NEXT:    vsetvli zero, a1, e8, mf8, ta, mu
+; CHECK-NEXT:    vmadd.vx v8, a0, v9, v0.t
 ; CHECK-NEXT:    ret
   %elt.head = insertelement <vscale x 1 x i8> poison, i8 %b, i32 0
   %vb = shufflevector <vscale x 1 x i8> %elt.head, <vscale x 1 x i8> poison, <vscale x 1 x i32> zeroinitializer
@@ -192,9 +191,8 @@ define <vscale x 2 x i8> @vmadd_vv_nxv2i8_ta(<vscale x 2 x i8> %a, <vscale x 2 x
 define <vscale x 2 x i8> @vmadd_vx_nxv2i8_ta(<vscale x 2 x i8> %a, i8 %b, <vscale x 2 x i8> %c,  <vscale x 2 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vmadd_vx_nxv2i8_ta:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetvli zero, a1, e8, mf4, ta, ma
-; CHECK-NEXT:    vmacc.vx v9, a0, v8
-; CHECK-NEXT:    vmerge.vvm v8, v8, v9, v0
+; CHECK-NEXT:    vsetvli zero, a1, e8, mf4, ta, mu
+; CHECK-NEXT:    vmadd.vx v8, a0, v9, v0.t
 ; CHECK-NEXT:    ret
   %elt.head = insertelement <vscale x 2 x i8> poison, i8 %b, i32 0
   %vb = shufflevector <vscale x 2 x i8> %elt.head, <vscale x 2 x i8> poison, <vscale x 2 x i32> zeroinitializer
@@ -293,9 +291,8 @@ define <vscale x 4 x i8> @vmadd_vv_nxv4i8_ta(<vscale x 4 x i8> %a, <vscale x 4 x
 define <vscale x 4 x i8> @vmadd_vx_nxv4i8_ta(<vscale x 4 x i8> %a, i8 %b, <vscale x 4 x i8> %c,  <vscale x 4 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vmadd_vx_nxv4i8_ta:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetvli zero, a1, e8, mf2, ta, ma
-; CHECK-NEXT:    vmacc.vx v9, a0, v8
-; CHECK-NEXT:    vmerge.vvm v8, v8, v9, v0
+; CHECK-NEXT:    vsetvli zero, a1, e8, mf2, ta, mu
+; CHECK-NEXT:    vmadd.vx v8, a0, v9, v0.t
 ; CHECK-NEXT:    ret
   %elt.head = insertelement <vscale x 4 x i8> poison, i8 %b, i32 0
   %vb = shufflevector <vscale x 4 x i8> %elt.head, <vscale x 4 x i8> poison, <vscale x 4 x i32> zeroinitializer
@@ -394,9 +391,8 @@ define <vscale x 8 x i8> @vmadd_vv_nxv8i8_ta(<vscale x 8 x i8> %a, <vscale x 8 x
 define <vscale x 8 x i8> @vmadd_vx_nxv8i8_ta(<vscale x 8 x i8> %a, i8 %b, <vscale x 8 x i8> %c,  <vscale x 8 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vmadd_vx_nxv8i8_ta:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetvli zero, a1, e8, m1, ta, ma
-; CHECK-NEXT:    vmacc.vx v9, a0, v8
-; CHECK-NEXT:    vmerge.vvm v8, v8, v9, v0
+; CHECK-NEXT:    vsetvli zero, a1, e8, m1, ta, mu
+; CHECK-NEXT:    vmadd.vx v8, a0, v9, v0.t
 ; CHECK-NEXT:    ret
   %elt.head = insertelement <vscale x 8 x i8> poison, i8 %b, i32 0
   %vb = shufflevector <vscale x 8 x i8> %elt.head, <vscale x 8 x i8> poison, <vscale x 8 x i32> zeroinitializer
@@ -495,9 +491,8 @@ define <vscale x 16 x i8> @vmadd_vv_nxv16i8_ta(<vscale x 16 x i8> %a, <vscale x
 define <vscale x 16 x i8> @vmadd_vx_nxv16i8_ta(<vscale x 16 x i8> %a, i8 %b, <vscale x 16 x i8> %c,  <vscale x 16 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vmadd_vx_nxv16i8_ta:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetvli zero, a1, e8, m2, ta, ma
-; CHECK-NEXT:    vmacc.vx v10, a0, v8
-; CHECK-NEXT:    vmerge.vvm v8, v8, v10, v0
+; CHECK-NEXT:    vsetvli zero, a1, e8, m2, ta, mu
+; CHECK-NEXT:    vmadd.vx v8, a0, v10, v0.t
 ; CHECK-NEXT:    ret
   %elt.head = insertelement <vscale x 16 x i8> poison, i8 %b, i32 0
   %vb = shufflevector <vscale x 16 x i8> %elt.head, <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer
@@ -596,9 +591,8 @@ define <vscale x 32 x i8> @vmadd_vv_nxv32i8_ta(<vscale x 32 x i8> %a, <vscale x
 define <vscale x 32 x i8> @vmadd_vx_nxv32i8_ta(<vscale x 32 x i8> %a, i8 %b, <vscale x 32 x i8> %c,  <vscale x 32 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vmadd_vx_nxv32i8_ta:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetvli zero, a1, e8, m4, ta, ma
-; CHECK-NEXT:    vmacc.vx v12, a0, v8
-; CHECK-NEXT:    vmerge.vvm v8, v8, v12, v0
+; CHECK-NEXT:    vsetvli zero, a1, e8, m4, ta, mu
+; CHECK-NEXT:    vmadd.vx v8, a0, v12, v0.t
 ; CHECK-NEXT:    ret
   %elt.head = insertelement <vscale x 32 x i8> poison, i8 %b, i32 0
   %vb = shufflevector <vscale x 32 x i8> %elt.head, <vscale x 32 x i8> poison, <vscale x 32 x i32> zeroinitializer
@@ -700,9 +694,8 @@ define <vscale x 64 x i8> @vmadd_vv_nxv64i8_ta(<vscale x 64 x i8> %a, <vscale x
 define <vscale x 64 x i8> @vmadd_vx_nxv64i8_ta(<vscale x 64 x i8> %a, i8 %b, <vscale x 64 x i8> %c,  <vscale x 64 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vmadd_vx_nxv64i8_ta:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetvli zero, a1, e8, m8, ta, ma
-; CHECK-NEXT:    vmacc.vx v16, a0, v8
-; CHECK-NEXT:    vmerge.vvm v8, v8, v16, v0
+; CHECK-NEXT:    vsetvli zero, a1, e8, m8, ta, mu
+; CHECK-NEXT:    vmadd.vx v8, a0, v16, v0.t
 ; CHECK-NEXT:    ret
   %elt.head = insertelement <vscale x 64 x i8> poison, i8 %b, i32 0
   %vb = shufflevector <vscale x 64 x i8> %elt.head, <vscale x 64 x i8> poison, <vscale x 64 x i32> zeroinitializer
@@ -801,9 +794,8 @@ define <vscale x 1 x i16> @vmadd_vv_nxv1i16_ta(<vscale x 1 x i16> %a, <vscale x
 define <vscale x 1 x i16> @vmadd_vx_nxv1i16_ta(<vscale x 1 x i16> %a, i16 %b, <vscale x 1 x i16> %c,  <vscale x 1 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vmadd_vx_nxv1i16_ta:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetvli zero, a1, e16, mf4, ta, ma
-; CHECK-NEXT:    vmacc.vx v9, a0, v8
-; CHECK-NEXT:    vmerge.vvm v8, v8, v9, v0
+; CHECK-NEXT:    vsetvli zero, a1, e16, mf4, ta, mu
+; CHECK-NEXT:    vmadd.vx v8, a0, v9, v0.t
 ; CHECK-NEXT:    ret
   %elt.head = insertelement <vscale x 1 x i16> poison, i16 %b, i32 0
   %vb = shufflevector <vscale x 1 x i16> %elt.head, <vscale x 1 x i16> poison, <vscale x 1 x i32> zeroinitializer
@@ -902,9 +894,8 @@ define <vscale x 2 x i16> @vmadd_vv_nxv2i16_ta(<vscale x 2 x i16> %a, <vscale x
 define <vscale x 2 x i16> @vmadd_vx_nxv2i16_ta(<vscale x 2 x i16> %a, i16 %b, <vscale x 2 x i16> %c,  <vscale x 2 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vmadd_vx_nxv2i16_ta:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetvli zero, a1, e16, mf2, ta, ma
-; CHECK-NEXT:    vmacc.vx v9, a0, v8
-; CHECK-NEXT:    vmerge.vvm v8, v8, v9, v0
+; CHECK-NEXT:    vsetvli zero, a1, e16, mf2, ta, mu
+; CHECK-NEXT:    vmadd.vx v8, a0, v9, v0.t
 ; CHECK-NEXT:    ret
   %elt.head = insertelement <vscale x 2 x i16> poison, i16 %b, i32 0
   %vb = shufflevector <vscale x 2 x i16> %elt.head, <vscale x 2 x i16> poison, <vscale x 2 x i32> zeroinitializer
@@ -1003,9 +994,8 @@ define <vscale x 4 x i16> @vmadd_vv_nxv4i16_ta(<vscale x 4 x i16> %a, <vscale x
 define <vscale x 4 x i16> @vmadd_vx_nxv4i16_ta(<vscale x 4 x i16> %a, i16 %b, <vscale x 4 x i16> %c,  <vscale x 4 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vmadd_vx_nxv4i16_ta:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetvli zero, a1, e16, m1, ta, ma
-; CHECK-NEXT:    vmacc.vx v9, a0, v8
-; CHECK-NEXT:    vmerge.vvm v8, v8, v9, v0
+; CHECK-NEXT:    vsetvli zero, a1, e16, m1, ta, mu
+; CHECK-NEXT:    vmadd.vx v8, a0, v9, v0.t
 ; CHECK-NEXT:    ret
   %elt.head = insertelement <vscale x 4 x i16> poison, i16 %b, i32 0
   %vb = shufflevector <vscale x 4 x i16> %elt.head, <vscale x 4 x i16> poison, <vscale x 4 x i32> zeroinitializer
@@ -1104,9 +1094,8 @@ define <vscale x 8 x i16> @vmadd_vv_nxv8i16_ta(<vscale x 8 x i16> %a, <vscale x
 define <vscale x 8 x i16> @vmadd_vx_nxv8i16_ta(<vscale x 8 x i16> %a, i16 %b, <vscale x 8 x i16> %c,  <vscale x 8 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vmadd_vx_nxv8i16_ta:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetvli zero, a1, e16, m2, ta, ma
-; CHECK-NEXT:    vmacc.vx v10, a0, v8
-; CHECK-NEXT:    vmerge.vvm v8, v8, v10, v0
+; CHECK-NEXT:    vsetvli zero, a1, e16, m2, ta, mu
+; CHECK-NEXT:    vmadd.vx v8, a0, v10, v0.t
 ; CHECK-NEXT:    ret
   %elt.head = insertelement <vscale x 8 x i16> poison, i16 %b, i32 0
   %vb = shufflevector <vscale x 8 x i16> %elt.head, <vscale x 8 x i16> poison, <vscale x 8 x i32> zeroinitializer
@@ -1205,9 +1194,8 @@ define <vscale x 16 x i16> @vmadd_vv_nxv16i16_ta(<vscale x 16 x i16> %a, <vscale
 define <vscale x 16 x i16> @vmadd_vx_nxv16i16_ta(<vscale x 16 x i16> %a, i16 %b, <vscale x 16 x i16> %c,  <vscale x 16 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vmadd_vx_nxv16i16_ta:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetvli zero, a1, e16, m4, ta, ma
-; CHECK-NEXT:    vmacc.vx v12, a0, v8
-; CHECK-NEXT:    vmerge.vvm v8, v8, v12, v0
+; CHECK-NEXT:    vsetvli zero, a1, e16, m4, ta, mu
+; CHECK-NEXT:    vmadd.vx v8, a0, v12, v0.t
 ; CHECK-NEXT:    ret
   %elt.head = insertelement <vscale x 16 x i16> poison, i16 %b, i32 0
   %vb = shufflevector <vscale x 16 x i16> %elt.head, <vscale x 16 x i16> poison, <vscale x 16 x i32> zeroinitializer
@@ -1309,9 +1297,8 @@ define <vscale x 32 x i16> @vmadd_vv_nxv32i16_ta(<vscale x 32 x i16> %a, <vscale
 define <vscale x 32 x i16> @vmadd_vx_nxv32i16_ta(<vscale x 32 x i16> %a, i16 %b, <vscale x 32 x i16> %c,  <vscale x 32 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vmadd_vx_nxv32i16_ta:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetvli zero, a1, e16, m8, ta, ma
-; CHECK-NEXT:    vmacc.vx v16, a0, v8
-; CHECK-NEXT:    vmerge.vvm v8, v8, v16, v0
+; CHECK-NEXT:    vsetvli zero, a1, e16, m8, ta, mu
+; CHECK-NEXT:    vmadd.vx v8, a0, v16, v0.t
 ; CHECK-NEXT:    ret
   %elt.head = insertelement <vscale x 32 x i16> poison, i16 %b, i32 0
   %vb = shufflevector <vscale x 32 x i16> %elt.head, <vscale x 32 x i16> poison, <vscale x 32 x i32> zeroinitializer
@@ -1410,9 +1397,8 @@ define <vscale x 1 x i32> @vmadd_vv_nxv1i32_ta(<vscale x 1 x i32> %a, <vscale x
 define <vscale x 1 x i32> @vmadd_vx_nxv1i32_ta(<vscale x 1 x i32> %a, i32 %b, <vscale x 1 x i32> %c,  <vscale x 1 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vmadd_vx_nxv1i32_ta:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetvli zero, a1, e32, mf2, ta, ma
-; CHECK-NEXT:    vmacc.vx v9, a0, v8
-; CHECK-NEXT:    vmerge.vvm v8, v8, v9, v0
+; CHECK-NEXT:    vsetvli zero, a1, e32, mf2, ta, mu
+; CHECK-NEXT:    vmadd.vx v8, a0, v9, v0.t
 ; CHECK-NEXT:    ret
   %elt.head = insertelement <vscale x 1 x i32> poison, i32 %b, i32 0
   %vb = shufflevector <vscale x 1 x i32> %elt.head, <vscale x 1 x i32> poison, <vscale x 1 x i32> zeroinitializer
@@ -1511,9 +1497,8 @@ define <vscale x 2 x i32> @vmadd_vv_nxv2i32_ta(<vscale x 2 x i32> %a, <vscale x
 define <vscale x 2 x i32> @vmadd_vx_nxv2i32_ta(<vscale x 2 x i32> %a, i32 %b, <vscale x 2 x i32> %c,  <vscale x 2 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vmadd_vx_nxv2i32_ta:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetvli zero, a1, e32, m1, ta, ma
-; CHECK-NEXT:    vmacc.vx v9, a0, v8
-; CHECK-NEXT:    vmerge.vvm v8, v8, v9, v0
+; CHECK-NEXT:    vsetvli zero, a1, e32, m1, ta, mu
+; CHECK-NEXT:    vmadd.vx v8, a0, v9, v0.t
 ; CHECK-NEXT:    ret
   %elt.head = insertelement <vscale x 2 x i32> poison, i32 %b, i32 0
   %vb = shufflevector <vscale x 2 x i32> %elt.head, <vscale x 2 x i32> poison, <vscale x 2 x i32> zeroinitializer
@@ -1612,9 +1597,8 @@ define <vscale x 4 x i32> @vmadd_vv_nxv4i32_ta(<vscale x 4 x i32> %a, <vscale x
 define <vscale x 4 x i32> @vmadd_vx_nxv4i32_ta(<vscale x 4 x i32> %a, i32 %b, <vscale x 4 x i32> %c,  <vscale x 4 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vmadd_vx_nxv4i32_ta:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetvli zero, a1, e32, m2, ta, ma
-; CHECK-NEXT:    vmacc.vx v10, a0, v8
-; CHECK-NEXT:    vmerge.vvm v8, v8, v10, v0
+; CHECK-NEXT:    vsetvli zero, a1, e32, m2, ta, mu
+; CHECK-NEXT:    vmadd.vx v8, a0, v10, v0.t
 ; CHECK-NEXT:    ret
   %elt.head = insertelement <vscale x 4 x i32> poison, i32 %b, i32 0
   %vb = shufflevector <vscale x 4 x i32> %elt.head, <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
@@ -1713,9 +1697,8 @@ define <vscale x 8 x i32> @vmadd_vv_nxv8i32_ta(<vscale x 8 x i32> %a, <vscale x
 define <vscale x 8 x i32> @vmadd_vx_nxv8i32_ta(<vscale x 8 x i32> %a, i32 %b, <vscale x 8 x i32> %c,  <vscale x 8 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vmadd_vx_nxv8i32_ta:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetvli zero, a1, e32, m4, ta, ma
-; CHECK-NEXT:    vmacc.vx v12, a0, v8
-; CHECK-NEXT:    vmerge.vvm v8, v8, v12, v0
+; CHECK-NEXT:    vsetvli zero, a1, e32, m4, ta, mu
+; CHECK-NEXT:    vmadd.vx v8, a0, v12, v0.t
 ; CHECK-NEXT:    ret
   %elt.head = insertelement <vscale x 8 x i32> poison, i32 %b, i32 0
   %vb = shufflevector <vscale x 8 x i32> %elt.head, <vscale x 8 x i32> poison, <vscale x 8 x i32> zeroinitializer
@@ -1817,9 +1800,8 @@ define <vscale x 16 x i32> @vmadd_vv_nxv16i32_ta(<vscale x 16 x i32> %a, <vscale
 define <vscale x 16 x i32> @vmadd_vx_nxv16i32_ta(<vscale x 16 x i32> %a, i32 %b, <vscale x 16 x i32> %c,  <vscale x 16 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vmadd_vx_nxv16i32_ta:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetvli zero, a1, e32, m8, ta, ma
-; CHECK-NEXT:    vmacc.vx v16, a0, v8
-; CHECK-NEXT:    vmerge.vvm v8, v8, v16, v0
+; CHECK-NEXT:    vsetvli zero, a1, e32, m8, ta, mu
+; CHECK-NEXT:    vmadd.vx v8, a0, v16, v0.t
 ; CHECK-NEXT:    ret
   %elt.head = insertelement <vscale x 16 x i32> poison, i32 %b, i32 0
   %vb = shufflevector <vscale x 16 x i32> %elt.head, <vscale x 16 x i32> poison, <vscale x 16 x i32> zeroinitializer
@@ -1965,9 +1947,8 @@ define <vscale x 1 x i64> @vmadd_vx_nxv1i64_ta(<vscale x 1 x i64> %a, i64 %b, <v
 ;
 ; RV64-LABEL: vmadd_vx_nxv1i64_ta:
 ; RV64:       # %bb.0:
-; RV64-NEXT:    vsetvli zero, a1, e64, m1, ta, ma
-; RV64-NEXT:    vmacc.vx v9, a0, v8
-; RV64-NEXT:    vmerge.vvm v8, v8, v9, v0
+; RV64-NEXT:    vsetvli zero, a1, e64, m1, ta, mu
+; RV64-NEXT:    vmadd.vx v8, a0, v9, v0.t
 ; RV64-NEXT:    ret
   %elt.head = insertelement <vscale x 1 x i64> poison, i64 %b, i32 0
   %vb = shufflevector <vscale x 1 x i64> %elt.head, <vscale x 1 x i64> poison, <vscale x 1 x i32> zeroinitializer
@@ -2113,9 +2094,8 @@ define <vscale x 2 x i64> @vmadd_vx_nxv2i64_ta(<vscale x 2 x i64> %a, i64 %b, <v
 ;
 ; RV64-LABEL: vmadd_vx_nxv2i64_ta:
 ; RV64:       # %bb.0:
-; RV64-NEXT:    vsetvli zero, a1, e64, m2, ta, ma
-; RV64-NEXT:    vmacc.vx v10, a0, v8
-; RV64-NEXT:    vmerge.vvm v8, v8, v10, v0
+; RV64-NEXT:    vsetvli zero, a1, e64, m2, ta, mu
+; RV64-NEXT:    vmadd.vx v8, a0, v10, v0.t
 ; RV64-NEXT:    ret
   %elt.head = insertelement <vscale x 2 x i64> poison, i64 %b, i32 0
   %vb = shufflevector <vscale x 2 x i64> %elt.head, <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
@@ -2261,9 +2241,8 @@ define <vscale x 4 x i64> @vmadd_vx_nxv4i64_ta(<vscale x 4 x i64> %a, i64 %b, <v
 ;
 ; RV64-LABEL: vmadd_vx_nxv4i64_ta:
 ; RV64:       # %bb.0:
-; RV64-NEXT:    vsetvli zero, a1, e64, m4, ta, ma
-; RV64-NEXT:    vmacc.vx v12, a0, v8
-; RV64-NEXT:    vmerge.vvm v8, v8, v12, v0
+; RV64-NEXT:    vsetvli zero, a1, e64, m4, ta, mu
+; RV64-NEXT:    vmadd.vx v8, a0, v12, v0.t
 ; RV64-NEXT:    ret
   %elt.head = insertelement <vscale x 4 x i64> poison, i64 %b, i32 0
   %vb = shufflevector <vscale x 4 x i64> %elt.head, <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
@@ -2412,9 +2391,8 @@ define <vscale x 8 x i64> @vmadd_vx_nxv8i64_ta(<vscale x 8 x i64> %a, i64 %b, <v
 ;
 ; RV64-LABEL: vmadd_vx_nxv8i64_ta:
 ; RV64:       # %bb.0:
-; RV64-NEXT:    vsetvli zero, a1, e64, m8, ta, ma
-; RV64-NEXT:    vmacc.vx v16, a0, v8
-; RV64-NEXT:    vmerge.vvm v8, v8, v16, v0
+; RV64-NEXT:    vsetvli zero, a1, e64, m8, ta, mu
+; RV64-NEXT:    vmadd.vx v8, a0, v16, v0.t
 ; RV64-NEXT:    ret
   %elt.head = insertelement <vscale x 8 x i64> poison, i64 %b, i32 0
   %vb = shufflevector <vscale x 8 x i64> %elt.head, <vscale x 8 x i64> poison, <vscale x 8 x i32> zeroinitializer

Jan 18 '24 12:01 llvmbot

Haven't read the patch, but I'm not convinced of the correctness of the justification.

Consider the following case: VL = 2, SEW = 32, VLEN=128, TU V1= VADD PT, A, B // result is [A[0] + B[0], A[1] + B[1], PT[2], PT[3]] VL=4 V2 = VMERGE undef, V1, C, <true, true, false, true> // result is [A[0] + B[0], A[1] + B[1], C[2], PT[3]]

The transform would create: VL = 2, SEW = 32, VLEN=128, TU/MU V1= VADD C, A, B, <true, true, false, true> // result is [A[0] + B[0], A[1] + B[1], C[2], C[3]]

Note the difference in the last element.

Not sure I got my example entirely correct, but the basic idea I'm aiming for is that if the original sequence produces values which consume 4 distinct input registers, there's no folded form for the VADD which can do the same. Unless I'm missing something?

Jan 19 '24 19:01 preames

We have an additional constraint that C has to be the same as PT though. In order for folding to occur all of these need to be “equivalent”

VADD’s passthru
VMERGE’s passthru
VMERGES’s false

The code structure doesn’t make this particularly clear, and by “equivalent” I mean they all need to be undef or equal to VMERGE’s false IIUC. So we always transform from 3 distinct input registers to 3 input registers: A, B, and VMERGE’s false.

So in the example below, C = PT so the two results would be the same.

Jan 22 '24 15:01 lukel97

The formatting of the above comment got messed up when I replied via email, sorry about that. Should be fixed now.

Jan 22 '24 16:01 lukel97

Rebased, gentle ping

Jan 30 '24 03:01 lukel97

Rebased, ping

Mar 07 '24 08:03 lukel97

Ping

Mar 27 '24 09:03 lukel97

LGTM.

Thanks. Will wait to hear back from @preames as well to make sure the correctness point is addressed

Mar 28 '24 08:03 lukel97

Could this PR replace #78403 ?

Mar 29 '24 12:03 sun-jacobi

Could this PR replace #78403 ?

Not on its own, because the issue in #78403 stems from the fact that DAGCombiner::foldSelectWithIdentityConstant doesn't kick in with the sext/zext in between the binary op, so we don't transform

add a, (zext (vselect Cond, 0, b))

to

vselect Cond, a, (add a, (zext FVal))

So the vselect/vmerge won't be pulled outside the add which is needed for the vmerge peephole.

With that said, I returned to this PR because I was attempting to generalize #78403 by teaching DAGCombiner::foldSelectWithIdentityConstant to look through extends. That way we would have the vmerge pulled outside the add/sub, and the vmerge peephole would automatically take care of folding the mask and we wouldn't need combineVWADDWSelect. And I think we would also be able to handle muls and floating point binary ops for free too. But this PR was needed to preserve some of the test cases in #78403

Mar 29 '24 12:03 lukel97

Thank you for the explanation.

Mar 29 '24 12:03 sun-jacobi

Sending out one last ping before I land this. I've rebased it and done another look over, it still seems correct to me.

This PR allows the transform to happen when both the vmerge and the true node have passsthru operands, AND vmerge's passthru is implicit.

It was most likely restricted previously because when vmerge's passthru is implicit the policy becomes tail agnostic.

But even if both the vmerge and true node have passthru operands, its still correct to do the transform and use a tail agnostic policy.

Because the original vmerge's tail was undef (its passthru was implicit), so we can clobber the resulting node's tail with a tail agnostic policy (the VL is the same)

Jul 04 '24 14:07 lukel97

Sending out one last ping before I land this. I've rebased it and done another look over, it still seems correct to me.

This PR allows the transform to happen when both the vmerge and the true node have passsthru operands, AND vmerge's passthru is implicit.

It was most likely restricted previously because when vmerge's passthru is implicit the policy becomes tail agnostic.

But even if both the vmerge and true node have passthru operands, its still correct to do the transform and use a tail agnostic policy.

Because the original vmerge's tail was undef (its passthru was implicit), so we can clobber the resulting node's tail with a tail agnostic policy (the VL is the same)

Everywhere you wrote "implicit" did you meant "implicit_def"?

Jul 05 '24 16:07 topperc

Everywhere you wrote "implicit" did you meant "implicit_def"?

Yes, but I think nowadays this might actually be NoRegister

Jul 05 '24 17:07 lukel97

LLVM Buildbot has detected a new failure on builder clang-ppc64-aix running on aix-ppc64 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/64/builds/219

Here is the relevant piece of the build log for the reference:

Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'lit :: googletest-timeout.py' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 9
not env -u FILECHECK_OPTS "/opt/freeware/bin/python3.9" /home/powerllvm/powerllvm_env/aix-ppc64/clang-ppc64-aix/llvm-project/llvm/utils/lit/lit.py -j1 --order=lexical -v Inputs/googletest-timeout    --param gtest_filter=InfiniteLoopSubTest --timeout=1 > /home/powerllvm/powerllvm_env/aix-ppc64/clang-ppc64-aix/build/utils/lit/tests/Output/googletest-timeout.py.tmp.cmd.out
# executed command: not env -u FILECHECK_OPTS /opt/freeware/bin/python3.9 /home/powerllvm/powerllvm_env/aix-ppc64/clang-ppc64-aix/llvm-project/llvm/utils/lit/lit.py -j1 --order=lexical -v Inputs/googletest-timeout --param gtest_filter=InfiniteLoopSubTest --timeout=1
# .---command stderr------------
# | lit.py: /home/powerllvm/powerllvm_env/aix-ppc64/clang-ppc64-aix/llvm-project/llvm/utils/lit/lit/main.py:72: note: The test suite configuration requested an individual test timeout of 0 seconds but a timeout of 1 seconds was requested on the command line. Forcing timeout to be 1 seconds.
# | Traceback (most recent call last):
# |   File "/home/powerllvm/powerllvm_env/aix-ppc64/clang-ppc64-aix/llvm-project/llvm/utils/lit/lit/formats/googletest.py", line 304, in post_process_shard_results
# |     testsuites = json.load(f)["testsuites"]
# |   File "/opt/freeware/lib64/python3.9/json/__init__.py", line 293, in load
# |     return loads(fp.read(),
# |   File "/opt/freeware/lib64/python3.9/json/__init__.py", line 346, in loads
# |     return _default_decoder.decode(s)
# |   File "/opt/freeware/lib64/python3.9/json/decoder.py", line 337, in decode
# |     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
# |   File "/opt/freeware/lib64/python3.9/json/decoder.py", line 355, in raw_decode
# |     raise JSONDecodeError("Expecting value", s, err.value) from None
# | json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
# | 
# | During handling of the above exception, another exception occurred:
# | 
# | Traceback (most recent call last):
# |   File "/home/powerllvm/powerllvm_env/aix-ppc64/clang-ppc64-aix/llvm-project/llvm/utils/lit/lit.py", line 6, in <module>
# |     main()
# |   File "/home/powerllvm/powerllvm_env/aix-ppc64/clang-ppc64-aix/llvm-project/llvm/utils/lit/lit/main.py", line 129, in main
# |     selected_tests, discovered_tests = GoogleTest.post_process_shard_results(
# |   File "/home/powerllvm/powerllvm_env/aix-ppc64/clang-ppc64-aix/llvm-project/llvm/utils/lit/lit/formats/googletest.py", line 306, in post_process_shard_results
# |     raise RuntimeError(
# | RuntimeError: Failed to parse json file: /home/powerllvm/powerllvm_env/aix-ppc64/clang-ppc64-aix/build/utils/lit/tests/Inputs/googletest-timeout/DummySubDir/OneTest.py-googletest-timeout-53608920-1-2.json
# | 
# `-----------------------------
# RUN: at line 11
FileCheck --check-prefix=CHECK-INF < /home/powerllvm/powerllvm_env/aix-ppc64/clang-ppc64-aix/build/utils/lit/tests/Output/googletest-timeout.py.tmp.cmd.out /home/powerllvm/powerllvm_env/aix-ppc64/clang-ppc64-aix/build/utils/lit/tests/googletest-timeout.py
# executed command: FileCheck --check-prefix=CHECK-INF /home/powerllvm/powerllvm_env/aix-ppc64/clang-ppc64-aix/build/utils/lit/tests/googletest-timeout.py
# .---command stderr------------
# | /home/powerllvm/powerllvm_env/aix-ppc64/clang-ppc64-aix/build/utils/lit/tests/googletest-timeout.py:34:14: error: CHECK-INF: expected string not found in input
# | # CHECK-INF: Timed Out: 1
# |              ^
# | <stdin>:13:29: note: scanning from here
# | Reached timeout of 1 seconds
# |                             ^
# | <stdin>:15:21: note: possible intended match here
# | TIMEOUT: googletest-timeout :: DummySubDir/OneTest.py/1/2 (2 of 2)
# |                     ^
# | 
# | Input file: <stdin>
...

Jul 06 '24 06:07 llvm-ci

llvm-project llvm-project copied to clipboard

[RISCV] Allow folding vmerge with implicit passthru when true has tied dest

llvm-project
llvm-project copied to clipboard