
[X86][AVX] Recognise out of bounds AVX2 shift amounts

Open RKSimon opened this issue 1 year ago • 24 comments

Pulled out of #39822 which was a bit too general.

Unlike the general ISD SRA/SRL/SHL nodes, the AVX2 vector shift nodes X86ISD VSRAV/VSRLV/VSHLV handle out-of-bounds shift amounts:

  • VSRAV clamps the unsigned shift amount to (BITWIDTH-1)
  • VSRLV/VSHLV return zero for unsigned shift amounts greater than (BITWIDTH-1)

So when lowering vector shifts, we should be able to fold any shift-amount clamp patterns and use the X86ISD node types directly.

e.g.

define <4 x i32> @ashr(<4 x i32> %sh, <4 x i32> %amt) {
  %elt.min.i = tail call <4 x i32> @llvm.umin.v4i32(<4 x i32> %amt, <4 x i32> <i32 31, i32 31, i32 31, i32 31>)
  %shr = ashr <4 x i32> %sh, %elt.min.i
  ret <4 x i32> %shr
}

->

ashr(int vector[4], unsigned int vector[4]):
        vpbroadcastd    xmm2, dword ptr [rip + .LCPI0_0] # xmm2 = [31,31,31,31]
        vpminud xmm1, xmm1, xmm2
        vpsravd xmm0, xmm0, xmm1
        ret

vs

ashr(int vector[4], unsigned int vector[4]):
        vpsravd xmm0, xmm0, xmm1
        ret

Logical shifts are trickier but also foldable:

define <4 x i32> @lshr(<4 x i32> %sh, <4 x i32> %amt) {
  %cmp.i = icmp ult <4 x i32> %amt, <i32 32, i32 32, i32 32, i32 32>
  %shr = lshr <4 x i32> %sh, %amt
  %0 = select <4 x i1> %cmp.i, <4 x i32> %shr, <4 x i32> zeroinitializer
  ret <4 x i32> %0
}

define <4 x i32> @lshr2(<4 x i32> %sh, <4 x i32> %amt) {
  %cmp.i = icmp ult <4 x i32> %amt, <i32 32, i32 32, i32 32, i32 32>
  %0 = select <4 x i1> %cmp.i, <4 x i32> %sh, <4 x i32> zeroinitializer
  %shr = lshr <4 x i32> %0, %amt
  ret <4 x i32> %shr
}

RKSimon avatar Mar 04 '24 13:03 RKSimon

@llvm/issue-subscribers-backend-x86

Author: Simon Pilgrim (RKSimon)

llvmbot avatar Mar 04 '24 13:03 llvmbot

Hi!

This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:

  1. In the comments of the issue, request for it to be assigned to you.
  2. Fix the issue locally.
  3. Run the test suite locally. Remember that the subdirectories under test/ create fine-grained testing targets, so you can e.g. use make check-clang-ast to only run Clang's AST tests.
  4. Create a Git commit.
  5. Run git clang-format HEAD~1 to format your changes.
  6. Open a pull request to the upstream repository on GitHub. Detailed instructions can be found in GitHub's documentation.

If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.

llvmbot avatar Mar 04 '24 13:03 llvmbot

@llvm/issue-subscribers-good-first-issue

Author: Simon Pilgrim (RKSimon)

llvmbot avatar Mar 04 '24 13:03 llvmbot

https://godbolt.org/z/W1r1c4fzY

RKSimon avatar Mar 04 '24 13:03 RKSimon

@RKSimon, I'm interested in working on this issue. Could you confirm if all the mentioned behavior regarding the AVX2 vector shift nodes (X86ISD VSRAV/VSRLV/VSHLV) and the handling of out-of-bounds shift amounts occurs in the llvm/lib/Target/X86 directory?

SahilPatidar avatar Mar 04 '24 15:03 SahilPatidar

I expect all of this to be handled as DAG combines inside X86ISelLowering.cpp

RKSimon avatar Mar 04 '24 15:03 RKSimon

What command did you use for this?

SahilPatidar avatar Mar 04 '24 15:03 SahilPatidar

@RKSimon, I am working on it, so could you please assign it to me?

SahilPatidar avatar Mar 04 '24 16:03 SahilPatidar

I suggest you start with the SRA(X,UMIN(Y,BW-1)) case in combineShiftRightArithmetic

You should be able to use supportedVectorVarShift to detect when the target supports replacing the ISD::SRA node with X86ISD::VSRAV

RKSimon avatar Mar 04 '24 16:03 RKSimon

@RKSimon, while reviewing the X86ISelLowering codebase, I encountered a scenario where the code executes the SRA operation and calls the combineShiftRightArithmetic function. However, when the following condition is met:

if (VT.isVector() || N1.getOpcode() != ISD::Constant ||
    N0.getOpcode() != ISD::SHL || !N0.hasOneUse() ||
    N0.getOperand(1).getOpcode() != ISD::Constant)
  return SDValue();

the code returns without performing any optimization.

My observation is that even though we are working with vectors, the code does not invoke the combineVectorShiftImm function in these cases.

I would like to understand why the code behaves this way and does not call the combineVectorShiftImm function when dealing with vectors. Could you please provide some insights into this behavior?

SahilPatidar avatar Mar 04 '24 17:03 SahilPatidar

The code below that early-out is for a single fold (see the description comment) - just above it is the combineShiftToPMULH fold - you can put your code after combineShiftToPMULH.

RKSimon avatar Mar 04 '24 18:03 RKSimon

@RKSimon I have been working on a particular piece of code in LLVM and encountered an unexpected behavior that I'm seeking clarification on.

APInt ShiftAmt;
SDNode *NNode = N1.getNode();
// Fold SRA(X, UMIN(Y, BW-1)) -> SRA(X, Y): the clamp is redundant once the
// shift lowers to a variable shift that clamps the amount itself.
if (NNode->getOpcode() == ISD::UMIN &&
    ISD::isConstantSplatVector(NNode->getOperand(1).getNode(), ShiftAmt) &&
    ShiftAmt == VT.getScalarSizeInBits() - 1) {
  SDValue SHAmtVal = NNode->getOperand(0);
  SDLoc DL(N);
  return DAG.getNode(N->getOpcode(), DL, N->getVTList(), N0, SHAmtVal);
}

When running this code, I observed the following unexpected output:

vpsravd %xmm1, %xmm0, %xmm0
retq

I would greatly appreciate it if you could help me understand why the position of xmm is changing in the output.

SahilPatidar avatar Mar 05 '24 18:03 SahilPatidar

The order looks to just be the difference between Intel and AT&T assembly syntax.

RKSimon avatar Mar 05 '24 19:03 RKSimon

@RKSimon, I think some changes also need to be made to the select and logical shift handling.

SahilPatidar avatar Mar 06 '24 09:03 SahilPatidar

I suggest you just focus on the SRA case first

RKSimon avatar Mar 06 '24 10:03 RKSimon

@RKSimon, could the code I shared above be improved? If it's incorrect, please let me know.

SahilPatidar avatar Mar 06 '24 10:03 SahilPatidar

That SRA code looks fine - I'd like to see it in a PR, and then there is the question of adding test coverage....

RKSimon avatar Mar 06 '24 11:03 RKSimon

@RKSimon, Do tests need to be changed in combine-sra.ll?

SahilPatidar avatar Mar 06 '24 17:03 SahilPatidar

@RKSimon, Do tests need to be changed in combine-sra.ll?

Yes

RKSimon avatar Mar 06 '24 17:03 RKSimon

ISD::SRA handling landed at #84426 / 186565513c57cd625ea7afd7b33897adfed7e9f8

RKSimon avatar Mar 15 '24 16:03 RKSimon

@RKSimon, it seems SRL also needs to be addressed. Is that correct?

SahilPatidar avatar Mar 18 '24 12:03 SahilPatidar

Logical shift left/right both need dealing with, but those will be combined from either VSELECT or AND nodes.

RKSimon avatar Mar 18 '24 18:03 RKSimon

Logical shift left/right both need dealing with, but those will be combined from either VSELECT or AND nodes.

could you please provide an example for AND?

SahilPatidar avatar Mar 23 '24 10:03 SahilPatidar

This was the kind of thing I had in mind: https://gcc.godbolt.org/z/rdh9j3ocW

RKSimon avatar Mar 24 '24 09:03 RKSimon