[X86][AVX] Recognise out of bounds AVX2 shift amounts
Pulled out of #39822, which was a bit too general.
Unlike the general ISD SRA/SRL/SHL nodes, the AVX2 vector shift nodes X86ISD VSRAV/VSRLV/VSHLV handle out of bounds shift amounts:
- VSRAV clamps the unsigned shift amount to (BITWIDTH-1)
- VSRLV/VSHLV return zero for unsigned shift amounts greater than (BITWIDTH-1).
So when lowering vector shifts, we should be able to fold away any shift-amount clamp patterns and use the X86ISD node types directly.
e.g.
define <4 x i32> @ashr(<4 x i32> %sh, <4 x i32> %amt) {
%elt.min.i = tail call <4 x i32> @llvm.umin.v4i32(<4 x i32> %amt, <4 x i32> <i32 31, i32 31, i32 31, i32 31>)
%shr = ashr <4 x i32> %sh, %elt.min.i
ret <4 x i32> %shr
}
->
ashr(int vector[4], unsigned int vector[4]):
vpbroadcastd xmm2, dword ptr [rip + .LCPI0_0] # xmm2 = [31,31,31,31]
vpminud xmm1, xmm1, xmm2
vpsravd xmm0, xmm0, xmm1
ret
vs
ashr(int vector[4], unsigned int vector[4]):
vpsravd xmm0, xmm0, xmm1
ret
Logical shifts are trickier but also foldable:
define <4 x i32> @lshr(<4 x i32> %sh, <4 x i32> %amt) {
%cmp.i = icmp ult <4 x i32> %amt, <i32 32, i32 32, i32 32, i32 32>
%shr = lshr <4 x i32> %sh, %amt
%0 = select <4 x i1> %cmp.i, <4 x i32> %shr, <4 x i32> zeroinitializer
ret <4 x i32> %0
}
define <4 x i32> @lshr2(<4 x i32> %sh, <4 x i32> %amt) {
%cmp.i = icmp ult <4 x i32> %amt, <i32 32, i32 32, i32 32, i32 32>
%0 = select <4 x i1> %cmp.i, <4 x i32> %sh, <4 x i32> zeroinitializer
%shr = lshr <4 x i32> %0, %amt
ret <4 x i32> %shr
}
https://godbolt.org/z/W1r1c4fzY
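For reference, my expectation from the VSRLV semantics described above (not compiler output quoted from the issue) is that both lshr variants should likewise reduce to a single instruction:
lshr(int vector[4], unsigned int vector[4]):
vpsrlvd xmm0, xmm0, xmm1
ret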
Hi!
This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:
- In the comments of the issue, request for it to be assigned to you.
- Fix the issue locally.
- Run the test suite locally. Remember that the subdirectories under test/ create fine-grained testing targets, so you can e.g. use make check-clang-ast to only run Clang's AST tests.
- Create a Git commit.
- Run git clang-format HEAD~1 to format your changes.
- Open a pull request to the upstream repository on GitHub. Detailed instructions can be found in GitHub's documentation.
If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.
@RKSimon, I'm interested in working on this issue. Could you confirm whether all the described behavior of the AVX2 vector shift nodes (X86ISD VSRAV/VSRLV/VSHLV) and the handling of out-of-bounds shift amounts is implemented in the llvm/lib/Target/X86 directory?
I expect all of this to be handled as DAG combines inside X86ISelLowering.cpp
What command did you use for this?
@RKSimon, I am working on it, so could you please assign it to me?
I suggest you start with the SRA(X,UMIN(Y,BW-1)) case in combineShiftRightArithmetic
You should be able to use supportedVectorVarShift to detect when the target supports replacing the ISD::SRA node with X86ISD::VSRAV
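A rough sketch of that combine (my own, not the landed code; it assumes supportedVectorVarShift keeps a (VT, Subtarget, Opcode) signature and that N0/N1/VT are the usual locals in combineShiftRightArithmetic):
APInt SplatVal;
// Fold SRA(X, UMIN(Y, BitWidth-1)) -> X86ISD::VSRAV(X, Y): VSRAV already
// clamps the shift amount, so the explicit UMIN clamp is redundant.
if (N1.getOpcode() == ISD::UMIN &&
    supportedVectorVarShift(VT, Subtarget, ISD::SRA) &&
    ISD::isConstantSplatVector(N1.getOperand(1).getNode(), SplatVal) &&
    SplatVal == VT.getScalarSizeInBits() - 1)
  return DAG.getNode(X86ISD::VSRAV, SDLoc(N), VT, N0, N1.getOperand(0));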
@RKSimon, while reviewing the X86ISelLowering codebase, I encountered a scenario where the code handles an SRA operation and calls the combineShiftRightArithmetic function. However, when the following condition is met:
if (VT.isVector() || N1.getOpcode() != ISD::Constant ||
N0.getOpcode() != ISD::SHL || !N0.hasOneUse() ||
N0.getOperand(1).getOpcode() != ISD::Constant)
return SDValue();
the code returns without performing any optimization.
My observation is that even though we are working with vectors, the code does not invoke the combineVectorShiftImm function in these cases.
I would like to understand why the code behaves this way and does not call the combineVectorShiftImm function when dealing with vectors. Could you please provide some insights into this behavior?
The code below that early-out is for a single fold (see the description comment) - just above it is the combineShiftToPMULH fold - you can put your code after combineShiftToPMULH.
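In other words, the placement would be roughly this (a sketch of the call-site ordering, reusing the condition quoted above; not the exact surrounding code):
// In combineShiftRightArithmetic:
if (SDValue V = combineShiftToPMULH(N, DAG, Subtarget))
  return V;

// <- new SRA(X, UMIN(Y, BW-1)) fold fits here ->

// The early-out below only guards the scalar (sra (shl X, C1), C2) fold,
// so vector cases must be handled before it.
if (VT.isVector() || N1.getOpcode() != ISD::Constant ||
    N0.getOpcode() != ISD::SHL || !N0.hasOneUse() ||
    N0.getOperand(1).getOpcode() != ISD::Constant)
  return SDValue();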
@RKSimon I have been working on a particular piece of code in LLVM and encountered an unexpected behavior that I'm seeking clarification on.
APInt ShiftAmt;
SDNode *NNode = N1.getNode();
// Match SRA(X, UMIN(Y, BitWidth-1)): the UMIN only exists to clamp the
// shift amount into range, which the X86 variable shifts already
// guarantee, so the clamp can be dropped.
if (NNode->getOpcode() == ISD::UMIN &&
    ISD::isConstantSplatVector(NNode->getOperand(1).getNode(), ShiftAmt) &&
    ShiftAmt == VT.getScalarSizeInBits() - 1) {
  // Rebuild the shift with the unclamped amount Y.
  SDValue SHAmtVal = NNode->getOperand(0);
  SDLoc DL(N);
  return DAG.getNode(N->getOpcode(), DL, N->getVTList(), N0, SHAmtVal);
}
When running this code, I observed the following unexpected output:
vpsravd %xmm1, %xmm0, %xmm0
retq
I would greatly appreciate it if you could help me understand why the operand order of the xmm registers changes in this output.
The order just looks to be the difference between Intel and AT&T ASM syntax.
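For example, the same instruction in both syntaxes (AT&T reverses the operand order, putting sources first, and prefixes registers with %):
Intel: vpsravd xmm0, xmm0, xmm1
AT&T: vpsravd %xmm1, %xmm0, %xmm0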
@RKSimon, I think some changes also need to be made for the select and logical shift operations.
I suggest you just focus on the SRA case first
@RKSimon, could the code I shared above be improved? If it's incorrect, please let me know.
That SRA code looks fine - I'd like to see it in a PR, and then there is the question of adding test coverage...
@RKSimon, do tests need to be changed in combine-sra.ll?
Yes
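For instance, a new test along these lines (hypothetical function name; the file's existing RUN lines and update_llc_test_checks.py CHECK generation would apply):
define <4 x i32> @combine_vec_ashr_clamped(<4 x i32> %x, <4 x i32> %y) {
  %1 = tail call <4 x i32> @llvm.umin.v4i32(<4 x i32> %y, <4 x i32> <i32 31, i32 31, i32 31, i32 31>)
  %2 = ashr <4 x i32> %x, %1
  ret <4 x i32> %2
}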
ISD::SRA handling landed at #84426 / 186565513c57cd625ea7afd7b33897adfed7e9f8
@RKSimon, it seems SRL also needs to be addressed. Is that correct?
Logical shift left/right both need dealing with, but those will be combined from either VSELECT or AND nodes.
Could you please provide an example for AND?
This was the kind of thing I had in mind: https://gcc.godbolt.org/z/rdh9j3ocW
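For illustration, here is my own guess at the kind of IR that reaches the backend as an AND node (the link may show something different; @lshr_and is a made-up name). The select-with-zero becomes an AND of the shift result with the sign-extended compare mask:
define <4 x i32> @lshr_and(<4 x i32> %sh, <4 x i32> %amt) {
  %cmp = icmp ult <4 x i32> %amt, <i32 32, i32 32, i32 32, i32 32>
  %mask = sext <4 x i1> %cmp to <4 x i32>
  %shr = lshr <4 x i32> %sh, %amt
  %res = and <4 x i32> %shr, %mask
  ret <4 x i32> %res
}
Like the select forms above, this should fold to a single vpsrlvd.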