llvm-project
llvm-project copied to clipboard
[RISC-V][X86] Bitwise NOT is computed twice
I have this Zig code:
export fn count_zero_groups(x: u64) u64 {
return @popCount(~x & ~(~x << 1));
}
For the sifive-x280 CPU, we get this emit (Godbolt):
count_zero_groups:
not a1, a0
slli a1, a1, 1
not a1, a1
andn a0, a1, a0
cpop a0, a0
ret
As you can see, we are doing a NOT on a0 in the first instruction, as well as in the andn. In this case, this is a problem because we could have folded the not a1, a1 into the andn instead:
count_zero_groups:
not a1, a0
slli a2, a1, 1
andn a0, a1, a2
cpop a0, a0
ret
Also worth noting that we can change all the a2's to a0 in the above code. This is because we don't need x anymore, once we have ~x.
Note: this optimization already works properly on x86_64 machines:
count_zero_groups:
not rdi
lea rax, [rdi + rdi]
andn rax, rax, rdi
popcnt rax, rax
ret
X86 gets it right by accident. By manually commuting the operand in IR, I can get it to generate the same issue as RISC-V https://godbolt.org/z/M1sYYv7Tr
X86 gets it right by accident. By manually commuting the operand in IR, I can get it to generate the same issue as RISC-V https://godbolt.org/z/M1sYYv7Tr
X86 tries ANDN on operand 0 first because that's the operand of the instruction that is inverted. RISC-V tries operand 1 first becuase that's the inverted operand of its instruction.
@llvm/issue-subscribers-backend-risc-v
Author: Niles Salter (Validark)
export fn count_zero_groups(x: u64) u64 {
return @<!-- -->popCount(~x & ~(~x << 1));
}
For the sifive-x280 CPU, we get this emit (Godbolt):
count_zero_groups:
not a1, a0
slli a1, a1, 1
not a1, a1
andn a0, a1, a0
cpop a0, a0
ret
As you can see, we are doing a NOT on a0 in the first instruction, as well as in the andn. In this case, this is a problem because we could have folded the not a1, a1 into the andn instead:
count_zero_groups:
not a1, a0
slli a2, a1, 1
andn a0, a1, a2
cpop a0, a0
ret
Also worth noting that we can change all the a2's to a0 in the above code. This is because we don't need x anymore, once we have ~x.
Note: this optimization already works properly on x86_64 machines:
count_zero_groups:
not rdi
lea rax, [rdi + rdi]
andn rax, rax, rdi
popcnt rax, rax
ret
@EugeneZelenko Based on @topperc's analysis, this is not an issue with the RISC-V backend, correct?
@EugeneZelenko Based on @topperc's analysis, this is not an issue with the RISC-V backend, correct?
I think it's an issue with RISC-V backend and the X86 backend. I don't think there is a generic fix.
@llvm/issue-subscribers-backend-risc-v
Author: Niles Salter (Validark)
export fn count_zero_groups(x: u64) u64 {
return @<!-- -->popCount(~x & ~(~x << 1));
}
For the sifive-x280 CPU, we get this emit (Godbolt):
count_zero_groups:
not a1, a0
slli a1, a1, 1
not a1, a1
andn a0, a1, a0
cpop a0, a0
ret
As you can see, we are doing a NOT on a0 in the first instruction, as well as in the andn. In this case, this is a problem because we could have folded the not a1, a1 into the andn instead:
count_zero_groups:
not a1, a0
slli a2, a1, 1
andn a0, a1, a2
cpop a0, a0
ret
Also worth noting that we can change all the a2's to a0 in the above code. This is because we don't need x anymore, once we have ~x.
Note: this optimization already works properly on x86_64 machines:
count_zero_groups:
not rdi
lea rax, [rdi + rdi]
andn rax, rax, rdi
popcnt rax, rax
ret
@llvm/issue-subscribers-backend-x86
Author: Niles Salter (Validark)
export fn count_zero_groups(x: u64) u64 {
return @<!-- -->popCount(~x & ~(~x << 1));
}
For the sifive-x280 CPU, we get this emit (Godbolt):
count_zero_groups:
not a1, a0
slli a1, a1, 1
not a1, a1
andn a0, a1, a0
cpop a0, a0
ret
As you can see, we are doing a NOT on a0 in the first instruction, as well as in the andn. In this case, this is a problem because we could have folded the not a1, a1 into the andn instead:
count_zero_groups:
not a1, a0
slli a2, a1, 1
andn a0, a1, a2
cpop a0, a0
ret
Also worth noting that we can change all the a2's to a0 in the above code. This is because we don't need x anymore, once we have ~x.
Note: this optimization already works properly on x86_64 machines:
count_zero_groups:
not rdi
lea rax, [rdi + rdi]
andn rax, rax, rdi
popcnt rax, rax
ret
It's unfortunate that there is no generic fix, since there seems to be the same issue on these machines too:
Sparc: https://godbolt.org/z/rP5P4TxTq Power: https://godbolt.org/z/bxxTMx451
AndNot is not that uncommon. Some machines even have other similar instructions like OrNot. Can this really not be dealt with in a way where multiple architectures can use the same logic?