llvm-project
llvm-project copied to clipboard
riscv 64-bit popcount uses inefficient constant materialization
Consider:
int a(unsigned long long x) { return __builtin_popcountll(x); }
Targeting rv64, this generates:
a:
srli a1, a0, 1
lui a2, 349525
addiw a2, a2, 1365
slli a3, a2, 32
add a2, a2, a3
and a1, a1, a2
sub a0, a0, a1
lui a1, 209715
addiw a1, a1, 819
slli a2, a1, 32
add a1, a1, a2
and a2, a0, a1
srli a0, a0, 2
and a0, a0, a1
add a0, a0, a2
srli a1, a0, 4
add a0, a0, a1
lui a1, 61681
addiw a1, a1, -241
slli a2, a1, 32
add a1, a1, a2
and a0, a0, a1
lui a1, 4112
addiw a1, a1, 257
slli a2, a1, 32
add a1, a1, a2
mul a0, a0, a1
srli a0, a0, 56
ret
There are 4 constant integers involved in this computation: 0x5555555555555555, 0x3333333333333333, 0x0F0F0F0F0F0F0F0F, and 0x0101010101010101. The way we're materializing the constants is not efficient. In isolation, each of these takes 4 instructions to materialize, which I think is optimal... but the constants are related to each other. 0x3333333333333333 == (0x0F0F0F0F0F0F0F0F ^ (0x0F0F0F0F0F0F0F0F << 2)). 0x5555555555555555 == (0x3333333333333333 ^ (0x3333333333333333 << 1)). 0x0101010101010101 == (0x0F0F0F0F0F0F0F0F & (0x0F0F0F0F0F0F0F0F >> 3)).
@llvm/issue-subscribers-backend-risc-v
Author: Eli Friedman (efriedma-quic)
int a(unsigned long long x) { return __builtin_popcountll(x); }
Targeting rv64, this generates:
a:
srli a1, a0, 1
lui a2, 349525
addiw a2, a2, 1365
slli a3, a2, 32
add a2, a2, a3
and a1, a1, a2
sub a0, a0, a1
lui a1, 209715
addiw a1, a1, 819
slli a2, a1, 32
add a1, a1, a2
and a2, a0, a1
srli a0, a0, 2
and a0, a0, a1
add a0, a0, a2
srli a1, a0, 4
add a0, a0, a1
lui a1, 61681
addiw a1, a1, -241
slli a2, a1, 32
add a1, a1, a2
and a0, a0, a1
lui a1, 4112
addiw a1, a1, 257
slli a2, a1, 32
add a1, a1, a2
mul a0, a0, a1
srli a0, a0, 56
ret
There are 4 constant integers involved in this computation: 0x5555555555555555, 0x3333333333333333, 0x0F0F0F0F0F0F0F0F, and 0x0101010101010101. The way we're materializing the constants is not efficient. In isolation, each of these takes 4 instructions to materialize, which I think is optimal... but the constants are related to each other. 0x3333333333333333 == (0x0F0F0F0F0F0F0F0F ^ (0x0F0F0F0F0F0F0F0F << 2)). 0x5555555555555555 == (0x3333333333333333 ^ (0x3333333333333333 << 1)). 0x0101010101010101 == (0x0F0F0F0F0F0F0F0F & (0x0F0F0F0F0F0F0F0F >> 3)).
When materializing constants, we can't know its context and it's hard to know the connection of constants. The solution may be using your formulas instead of using APInt directly here: https://github.com/llvm/llvm-project/blob/72c729f354d71697a1402720c90b57ff521b6739/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp#L8669-L8676 But some targets may be able to materialize these constants easily, so I think this should be a custom lowering in RISCV target.
You could write a generic pass that collects all the constants in a block and checks whether one constant can be produced using a shift+xor of another constant. Not sure how generally useful such a pass would be.
Interesting idea, but is it computationally feasible to find bitwise relations between constants? I mean, for example, the constants used in bit count algorithms can be genericized to these, using division:
uintN_t mask1 = ((uintN_t)-1 / 0xFF) * 0x55;
uintN_t mask2 = ((uintN_t)-1 / 0xFF) * 0x33;
uintN_t mask4 = ((uintN_t)-1 / 0xFF) * 0x0F;
uintN_t multiplier = ((uintN_t)-1 / 0xFF);
And we should be not restricted to bitwise operations only to derive constants like these, and hence, if multiplications are allowed, I can use 3 multiplications instead of 6 bitwise operations you suggested to derive all necessary constants. It comes to the question of: Are these simplifications worth it, especially regarding the general optimization levels ("-O2" and "-Os")?