llvm-project riscv 64-bit popcount uses inefficient constant materialization

Consider:

int a(unsigned long long x) { return __builtin_popcountll(x); }

Targeting rv64, this generates:

a:
        srli    a1, a0, 1
        lui     a2, 349525
        addiw   a2, a2, 1365
        slli    a3, a2, 32
        add     a2, a2, a3
        and     a1, a1, a2
        sub     a0, a0, a1
        lui     a1, 209715
        addiw   a1, a1, 819
        slli    a2, a1, 32
        add     a1, a1, a2
        and     a2, a0, a1
        srli    a0, a0, 2
        and     a0, a0, a1
        add     a0, a0, a2
        srli    a1, a0, 4
        add     a0, a0, a1
        lui     a1, 61681
        addiw   a1, a1, -241
        slli    a2, a1, 32
        add     a1, a1, a2
        and     a0, a0, a1
        lui     a1, 4112
        addiw   a1, a1, 257
        slli    a2, a1, 32
        add     a1, a1, a2
        mul     a0, a0, a1
        srli    a0, a0, 56
        ret
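For readers who prefer C to assembly, the listing above appears to be the usual bit-parallel popcount. The following is a minimal sketch of that algorithm, not the code LLVM actually emits from; the names `popcount64_swar` and `check_popcount64_swar` are hypothetical, and each of the four constants corresponds to one `lui`/`addiw`/`slli`/`add` group in the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name: bit-parallel popcount mirroring the listing's steps. */
int popcount64_swar(uint64_t x)
{
    const uint64_t m1  = 0x5555555555555555ULL; /* lui 349525 / addiw 1365 */
    const uint64_t m2  = 0x3333333333333333ULL; /* lui 209715 / addiw 819  */
    const uint64_t m4  = 0x0F0F0F0F0F0F0F0FULL; /* lui 61681  / addiw -241 */
    const uint64_t h01 = 0x0101010101010101ULL; /* lui 4112   / addiw 257  */

    x = x - ((x >> 1) & m1);         /* per-2-bit counts */
    x = (x & m2) + ((x >> 2) & m2);  /* per-4-bit counts */
    x = (x + (x >> 4)) & m4;         /* per-byte counts  */
    return (int)((x * h01) >> 56);   /* sum of all bytes lands in the top byte */
}

/* Spot check against the builtin used in the issue (GCC/Clang only). */
void check_popcount64_swar(void)
{
    const uint64_t samples[] = { 0, 1, 0xFFULL, 0x0123456789ABCDEFULL, ~0ULL };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(popcount64_swar(samples[i]) == __builtin_popcountll(samples[i]));
}
```

Calling `check_popcount64_swar()` from a test driver compares the sketch with `__builtin_popcountll` on a few sample values.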

There are 4 integer constants involved in this computation: 0x5555555555555555, 0x3333333333333333, 0x0F0F0F0F0F0F0F0F, and 0x0101010101010101. The way we materialize them is not efficient. In isolation, each constant takes 4 instructions to materialize, which I think is optimal... but the constants are related to each other:

0x3333333333333333 == (0x0F0F0F0F0F0F0F0F ^ (0x0F0F0F0F0F0F0F0F << 2))
0x5555555555555555 == (0x3333333333333333 ^ (0x3333333333333333 << 1))
0x0101010101010101 == (0x0F0F0F0F0F0F0F0F & (0x0F0F0F0F0F0F0F0F >> 3))
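The three identities can be checked directly in C. This is a minimal sketch, assuming only 0x0F0F0F0F0F0F0F0F is materialized from scratch; the struct `popcount_masks` and the function `derive_popcount_masks` are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the four popcount constants. */
struct popcount_masks {
    uint64_t m1, m2, m4, h01;
};

/* Build one constant from scratch and derive the other three from it. */
struct popcount_masks derive_popcount_masks(void)
{
    struct popcount_masks k;
    k.m4  = 0x0F0F0F0F0F0F0F0FULL;  /* the only constant built from scratch */
    k.m2  = k.m4 ^ (k.m4 << 2);     /* shift + xor */
    k.m1  = k.m2 ^ (k.m2 << 1);     /* shift + xor */
    k.h01 = k.m4 & (k.m4 >> 3);     /* shift + and */

    /* Sanity checks: the derived values match the literal constants. */
    assert(k.m2  == 0x3333333333333333ULL);
    assert(k.m1  == 0x5555555555555555ULL);
    assert(k.h01 == 0x0101010101010101ULL);
    return k;
}
```

In this sketch each derived constant costs one shift plus one logical operation, compared with the four instructions each constant takes in the listing.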

Mar 21 '24 22:03 efriedma-quic

@llvm/issue-subscribers-backend-risc-v

Author: Eli Friedman (efriedma-quic)

Mar 21 '24 22:03 llvmbot

When materializing a constant, we don't know its context, so it's hard to discover relations between constants. One solution may be to use your formulas instead of using APInt directly here: https://github.com/llvm/llvm-project/blob/72c729f354d71697a1402720c90b57ff521b6739/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp#L8669-L8676 But some targets may be able to materialize these constants cheaply, so I think this should be a custom lowering in the RISC-V target.

Mar 22 '24 17:03 wangpc-pp

You could write a generic pass that collects all the constants in a block and checks whether one constant can be produced using a shift+xor of another constant. Not sure how generally useful such a pass would be.
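The core check such a pass would need can be sketched in a few lines of C. This is only an illustration of the shift+xor test on 64-bit values, not LLVM code; the helper name `find_shift_xor` and its sign convention for the shift are assumptions made here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: report whether target == base ^ (base << s) or
 * target == base ^ (base >> s) for some shift s in 1..63.
 * On success, *shift_out holds s for a left shift and -s for a right shift.
 */
bool find_shift_xor(uint64_t base, uint64_t target, int *shift_out)
{
    assert(shift_out != NULL);
    for (int s = 1; s < 64; ++s) {
        if ((base ^ (base << s)) == target) {
            *shift_out = s;
            return true;
        }
        if ((base ^ (base >> s)) == target) {
            *shift_out = -s;
            return true;
        }
    }
    return false;
}
```

For an ordered pair of 64-bit constants this tries at most 126 candidates, so the cost is dominated by the number of constant pairs in the block. The sketch covers only XOR, so it would not find the AND-based relation for 0x0101010101010101.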

Mar 22 '24 19:03 efriedma-quic

Interesting idea, but is it computationally feasible to find bitwise relations between constants? For example, the constants used in bit count algorithms can be generalized to any width using division:

uintN_t mask1 = ((uintN_t)-1 / 0xFF) * 0x55;
uintN_t mask2 = ((uintN_t)-1 / 0xFF) * 0x33;
uintN_t mask4 = ((uintN_t)-1 / 0xFF) * 0x0F;
uintN_t multiplier = ((uintN_t)-1 / 0xFF);

And we should not be restricted to bitwise operations when deriving constants like these. If multiplications are allowed, I can derive all the necessary constants with 3 multiplications instead of the 6 bitwise operations you suggested. This raises the question: are these simplifications worth it, especially at the general optimization levels ("-O2" and "-Os")?
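As a concrete instance of the division-based formulas, here is a minimal sketch with `uintN_t` fixed to `uint64_t`. The struct `masks_by_mul` and the function `derive_masks_by_mul` are hypothetical names; the arithmetic follows the four lines quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder, using the same field names as the formulas above. */
struct masks_by_mul {
    uint64_t mask1, mask2, mask4, multiplier;
};

/* Derive all masks from the multiplier with one multiplication each. */
struct masks_by_mul derive_masks_by_mul(void)
{
    struct masks_by_mul k;
    k.multiplier = (uint64_t)-1 / 0xFF;  /* every byte is 0x01 */
    k.mask1 = k.multiplier * 0x55;       /* every byte is 0x55 */
    k.mask2 = k.multiplier * 0x33;       /* every byte is 0x33 */
    k.mask4 = k.multiplier * 0x0F;       /* every byte is 0x0F */

    /* No byte overflows into its neighbour, so the products are exact. */
    assert(k.multiplier == 0x0101010101010101ULL);
    assert(k.mask4 == 0x0F0F0F0F0F0F0F0FULL);
    return k;
}
```

Whether three multiplications beat six shift and logical operations depends on the target's multiply latency, which this sketch does not measure.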

Mar 24 '24 14:03 Explorer09
