llvm-project icon indicating copy to clipboard operation
llvm-project copied to clipboard

[RISC-V][X86] Bitwise NOT is computed twice

Open Validark opened this issue 1 year ago • 8 comments

I have this Zig code:

export fn count_zero_groups(x: u64) u64 {
    return @popCount(~x & ~(~x << 1));
}

For the sifive-x280 CPU, we get this emit (Godbolt):

count_zero_groups:
        not     a1, a0
        slli    a1, a1, 1
        not     a1, a1
        andn    a0, a1, a0
        cpop    a0, a0
        ret

As you can see, we are doing a NOT on a0 in the first instruction, as well as in the andn. In this case, this is a problem because we could have folded the not a1, a1 into the andn instead:

count_zero_groups:
        not     a1, a0
        slli    a2, a1, 1
        andn    a0, a1, a2
        cpop    a0, a0
        ret

Also worth noting that we can change all the a2's to a0 in the above code. This is because we don't need x anymore, once we have ~x.

Note: this optimization already works properly on x86_64 machines:

count_zero_groups:
        not     rdi
        lea     rax, [rdi + rdi]
        andn    rax, rax, rdi
        popcnt  rax, rax
        ret

Validark avatar Mar 19 '24 20:03 Validark

X86 gets it right by accident. By manually commuting the operand in IR, I can get it to generate the same issue as RISC-V https://godbolt.org/z/M1sYYv7Tr

topperc avatar Mar 19 '24 20:03 topperc

X86 gets it right by accident. By manually commuting the operand in IR, I can get it to generate the same issue as RISC-V https://godbolt.org/z/M1sYYv7Tr

X86 tries ANDN on operand 0 first because that's the operand of the instruction that is inverted. RISC-V tries operand 1 first becuase that's the inverted operand of its instruction.

topperc avatar Mar 19 '24 20:03 topperc

@llvm/issue-subscribers-backend-risc-v

Author: Niles Salter (Validark)

I have this Zig code:
export fn count_zero_groups(x: u64) u64 {
    return @<!-- -->popCount(~x &amp; ~(~x &lt;&lt; 1));
}

For the sifive-x280 CPU, we get this emit (Godbolt):

count_zero_groups:
        not     a1, a0
        slli    a1, a1, 1
        not     a1, a1
        andn    a0, a1, a0
        cpop    a0, a0
        ret

As you can see, we are doing a NOT on a0 in the first instruction, as well as in the andn. In this case, this is a problem because we could have folded the not a1, a1 into the andn instead:

count_zero_groups:
        not     a1, a0
        slli    a2, a1, 1
        andn    a0, a1, a2
        cpop    a0, a0
        ret

Also worth noting that we can change all the a2's to a0 in the above code. This is because we don't need x anymore, once we have ~x.

Note: this optimization already works properly on x86_64 machines:

count_zero_groups:
        not     rdi
        lea     rax, [rdi + rdi]
        andn    rax, rax, rdi
        popcnt  rax, rax
        ret

llvmbot avatar Mar 19 '24 21:03 llvmbot

@EugeneZelenko Based on @topperc's analysis, this is not an issue with the RISC-V backend, correct?

Validark avatar Mar 19 '24 21:03 Validark

@EugeneZelenko Based on @topperc's analysis, this is not an issue with the RISC-V backend, correct?

I think it's an issue with RISC-V backend and the X86 backend. I don't think there is a generic fix.

topperc avatar Mar 19 '24 21:03 topperc

@llvm/issue-subscribers-backend-risc-v

Author: Niles Salter (Validark)

I have this Zig code:
export fn count_zero_groups(x: u64) u64 {
    return @<!-- -->popCount(~x &amp; ~(~x &lt;&lt; 1));
}

For the sifive-x280 CPU, we get this emit (Godbolt):

count_zero_groups:
        not     a1, a0
        slli    a1, a1, 1
        not     a1, a1
        andn    a0, a1, a0
        cpop    a0, a0
        ret

As you can see, we are doing a NOT on a0 in the first instruction, as well as in the andn. In this case, this is a problem because we could have folded the not a1, a1 into the andn instead:

count_zero_groups:
        not     a1, a0
        slli    a2, a1, 1
        andn    a0, a1, a2
        cpop    a0, a0
        ret

Also worth noting that we can change all the a2's to a0 in the above code. This is because we don't need x anymore, once we have ~x.

Note: this optimization already works properly on x86_64 machines:

count_zero_groups:
        not     rdi
        lea     rax, [rdi + rdi]
        andn    rax, rax, rdi
        popcnt  rax, rax
        ret

llvmbot avatar Mar 20 '24 08:03 llvmbot

@llvm/issue-subscribers-backend-x86

Author: Niles Salter (Validark)

I have this Zig code:
export fn count_zero_groups(x: u64) u64 {
    return @<!-- -->popCount(~x &amp; ~(~x &lt;&lt; 1));
}

For the sifive-x280 CPU, we get this emit (Godbolt):

count_zero_groups:
        not     a1, a0
        slli    a1, a1, 1
        not     a1, a1
        andn    a0, a1, a0
        cpop    a0, a0
        ret

As you can see, we are doing a NOT on a0 in the first instruction, as well as in the andn. In this case, this is a problem because we could have folded the not a1, a1 into the andn instead:

count_zero_groups:
        not     a1, a0
        slli    a2, a1, 1
        andn    a0, a1, a2
        cpop    a0, a0
        ret

Also worth noting that we can change all the a2's to a0 in the above code. This is because we don't need x anymore, once we have ~x.

Note: this optimization already works properly on x86_64 machines:

count_zero_groups:
        not     rdi
        lea     rax, [rdi + rdi]
        andn    rax, rax, rdi
        popcnt  rax, rax
        ret

llvmbot avatar Mar 20 '24 08:03 llvmbot

It's unfortunate that there is no generic fix, since there seems to be the same issue on these machines too:

Sparc: https://godbolt.org/z/rP5P4TxTq Power: https://godbolt.org/z/bxxTMx451

AndNot is not that uncommon. Some machines even have other similar instructions like OrNot. Can this really not be dealt with in a way where multiple architectures can use the same logic?

Validark avatar Mar 24 '24 14:03 Validark