llvm-project icon indicating copy to clipboard operation
llvm-project copied to clipboard

DAGCombine carry folding makes code worse in given testcase

Open efriedma-quic opened this issue 1 year ago • 1 comments

Testcase derived from boringssl: boringssl-bcm-testcase.txt. The code is sort of complicated, but I'm not sure how to simplify it.

I've found that if I revert D64190 (e9aed963ce36) and D57302 (bddb8c359739), the generated code for the given testcase is significantly shorter on both 32-bit x86 and 32-bit ARM. The effect is a bit more obvious on ARM, where the mrs/msr operations to spill the carry bit are more distinct, but it's significant either way.

I think what's happening is that the we do some combines that seem to improve the code locally, but end up forcing the scheduler to spill the carry bit, which then makes the code worse overall. Or maybe the scheduler just doesn't realize it should avoid spilling the carry bit? Not quite sure which.

64-bit targets don't have the same issue on the given testcase, but I assume that's because of the integer widths involved, not because 64-bit targets are somehow immune to the issue.

On a related note, we might want to teach the ARM backend to use a different sequence to save/restore the carry bit, if we expect carry spills to show up in practical code. msr/mrs appear to be slower than other ways of accessing the carry bit.

CC @topperc @RKSimon @deadalnix

efriedma-quic avatar Aug 15 '22 19:08 efriedma-quic

@llvm/issue-subscribers-backend-arm

llvmbot avatar Aug 15 '22 19:08 llvmbot

CC @chfast

RKSimon avatar Aug 19 '22 08:08 RKSimon

Is this regression between 14 and 15? I don't see big difference in length. https://godbolt.org/z/Mav8sc5qq

Any ideas how to reduce the test case?

chfast avatar Aug 19 '22 13:08 chfast

This isn't a recent regression, no. Was reported to me against an old version, and didn't get around to reporting it here for a while. Compare against clang-8.

I don't have any great ideas for reducing this; it's fundamentally tied to scheduling behavior, which makes any reduction sort of tricky. I guess you could try to creduce for the smallest testcase that still generates mrs; might not be exactly the same thing as the original, but should at least give some idea.

efriedma-quic avatar Aug 19 '22 18:08 efriedma-quic