In x86inc, `pxor m0, m0` is translated inefficiently
It seems that in x86inc.asm, `pxor m0, m0` gets translated into `vpxor ymm0, ymm0, ymm0` when `mmsize == 32`. This is slightly inefficient: `vpxor xmm0, xmm0, xmm0` does the same thing and is preferred. Although the VEX-encoded xmm and ymm forms are both 4-byte instructions, on AMD CPUs before Zen 2 the ymm version costs 2 extra uops, and the EVEX version of the instruction really does take extra bytes to encode.
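For reference, a rough comparison of the encodings in question. This is a sketch; the byte sequences in the comments are what I'd expect NASM to emit, and the EVEX form only arises when AVX-512 registers or features force it:

```nasm
; Encoding comparison for the register-zeroing XOR idiom (assumed NASM output):
pxor   xmm0, xmm0            ; 66 0F EF C0 - 4 bytes, legacy SSE2
vpxor  xmm0, xmm0, xmm0      ; C5 F9 EF C0 - 4 bytes, VEX; also zeroes ymm0's upper half
vpxor  ymm0, ymm0, ymm0      ; C5 FD EF C0 - 4 bytes, VEX; 2 extra uops on AMD pre-Zen 2
vpxord xmm16, xmm16, xmm16   ; 6 bytes, EVEX (needed only for AVX-512 registers)
```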
I'm not exactly sure whether this should be changed in x86inc itself, or whether every place in the code that does this should be fixed to use `xm0` instead of `m0` (surprisingly, quite a few places don't).
This might be intentional as a way to avoid false dependencies "for free". That said, it obviously isn't actually free. It might be worth checking whether the common places where this is used already have a `vzeroupper` or some other previous zeroing.
`vpxor xmm0, xmm0, xmm0` also zeroes out the upper bits of ymm0, because VEX-encoded instructions implicitly zero the upper lanes of the destination register, similar to how `xor eax, eax` clears the upper bits of `rax`. So I think the xmm version should basically always be used.
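The parallel between the two idioms, spelled out (an illustrative sketch, not code from the tree):

```nasm
; 32-bit GPR writes zero-extend into the full 64-bit register:
xor    eax, eax            ; rax is now 0, all 64 bits
; VEX-encoded vector writes zero the upper lanes of the full register:
vpxor  xmm0, xmm0, xmm0    ; ymm0 (and zmm0, where applicable) is now 0, all bits
```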
OK, I missed that it was still VEX-coded. I think patching x86inc.asm (and sending a patch upstream if desired) makes the most sense. I can't think of any other side effects it would have.
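A patch to x86inc.asm would presumably special-case the self-XOR zeroing idiom in its instruction-emitting macros. As a rough standalone sketch of the idea (the macro name and structure here are hypothetical, not x86inc's actual implementation):

```nasm
; Hypothetical sketch: emit the xmm form of the zeroing XOR even in 256-bit
; (mmsize == 32) code paths. The real fix would hook x86inc's own instruction
; macros; this standalone macro only illustrates the intended behavior.
%macro ZERO_REG 1                  ; %1 = register number, e.g. ZERO_REG 0
    %if mmsize == 32
        vpxor xmm%1, xmm%1, xmm%1  ; VEX zeroing clears the full ymm register
    %else
        pxor  m%1, m%1             ; 128-bit and MMX paths unchanged
    %endif
%endmacro
```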
Yes, this is a valid optimization.
I actually considered implementing this in x86inc.asm in the past but never got around to doing it.