rav1e
In x86inc, `pxor m0, m0` is translated inefficiently
It seems that in x86inc.asm, `pxor m0, m0` gets translated into `vpxor ymm0, ymm0` when `mmsize == 32`. This is somewhat inefficient, because `vpxor xmm0, xmm0` does the same thing and is preferred: although the XMM and YMM forms are both 4-byte instructions, on AMD CPUs before Zen 2 the YMM version takes 2 extra uops, and the EVEX version of the instruction does take extra bytes to encode.

I'm not exactly sure whether this should be changed in x86inc itself, or whether all the places in the code that do this should fix it themselves by using `xm0` instead of `m0` (there are surprisingly quite a few places that don't).
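For illustration, the two spellings look like this in x86inc syntax (a sketch, not actual rav1e code; with `INIT_YMM`, `m0` maps to `ymm0` and `xm0` to `xmm0`):

```asm
; Hypothetical example, not from the rav1e codebase.
INIT_YMM avx2
cglobal zero_example, 0, 0, 1
    pxor  m0, m0   ; expands to `vpxor ymm0, ymm0, ymm0`
    pxor xm0, xm0  ; expands to `vpxor xmm0, xmm0, xmm0`; the upper bits
                   ; of ymm0 are still zeroed, implicitly by the VEX encoding
    RET
```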
This might be intentional, as a way to avoid false dependencies "for free". That said, it obviously isn't actually free. It might be worth checking whether the common places where this is used already have a `vzeroupper` or other previous zeroing.
`vpxor xmm0, xmm0` also zeroes out the upper bits of ymm0 because of implicit zeroing, similar to how `xor eax, eax` clears the upper bits of rax. So I think the XMM version should basically always be used.
OK, I missed that it was still VEX-coded. I think patching x86inc.asm (and sending a patch upstream, if desired) makes the most sense. I can't think of any other side effects it would have.
Yes, this is a valid optimization. I actually considered implementing this in x86inc.asm in the past but never got around to doing it.