In x86inc, `pxor m0, m0` is translated inefficiently
It seems that in x86inc.asm, `pxor m0, m0` gets translated into `vpxor ymm0, ymm0, ymm0` when `mmsize == 32`. This is slightly inefficient: `vpxor xmm0, xmm0, xmm0` does the same thing and is preferred. Although the VEX-encoded xmm and ymm forms are both 4-byte instructions, on AMD CPUs before Zen 2 the ymm version costs 2 extra uops, and the EVEX version of the instruction really does take extra bytes to encode.
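For reference, a rough comparison of the encodings in question. This is a sketch; the byte sequences in the comments are what I'd expect NASM to emit, and the EVEX form only arises when AVX-512 registers or features force it:

```nasm
; Encoding comparison for the register-zeroing XOR idiom (assumed NASM output):
pxor   xmm0, xmm0            ; 66 0F EF C0 - 4 bytes, legacy SSE2
vpxor  xmm0, xmm0, xmm0      ; C5 F9 EF C0 - 4 bytes, VEX; also zeroes ymm0's upper half
vpxor  ymm0, ymm0, ymm0      ; C5 FD EF C0 - 4 bytes, VEX; 2 extra uops on AMD pre-Zen 2
vpxord xmm16, xmm16, xmm16   ; 6 bytes, EVEX (needed only for AVX-512 registers)
```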
I'm not exactly sure whether this should be changed in x86inc itself, or whether every place in the code that does this should be fixed to use `xm0` instead of `m0` (surprisingly, quite a few places don't).
This might be intentional as a way to avoid false dependencies "for free". That said, it obviously isn't actually free. It might be worth checking whether the common places where this is used already have a `vzeroupper` or some other previous zeroing.
`vpxor xmm0, xmm0, xmm0` also zeroes out the upper bits of ymm0, because VEX-encoded instructions implicitly zero the upper lanes of the destination register, similar to how `xor eax, eax` clears the upper bits of `rax`. So I think the xmm version should basically always be used.
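The parallel between the two idioms, spelled out (an illustrative sketch, not code from the tree):

```nasm
; 32-bit GPR writes zero-extend into the full 64-bit register:
xor    eax, eax            ; rax is now 0, all 64 bits
; VEX-encoded vector writes zero the upper lanes of the full register:
vpxor  xmm0, xmm0, xmm0    ; ymm0 (and zmm0, where applicable) is now 0, all bits
```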
OK, I missed that it was still VEX-coded. I think patching x86inc.asm (and sending a patch upstream if desired) makes the most sense. I can't think of any other side effects it would have.
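A patch to x86inc.asm would presumably special-case the self-XOR zeroing idiom in its instruction-emitting macros. As a rough standalone sketch of the idea (the macro name and structure here are hypothetical, not x86inc's actual implementation):

```nasm
; Hypothetical sketch: emit the xmm form of the zeroing XOR even in 256-bit
; (mmsize == 32) code paths. The real fix would hook x86inc's own instruction
; macros; this standalone macro only illustrates the intended behavior.
%macro ZERO_REG 1                  ; %1 = register number, e.g. ZERO_REG 0
    %if mmsize == 32
        vpxor xmm%1, xmm%1, xmm%1  ; VEX zeroing clears the full ymm register
    %else
        pxor  m%1, m%1             ; 128-bit and MMX paths unchanged
    %endif
%endmacro
```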
Yes, this is a valid optimization.
I actually considered implementing this in x86inc.asm in the past but never got around to doing it.