Add vzeroupper support for x86
According to section 15.3, "MIXING AVX CODE WITH SSE CODE", of the Intel Software Optimization manual: "if software inter-mixes AVX and SSE instructions without using VZEROUPPER properly, it can experience an AVX/SSE transition penalty."
This patch adds support for vzeroupper so that the JIT can emit this instruction and place it where it is needed.
I am not sure whether a patch this small, which adds support for a single x86 instruction, is suitable as a PR. However, I hope to ramp up on the repo and contribute to AVX2 code generation/emission in the coming months.
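For illustration, here is a minimal sketch of what emitting this instruction amounts to at the byte level. VZEROUPPER has the fixed 3-byte encoding C5 F8 77 (VEX.128.0F.WIG 77). The helper name `emit_vzeroupper` and the plain byte-buffer interface are assumptions made for this example; they are not the API added by the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not the patch's API): writes the 3-byte encoding
   of VZEROUPPER into buf. Returns the number of bytes written, or 0 if
   the buffer is too small. */
static size_t emit_vzeroupper(uint8_t *buf, size_t capacity)
{
    /* 2-byte VEX prefix (C5 F8) followed by opcode 77. */
    static const uint8_t encoding[3] = { 0xC5, 0xF8, 0x77 };
    size_t i;

    assert(buf != NULL);

    if (capacity < sizeof(encoding))
        return 0;

    for (i = 0; i < sizeof(encoding); i++)
        buf[i] = encoding[i];

    return sizeof(encoding);
}
```

A code generator would typically call something like this after the last 256-bit AVX instruction of a sequence and before control reaches code that may use legacy SSE encodings.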
This kind of thing does not really fit the concept of a generic JIT compiler, because it is more of a special case for a specific issue. It would be interesting to explore other options as well, such as zeroing the target registers directly with an xor operation, since that could be generated without any extra API call. It is also a big question for me whether AVX will ever be faster than SSE2. If not, this direction is not really worth the effort.
Something that might be interesting would be to combine this logic with other CPU-specific code paths that could be dynamically patched in.
I am sure, for example, that on really modern CPUs with a high-performance AVX-512 unit, AVX2 code should be able to perform better than SSE2.
Indeed, once CPUs with the next variable-length vector implementation (AVX10), which dynamically changes the vector size, are out, using AVX would definitely be faster, as it will also allow AVX-512 to work without any changes to the code.
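As a rough sketch of the CPU-specific code path idea: the example below picks an implementation once at run time based on CPU feature detection. It uses plain function-pointer selection rather than actual code patching, and it relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`. All function names here are hypothetical, and the "AVX2" variant is only a placeholder with the same scalar body.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*sum_fn)(const int *values, size_t count);

/* Baseline path, usable on any CPU. */
static int sum_baseline(const int *values, size_t count)
{
    int total = 0;
    size_t i;

    assert(values != NULL || count == 0);
    for (i = 0; i < count; i++)
        total += values[i];
    return total;
}

/* Placeholder for an AVX2-specific path. A real version would contain
   AVX2 code; this one reuses the scalar loop so the sketch stays portable. */
static int sum_avx2_placeholder(const int *values, size_t count)
{
    return sum_baseline(values, count);
}

/* Returns nonzero if the running CPU reports AVX2 support. On compilers
   or targets without the x86 builtins this conservatively returns 0. */
static int cpu_has_avx2(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0;
#endif
}

/* Selects the code path once; callers keep and reuse the pointer. */
static sum_fn select_sum(void)
{
    return cpu_has_avx2() ? sum_avx2_placeholder : sum_baseline;
}
```

In a JIT, the same check could instead decide which instruction sequence gets generated, so no indirect call would be needed in the emitted code.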
For the next release I decided to use the SSE2 code path. But it would be good to improve the vector use. I tried some methods, such as zeroing the 256-bit register before using it, but it was still slow. Moving the vzeroupper somewhere else also had no effect. Overall, I don't really understand what exactly is happening here, which makes it hard to maintain or redesign the code. It feels "fragile": if you want to play around with some new idea, you might just break it, and you won't even understand why. In the long run we need to understand how the CPU thinks, and how we can exploit it.