pygame-ce
pygame-ce copied to clipboard
Refactor AVX2 blitters using macros
This moves code repetition out of the blitter functions themselves into the macros developed for running the new AVX2 alpha blitters. Also adjusted a couple set_ instructions (like _mm256_set_epi16) to use set1_ variants instead so the value can just be provided once.
Makes simd_blitters_avx2 800 lines shorter.
I believe this has a small performance regression, which could be solved by rewiring the RUN_AVX2_BLITTER implementation strategy. Maybe we've got to put the DUFF loops back in there?
I don't know if you want to move this forward or what but i have done a lot of testing and have some findings to share. Basically the LOOP_UNROLLED isn't improving the situation much, but what's making it perform sligtly worse is the maskload/maskstore intructions compared to going pixel by pixel with something like this:
LOOP_UNROLLED4(
{
mm_src = _mm_cvtsi32_si128(*srcp);
mm_dst = _mm_cvtsi32_si128(*dstp);
{BLIT_CODE}
*dstp = _mm_cvtsi128_si32(mm_dst);
srcp += srcpxskip;
dstp += dstpxskip;
},
n, pre_8_width);
If you take a look at the latency/CPI of these instructions and sum them up you can see that it's faster if we're talking like 1-3/4 excess pixels but it's slower for 4-7 pixels, hence the maskload is better in that case.
I've managed to come up with another strategy that does two pixels at a time with _mm_loadl_epi64 so that we shave off some cycles and does a final pixel if there with _mm_cvtsi32_si128 and this was better all the time. Downside to that is that we'd have to pass in an AVX and an SSE blit code all the time.
Anyway this implementation could still use some simplifications like we don't actually need two shuffle masks for the 16 bit shuffle macro since we can directly use _mm256_unpacklo_epi8 / _mm256_unpackhi_epi8 with a zeroed register and get the same result. Which is what i did in #2642.
But I'd really encourage you to revive this PR if possible or maybe consider working on this together? Your choice.
Hey itz, I would like to resolve this as well.
All my efforts to find a way out for performance failed earlier this year, but I think a sub-5% performance regression is acceptable for blend modes. I wouldn't want to do that to the PREMUL blend, but for the others I think it would be acceptable. They're all so fast anyway. A cleaner and more understandable codebase has its own benefits.
I'd like to avoid interleaving SSE2 and AVX code, again for the simplicity of the codebase.
I've let this PR linger too long, I think it should be closed. It can be revisited in the future if necessary.