pygame-ce icon indicating copy to clipboard operation
pygame-ce copied to clipboard

Refactor AVX2 blitters using macros

Open Starbuck5 opened this issue 2 years ago • 4 comments
trafficstars

This moves code repetition out of the blitter functions themselves into the macros developed for running the new AVX2 alpha blitters. Also adjusted a couple set_ instructions (like _mm256_set_epi16) to use set1_ variants instead so the value can just be provided once.

Makes simd_blitters_avx2 800 lines shorter.

I believe this has a small performance regression, which could be solved by rewiring the RUN_AVX2_BLITTER implementation strategy. Maybe we've got to put the DUFF loops back in there?

Starbuck5 avatar Mar 26 '23 05:03 Starbuck5

I don't know if you want to move this forward or what but i have done a lot of testing and have some findings to share. Basically the LOOP_UNROLLED isn't improving the situation much, but what's making it perform sligtly worse is the maskload/maskstore intructions compared to going pixel by pixel with something like this:

LOOP_UNROLLED4(
      {
          mm_src = _mm_cvtsi32_si128(*srcp);
          mm_dst = _mm_cvtsi32_si128(*dstp);

          {BLIT_CODE}

          *dstp = _mm_cvtsi128_si32(mm_dst);

          srcp += srcpxskip;
          dstp += dstpxskip;
      },
      n, pre_8_width);

If you take a look at the latency/CPI of these instructions and sum them up you can see that it's faster if we're talking like 1-3/4 excess pixels but it's slower for 4-7 pixels, hence the maskload is better in that case. I've managed to come up with another strategy that does two pixels at a time with _mm_loadl_epi64 so that we shave off some cycles and does a final pixel if there with _mm_cvtsi32_si128 and this was better all the time. Downside to that is that we'd have to pass in an AVX and an SSE blit code all the time.

itzpr3d4t0r avatar Jan 02 '24 18:01 itzpr3d4t0r

Anyway this implementation could still use some simplifications like we don't actually need two shuffle masks for the 16 bit shuffle macro since we can directly use _mm256_unpacklo_epi8 / _mm256_unpackhi_epi8 with a zeroed register and get the same result. Which is what i did in #2642.

itzpr3d4t0r avatar Jan 02 '24 18:01 itzpr3d4t0r

But I'd really encourage you to revive this PR if possible or maybe consider working on this together? Your choice.

itzpr3d4t0r avatar Jan 02 '24 18:01 itzpr3d4t0r

Hey itz, I would like to resolve this as well.

All my efforts to find a way out for performance failed earlier this year, but I think a sub-5% performance regression is acceptable for blend modes. I wouldn't want to do that to the PREMUL blend, but for the others I think it would be acceptable. They're all so fast anyway. A cleaner and more understandable codebase has its own benefits.

I'd like to avoid interleaving SSE2 and AVX code, again for the simplicity of the codebase.

Starbuck5 avatar Jan 03 '24 04:01 Starbuck5

I've let this PR linger too long, I think it should be closed. It can be revisited in the future if necessary.

Starbuck5 avatar Nov 02 '24 09:11 Starbuck5