Implement mathematically correct scalar and x86 SIMD blitters
AVX2 and SSE4.1 Enhanced Blitters
Hello all, I bring AVX2 and SSE4.1 intrinsic-powered blitters for the BlitNtoNPixelAlpha function.
These blitters have been in use for about a year and a half in the Cogmind community and were originally written for the game under the SDL 1.2 project. I have ported them to SDL3 with benchmarked enhancements and made them usable in the general case for 4-byte-per-pixel (32-bit) alpha blending.
The SSE4.1 pipeline blends pixels two at a time; the AVX2 pipeline blends four at a time. Both pipelines first perform a 4-pixel-wide shuffle within a buffer to color-shift the pixels into ARGB format before alpha blending.
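For orientation, this is the per-channel operation the SIMD lanes vectorize, written as scalar C (a sketch for illustration; the vector code approximates the division by 255 with cheaper operations):

/* Straight-alpha blend of a single 8-bit channel:
 * dst' = dst + ((src - dst) * srcAlpha) / 255 */
static Uint8 blend_channel(Uint8 src, Uint8 dst, Uint8 srcAlpha)
{
    return (Uint8)(dst + (((src - dst) * (int)srcAlpha) / 255));
}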
Benchmark
I have crafted a benchmark that stresses the blitting API by drawing 16px squares at random positions in an 800x600px window as fast as possible. Sources here
Performance breakdown
I have tested these blitters on a number of different systems running the three major operating systems on x86_64, as well as under AMD64-to-ARM64 emulation through Rosetta and Windows 11 ARM.
Some tests were built with -Ofast and auto-vectorization enabled via -mavx2 or -msse4.1, and some were not, in order to demonstrate the existing scalar algorithm's full capabilities. This accounts for the differences in the stock numbers between runs.
macOS (auto-vectorization, optimized)
- 2020 i7 1068NG7 MBP (AVX2): 8-9 FPS -> 20-30 FPS
- 2019 i9 9980HK MBP (AVX2): 5.6 FPS -> 23.81 FPS
- 2021 M1 Max (SSE4.1 Rosetta+Wine): 11.4 FPS -> 50.0 FPS
Linux (auto-vectorization, optimized, all AVX2)
- Steam Deck (SteamOS): 7 FPS -> 11-14 FPS
- Ryzen 5 5600X (Fedora): 10 FPS -> 22 FPS
- Ryzen 7 5800X (Ubuntu 22.04): 10.4 FPS -> 21.74 FPS
Windows (no auto-vectorization)
- 2020 i7 1068NG7 MBP (AVX2): 0.79 FPS -> 11.2 FPS
- 2014 i7-4790k Win10 laptop (AVX2): 1 FPS -> 14 FPS
- 2009 Intel Core 2 Duo (P8700) MBP (first processor family with SSE4.1): 0.51 FPS -> 5.74 FPS
- 2021 M1 Max (SSE4.1 through Windows 11 ARM VM): 4.20 FPS -> 20.41 FPS
Huge shoutout to @c4t-4 for donating the Core2Duo laptop which proved invaluable for testing the SSE4.1 code on older platforms!
Existing Issue(s)
Resolves #8178
Hmm, any particular reason you put those files in a subdirectory? I don't think we typically do that for intrinsic-optimized routines; they could just go in the video directory with the other blitters.
There was -- I wasn't sure how to use the glob macro in the CMakeLists.txt to filter out the modules otherwise. I'll read the macro and give it a shot.
Got that sorted.
Suggest the following (against your existing patch).
P.S.: convertPixelFormat (and convertPixelFormatsx4) are too generic names, something better? (Otherwise they may lead to clashes if people link to SDL statically.)
(EDIT: missing diff now added..)
diff --git a/src/video/SDL_blit_A_avx2.c b/src/video/SDL_blit_A_avx2.c
index ccb65f7..cc83157 100644
--- a/src/video/SDL_blit_A_avx2.c
+++ b/src/video/SDL_blit_A_avx2.c
@@ -7,16 +7,13 @@
#include "SDL_blit.h"
#include "SDL_blit_A_sse4_1.h"
-#if !defined(_MSC_VER) || (defined(_MSC_VER) && defined(__clang__))
-__attribute__((target("avx2")))
-#endif
/**
* Using the AVX2 instruction set, blit eight pixels with alpha blending
* @param src A pointer to four 32-bit pixels of ARGB format to blit into dst
* @param dst A pointer to four 32-bit pixels of ARGB format to retain visual data for while alpha blending
* @return A 128-bit wide vector of four alpha-blended pixels in ARGB format
*/
-__m128i MixRGBA_AVX2(__m128i src, __m128i dst) {
+__m128i SDL_TARGETING("avx2") MixRGBA_AVX2(__m128i src, __m128i dst) {
__m256i src_color = _mm256_cvtepu8_epi16(src);
__m256i dst_color = _mm256_cvtepu8_epi16(dst);
const __m256i SHUFFLE_ALPHA = _mm256_set_epi8(
@@ -43,10 +40,7 @@ __m128i MixRGBA_AVX2(__m128i src, __m128i dst) {
return _mm_add_epi8(mix, dst);
}
-#if !defined(_MSC_VER) || (defined(_MSC_VER) && defined(__clang__))
-__attribute__((target("avx2")))
-#endif
-void BlitNtoNPixelAlpha_AVX2(SDL_BlitInfo *info)
+void SDL_TARGETING("avx2") BlitNtoNPixelAlpha_AVX2(SDL_BlitInfo *info)
{
int width = info->dst_w;
int height = info->dst_h;
diff --git a/src/video/SDL_blit_A_avx2.h b/src/video/SDL_blit_A_avx2.h
index c3fc7b1..e2f054c 100644
--- a/src/video/SDL_blit_A_avx2.h
+++ b/src/video/SDL_blit_A_avx2.h
@@ -1,7 +1,6 @@
#ifndef SDL_SDL_BLIT_A_AVX2_H
#define SDL_SDL_BLIT_A_AVX2_H
-#if !defined(_MSC_VER) || (defined(_MSC_VER) && defined(__clang__))
-__attribute__((target("avx2")))
-#endif
-void BlitNtoNPixelAlpha_AVX2(SDL_BlitInfo *info);
+
+void SDL_TARGETING("avx2") BlitNtoNPixelAlpha_AVX2(SDL_BlitInfo *info);
+
#endif //SDL_SDL_BLIT_A_AVX2_H
diff --git a/src/video/SDL_blit_A_sse4_1.h b/src/video/SDL_blit_A_sse4_1.h
index 47be0dd..47fccd5 100644
--- a/src/video/SDL_blit_A_sse4_1.h
+++ b/src/video/SDL_blit_A_sse4_1.h
@@ -3,21 +3,10 @@
#ifdef SDL_SSE4_1_INTRINSICS
Uint32 convertPixelFormat(Uint32 color, const SDL_PixelFormat* srcFormat);
+__m128i SDL_TARGETING("sse4.1") convertPixelFormatsx4(__m128i colors, const SDL_PixelFormat* srcFormat);
-#if !defined(_MSC_VER) || (defined(_MSC_VER) && defined(__clang__))
-__attribute__((target("sse4.1")))
-#endif
-__m128i convertPixelFormatsx4(__m128i colors, const SDL_PixelFormat* srcFormat);
-
-#if !defined(_MSC_VER) || (defined(_MSC_VER) && defined(__clang__))
-__attribute__((target("sse4.1")))
-#endif
-__m128i MixRGBA_SSE4_1(__m128i src, __m128i dst);
-
-#if !defined(_MSC_VER) || (defined(_MSC_VER) && defined(__clang__))
-__attribute__((target("sse4.1")))
-#endif
-void BlitNtoNPixelAlpha_SSE4_1(SDL_BlitInfo *info);
+__m128i SDL_TARGETING("sse4.1") MixRGBA_SSE4_1(__m128i src, __m128i dst);
+void SDL_TARGETING("sse4.1") BlitNtoNPixelAlpha_SSE4_1(SDL_BlitInfo *info);
#endif
diff --git a/src/video/SDL_blit_A_sse4_1.c b/src/video/SDL_blit_A_sse4_1.c
index 3cc852e..512b665 100644
--- a/src/video/SDL_blit_A_sse4_1.c
+++ b/src/video/SDL_blit_A_sse4_1.c
@@ -6,16 +6,13 @@
#include "SDL_blit.h"
-#if !defined(_MSC_VER) || (defined(_MSC_VER) && defined(__clang__))
-__attribute__((target("sse4.1")))
-#endif
/**
* Using the SSE4.1 instruction set, blit four pixels with alpha blending
* @param src A pointer to two 32-bit pixels of ARGB format to blit into dst
* @param dst A pointer to two 32-bit pixels of ARGB format to retain visual data for while alpha blending
* @return A 128-bit wide vector of two alpha-blended pixels in ARGB format
*/
-__m128i MixRGBA_SSE4_1(__m128i src, __m128i dst) {
+__m128i SDL_TARGETING("sse4.1") MixRGBA_SSE4_1(__m128i src, __m128i dst) {
__m128i src_color = _mm_cvtepu8_epi16(src);
__m128i dst_color = _mm_cvtepu8_epi16(dst);
/**
@@ -45,13 +42,10 @@ Uint32 convertPixelFormat(Uint32 color, const SDL_PixelFormat* srcFormat) {
return (a << 24) | (r << 16) | (g << 8) | b;
}
-#if !defined(_MSC_VER) || (defined(_MSC_VER) && defined(__clang__))
-__attribute__((target("sse4.1")))
-#endif
/*
* This helper function converts arbitrary pixel format data into ARGB form with a 4 pixel-wide shuffle
*/
-__m128i convertPixelFormatsx4(__m128i colors, const SDL_PixelFormat* srcFormat) {
+__m128i SDL_TARGETING("sse4.1") convertPixelFormatsx4(__m128i colors, const SDL_PixelFormat* srcFormat) {
// Create shuffle masks based on the source SDL_PixelFormat to ARGB
__m128i srcShuffleMask = _mm_set_epi8(
srcFormat->Ashift / 8 + 12, srcFormat->Rshift / 8 + 12, srcFormat->Gshift / 8 + 12, srcFormat->Bshift / 8 + 12,
@@ -64,10 +58,7 @@ __m128i convertPixelFormatsx4(__m128i colors, const SDL_PixelFormat* srcFormat)
return _mm_shuffle_epi8(colors, srcShuffleMask);
}
-#if !defined(_MSC_VER) || (defined(_MSC_VER) && defined(__clang__))
-__attribute__((target("sse4.1")))
-#endif
-void BlitNtoNPixelAlpha_SSE4_1(SDL_BlitInfo* info) {
+void SDL_TARGETING("sse4.1") BlitNtoNPixelAlpha_SSE4_1(SDL_BlitInfo* info) {
int width = info->dst_w;
int height = info->dst_h;
Uint8 *src = info->src;
P.S./2: Attempted build using old gcc4.9, got the following error: (I know, not ultra important, but letting you know anyway..)
src/video/SDL_blit_A_sse4_1.c: In function 'BlitNtoNPixelAlpha_SSE4_1':
src/video/SDL_blit_A_sse4_1.c:82:13: warning: implicit declaration of function '_mm_loadu_si64' [-Wimplicit-function-declaration]
__m128i c_src = _mm_loadu_si64((buffer + (i * 8)));
^
SDL/src/video/SDL_blit_A_sse4_1.c:82:29: error: incompatible types when initializing type '__m128i' using type 'int'
__m128i c_src = _mm_loadu_si64((buffer + (i * 8)));
^
src/video/SDL_blit_A_sse4_1.c:83:29: error: incompatible types when initializing type '__m128i' using type 'int'
__m128i c_dst = _mm_loadu_si64((dst + i * 8));
^
src/video/SDL_blit_A_sse4_1.c:85:13: warning: implicit declaration of function '_mm_storeu_si64' [-Wimplicit-function-declaration]
_mm_storeu_si64(dst + i * 8, c_mix);
^
src/video/SDL_blit_A_sse4_1.c:95:33: error: incompatible types when initializing type '__m128i' using type 'int'
__m128i c_src = _mm_loadu_si64(src_ptr);
^
src/video/SDL_blit_A_sse4_1.c:97:33: error: incompatible types when initializing type '__m128i' using type 'int'
__m128i c_dst = _mm_loadu_si64(dst_ptr);
^
src/video/SDL_blit_A_avx2.c: In function 'BlitNtoNPixelAlpha_AVX2':
src/video/SDL_blit_A_avx2.c:78:17: warning: implicit declaration of function '_mm_loadu_si64' [-Wimplicit-function-declaration]
__m128i c_src = _mm_loadu_si64(src_ptr);
^
src/video/SDL_blit_A_avx2.c:78:33: error: incompatible types when initializing type '__m128i' using type 'int'
__m128i c_src = _mm_loadu_si64(src_ptr);
^
src/video/SDL_blit_A_avx2.c:80:33: error: incompatible types when initializing type '__m128i' using type 'int'
__m128i c_dst = _mm_loadu_si64(dst_ptr);
^
src/video/SDL_blit_A_avx2.c:82:17: warning: implicit declaration of function '_mm_storeu_si64' [-Wimplicit-function-declaration]
_mm_storeu_si64(dst_ptr, c_mix);
^
Hm, looks like an oversight in gcc < 9: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78782 patched later for gcc9.
Doing the following, it builds for me: [EDIT: better patch in next post]
After some more googling and messing at godbolt.org:
_mm_loadu_si64 :
- missing in clang < 3.9
- missing in gcc < 9
_mm_storeu_si64 :
- missing in clang < 8
- missing in gcc < 9
And __m128i_u type (if to be used to define the missing two above):
- missing in gcc < 7
- missing in clang < 9
Suggest the following patch for it:
diff --git a/src/video/SDL_blit_A_avx2.c b/src/video/SDL_blit_A_avx2.c
index cc83157..5e0780b 100644
--- a/src/video/SDL_blit_A_avx2.c
+++ b/src/video/SDL_blit_A_avx2.c
@@ -5,4 +5,6 @@
#ifdef SDL_AVX2_INTRINSICS
+#define SDL_blit_A_avx2_c
+
#include "SDL_blit.h"
#include "SDL_blit_A_sse4_1.h"
diff --git a/src/video/SDL_blit_A_sse4_1.c b/src/video/SDL_blit_A_sse4_1.c
index 512b665..9e33916 100644
--- a/src/video/SDL_blit_A_sse4_1.c
+++ b/src/video/SDL_blit_A_sse4_1.c
@@ -5,5 +5,8 @@
#ifdef SDL_SSE4_1_INTRINSICS
+#define SDL_blit_A_sse4_1_c
+
#include "SDL_blit.h"
+#include "SDL_blit_A_sse4_1.h"
/**
diff --git a/src/video/SDL_blit_A_sse4_1.h b/src/video/SDL_blit_A_sse4_1.h
index 47fccd5..4b16e95 100644
--- a/src/video/SDL_blit_A_sse4_1.h
+++ b/src/video/SDL_blit_A_sse4_1.h
@@ -11,3 +11,35 @@
#endif
+/* for compatibility with older compilers: */
+#if defined(SDL_blit_A_sse4_1_c) || defined(SDL_blit_A_avx2_c)
+/* _mm_loadu_si64 : missing in clang < 3.9, missing in gcc < 9
+ * _mm_storeu_si64: missing in clang < 8.0, missing in gcc < 9
+ * __m128i_u type (to be used to define the missing two above):
+ * missing in gcc < 7, missing in clang < 9
+ */
+#if defined(__clang__)
+#if (__clang_major__ < 9)
+#define MISSING__m128i_u
+#endif
+#if (__clang_major__ < 8)
+#define MISSING__mm_storeu_si64
+#endif
+#elif defined(__GNUC__)
+#if (__GNUC__ < 7)
+#define MISSING__m128i_u
+#endif
+#if (__GNUC__ < 9)
+#define MISSING__mm_storeu_si64
+#endif
+#endif
+
+#ifdef MISSING__m128i_u
+typedef long long __m128i_u __attribute__((__vector_size__(16), __may_alias__, __aligned__(1)));
+#endif
+#ifdef MISSING__mm_storeu_si64
+#define _mm_loadu_si64(_x) _mm_loadl_epi64((__m128i_u*)(_x))
+#define _mm_storeu_si64(_x,_y) _mm_storel_epi64((__m128i_u*)(_x),(_y))
+#endif
+#endif /**/
+
#endif //SDL_SDL_BLIT_A_SSE4_1_H
Pushed some changes per your suggestions.
Thank you for the changes for older GCC and clang. I tested the implementations by stubbing them in on modern clang and verifying rendering was identical.
I renamed the convertPixelFormat functions to something less likely to collide with anyone's codebase.
I added the targeting macros for the sse4.1 and avx2 attributes.
I undid the changes I made to the CMakeLists.txt as they aren't necessary.
@aronson, can you add some test cases to testautomation to validate these blitters, so we can catch regressions in the future?
You have set this to handle all 32-bit formats, but as far as I can tell the dst format must be ARGB.
@0x1F9F1 That's a great observation. A blit request could be anything, not just rendering. I ran into some issues at runtime where the destination format was seemingly blank for these calls. I will add a conditional call to the color shift shuffle to fix it up when the destination format is specified and test exotic destination formats. We have a lot of performance headroom compared to the scalar implementation and I don't foresee this slowing things down much.
I'm also adding the test cases. I reviewed the surface, video, and pixel automation. I have about half a dozen tests in mind. I saw that the blitting API is already tested in the surface automation. I'm unsure if I should extend that test suite or write my own module.
Should have these updates resolved before the end of the week.
Also, reading your comment https://github.com/libsdl-org/SDL/issues/8178#issuecomment-1698357961 regarding shuffling, I think what you are looking for is unpacklo_epi8, unpackhi_epi8, and packus_epi16. Something like:
__m128i colors16 = /* ... */;
__m128i colors_lo = _mm_unpacklo_epi8(colors16, _mm_setzero_si128());
__m128i colors_hi = _mm_unpackhi_epi8(colors16, _mm_setzero_si128());
/* Do the mixing here */
colors16 = _mm_packus_epi16(colors_lo, colors_hi);
Then you can work at the full vector width, and avoid having to shuffle the data in a separate loop.
I will add a conditional call to the color shift shuffle to fix it up when the destination format is specified and test exotic destination formats.
Perhaps you could just blend directly in the dst format? Basically what you are already doing, but just changing srcShuffleMask and SHUFFLE_ALPHA to work with any dst format instead. If you compute both of them outside the loop there shouldn't be any performance hit.
unpacklo_epi8, unpackhi_epi8, and packus_epi16
Wow... these look great! Will explore their use, thank you. EDIT: Unfortunately the saturation in packus_epi16 is not suitable for the algorithm. The unpack worked great until I had to reduce the result of the multiply. Perhaps I'm missing something, but I get oversaturated colors when I use these methods. I was able to fix it for the most part by shifting or truncating/masking the epi16 members, but the color is still wrong in the extremes, and further fixups would add instructions. I'm going to stick with the current implementation that does the reduce with a shuffle; perhaps this is a future enhancement to investigate.
just changing srcShuffleMask and SHUFFLE_ALPHA to work with any dst format
This approach makes more sense. I investigated the issue I ran into earlier and it seems that some applications (at least Cogmind) resolve to a dstfmt with both Ashift and Bshift set to 0 by the time we reach the blitting functions. This mangles rendering. I added a check for this case to assume ARGB, but I'm not sure it's correct. This is what I settled on that's working:
__m128i SDL_TARGETING("sse4.1") AlignPixelToSDL_PixelFormat_x4(__m128i colors, const SDL_PixelFormat* srcFormat, const SDL_PixelFormat* dstFormat) {
    // Calculate shuffle indices based on the source and destination SDL_PixelFormat
    Uint8 shuffleIndices[16];
    Uint8 dstAshift = dstFormat->Ashift / 8;
    Uint8 dstRshift = dstFormat->Rshift / 8;
    Uint8 dstGshift = dstFormat->Gshift / 8;
    Uint8 dstBshift = dstFormat->Bshift / 8;
    // Handle case where bad input sent
    if (dstAshift == dstBshift && dstAshift == 0) {
        dstAshift = 3;
    }
    for (int i = 0; i < 4; ++i) {
        shuffleIndices[dstAshift + i * 4] = srcFormat->Ashift / 8 + i * 4;
        shuffleIndices[dstRshift + i * 4] = srcFormat->Rshift / 8 + i * 4;
        shuffleIndices[dstGshift + i * 4] = srcFormat->Gshift / 8 + i * 4;
        shuffleIndices[dstBshift + i * 4] = srcFormat->Bshift / 8 + i * 4;
    }
    __m128i shuffleMask = _mm_set_epi8(
        shuffleIndices[15], shuffleIndices[14], shuffleIndices[13], shuffleIndices[12],
        shuffleIndices[11], shuffleIndices[10], shuffleIndices[9],  shuffleIndices[8],
        shuffleIndices[7],  shuffleIndices[6],  shuffleIndices[5],  shuffleIndices[4],
        shuffleIndices[3],  shuffleIndices[2],  shuffleIndices[1],  shuffleIndices[0]
    );
    return _mm_shuffle_epi8(colors, shuffleMask);
}
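Hypothetically, the call site would look something like this (src_ptr and the loop around it are stand-ins for illustration, not the PR's exact code):

// Translate four source pixels into the destination's channel order
// before blending; srcFormat/dstFormat come from the blit info.
__m128i c_src = _mm_loadu_si128((const __m128i *)src_ptr);
c_src = AlignPixelToSDL_PixelFormat_x4(c_src, srcFormat, dstFormat);
/* ...alpha-blend c_src against the destination pixels as before... */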
Unfortunately the saturation in packus_epi16 is not suitable for the algorithm. The unpack worked great until I had to reduce the result of the multiply.
You might have been using the wrong shift. The 16-bit values should already be within the [0, 255] range before packing them, so saturation shouldn't be an issue.
srli_epi16 is the shift you would need in this case (unsigned shift right within the 16-bit lanes).
srli_epi16(colors, 8) would divide by 256, which I think matches what you are currently doing.
Also just in case you didn't know, Intel has a nice intrinsics guide which helps with finding SIMD instructions. https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#
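Putting the unpack/shift/pack pieces together, the round trip described above might look like this (a sketch that assumes the 16-bit alpha lanes have already been broadcast per channel into alpha_lo/alpha_hi; those names are illustrative, not from the PR):

#include <emmintrin.h> /* everything below is plain SSE2 */

/* Blend 4 ARGB pixels of src over dst at 16-bit precision, then narrow. */
static __m128i MixSketch(__m128i src, __m128i dst,
                         __m128i alpha_lo, __m128i alpha_hi)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i max  = _mm_set1_epi16(255);
    __m128i src_lo = _mm_unpacklo_epi8(src, zero); /* low 8 bytes -> 16-bit lanes */
    __m128i dst_lo = _mm_unpacklo_epi8(dst, zero);
    __m128i src_hi = _mm_unpackhi_epi8(src, zero); /* high 8 bytes */
    __m128i dst_hi = _mm_unpackhi_epi8(dst, zero);
    /* src*a + dst*(255-a) stays within [0, 65025]: no 16-bit overflow */
    __m128i lo = _mm_add_epi16(
        _mm_mullo_epi16(src_lo, alpha_lo),
        _mm_mullo_epi16(dst_lo, _mm_sub_epi16(max, alpha_lo)));
    __m128i hi = _mm_add_epi16(
        _mm_mullo_epi16(src_hi, alpha_hi),
        _mm_mullo_epi16(dst_hi, _mm_sub_epi16(max, alpha_hi)));
    lo = _mm_srli_epi16(lo, 8); /* /256 standing in for /255, per above */
    hi = _mm_srli_epi16(hi, 8);
    return _mm_packus_epi16(lo, hi); /* lanes are <= 254, so no saturation */
}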
Thanks, retried from scratch this morning and I got it working.
Buffer allocation removed from the SSE4.1 blitter. Simplified the for-loop usage.
The _mm_srli_epi16(mixed, 8) is definitely necessary to prevent loss of color data. packus as-is won't handle the overflowed result of the multiplication without the shift in this algorithm. If you remove the srli calls in my newly-pushed commit within MixRGBA_SSE4_1, and in the single-pixel overflow blit in the SSE4.1 and AVX2 modules, it will break rendering at extreme alpha values.
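To make the failure mode concrete, here is the arithmetic (a worked example, not code from the PR):

/* Blending sC = 255 at sA = 255 produces the 16-bit product
 * 255 * 255 = 65025 = 0xFE01. Interpreted as a signed lane that is -511,
 * so _mm_packus_epi16 would clamp it to 0 (black) instead of the
 * intended ~254. Shifting first recovers the right byte: */
mixed = _mm_srli_epi16(mixed, 8); /* 0xFE01 -> 0x00FE = 254 */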
~~Benchmarking on three systems didn't demonstrate any measurable loss or gain of performance with these changes that couldn't be attributed to typical variance.~~
EDIT: It seems I was wrong: we get a measurable 10-20% speedup in the synthetic benchmark over time, but the rendering is subtly incorrect in the benchmark and my WIP tests when there are successive blits. The commit before my unpack and packus use is fine (as is main). Reverting that for now.
EDIT 2: I was performing a double unpack in the one pixel case, it works as expected now with the new intrinsics. Will push along with my tests when ready.
Still working on the test suite. I have a clear vision of the tests.
As I began testing blending PRNG noise and hashing the resultant pixels, I discovered that my blitters are not bit-accurate to the original scalar algorithms.
EDIT: A binary compare of the bitmaps reveals it's the issue with alpha values, working to fix it now
Attached at the end are stock.png and exp.png, the result of a short test. Comparing them side by side, you can see the stock blitters produce "brighter" blends. I've tried a few quick modifications to fix this but will need more time to work out a thorough solution.
There are a few approaches I think I can try:
- ~~Hide these behind a conditional compile flag, with the caveat that the performance gains are not color-accurate~~
- Modify the code to handle the alpha channel blending with the original algorithm instead of treating it identically to the color channels
  - This is a correctness issue I discovered while debugging
- ~~Change the width of the blending pipelines (perhaps use 4 packed floats for one pixel)~~
Stock:
PR current state:
Can you verify the correctness of the blending, both in SDL and in your blitters? We have an open issue to go back and look at the SDL blitters and make sure they're doing the correct math, so you might be seeing an existing bug.
We have an open issue to go back and look at the SDL blitters and make sure they're doing the correct math
Would that be https://github.com/libsdl-org/SDL/issues/4484 ?
Yep, that's the one, thanks!
Also related: https://github.com/libsdl-org/SDL/issues/6990
Ah, this context explains some things. Much appreciated. I will review the algorithms against the correct implementation.
I found this whitepaper on division-free accurate integer alpha blending, which has an interesting citation (number 2) pointing to SDL2's SDL_Blit_RGBA8888_RGB888_Blend function and noting an error that leads to a loss of accuracy:
As of January 2022 (commit 120c76c), this function explicitly divides the integer product of an 8-bit alpha component and 8-bit color component by 255, but then loses a half-bit of accuracy by not rounding off this product before dividing. For this case (integer division by 255), Blinn [5] shows that rounding should be done by adding 127, not 128.
I imagine the other alpha-blend functions have similar bones buried. I'm going to implement the algorithm in this paper as the scalar baseline for correctness, as it makes this claim:
When the same blending calculation is carried out to high precision using double-precision floating-point division operations, the results are found to exactly match those produced by this approximation
Of particular interest is its explanation of why typical algorithms suffer the cumulative loss of color data seen in the linked issue:
If we run a test program that supplies all 65,536 possible combinations of 0 ≤ α_raw ≤ 255 and 0 ≤ r_raw ≤ 255 as inputs, we will find that for all but 24 of these combinations, the values generated by this approximation exactly match those obtained by explicitly dividing by 255. For the 24 mismatches, the approximation always generates a value that is too low by just one. These encouraging test results suggest that a gentle nudge might be sufficient to increase the 24 mismatching values just enough that all 65,536 cases match exactly. In fact, the only change to equation (3) that's needed to achieve 100-percent perfect matching is to increase the rounding constant from 8000₁₆ to 8080₁₆.
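The exhaustive verification the paper describes is cheap to reproduce. A minimal harness might look like the following (it uses the 0x8081-multiply/23-bit-shift form that appears later in this thread as the candidate approximation; the paper's own equation (3) differs in its rounding constant):

#include <stdint.h>
#include <stdio.h>

/* Compare a division-free approximation of (a * c) / 255 against explicit
 * integer division for all 65,536 input pairs. Swap in any candidate
 * expression to count its mismatches. */
int main(void)
{
    int mismatches = 0;
    for (uint32_t a = 0; a <= 255; ++a) {
        for (uint32_t c = 0; c <= 255; ++c) {
            uint32_t exact  = (a * c) / 255;
            uint32_t approx = (a * c * 0x8081u) >> 23;
            if (exact != approx) {
                ++mismatches;
            }
        }
    }
    printf("%d mismatches out of 65536\n", mismatches);
    return 0;
}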
EDIT: Looks like the whitepaper is a small piece of the puzzle with regard to efficient conversion of surfaces from straight alpha to premultiplied alpha format. I think I see the problem at play here and the correct behavior. Will continue research and testing.
Some thoughts I had on how to efficiently implement SDL_BLENDMODE_BLEND https://gist.github.com/0x1F9F1/5ef2f3f822716fd375ef58c733b0251a
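For reference, SDL_blendmode.h documents SDL_BLENDMODE_BLEND as:

/* SDL_BLENDMODE_BLEND (straight alpha):
 *   dstRGB = (srcRGB * srcA) + (dstRGB * (1 - srcA))
 *   dstA   =  srcA           + (dstA   * (1 - srcA))
 */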
Thank you, those algorithms were invaluable.
If we presume Brick's implementations are accurate (I have no doubt after reviewing them), the current stock BlitNtoNPixelAlpha, BlitRGBtoBGRPixelAlpha, and friends have off-by-one errors for color values in a few cases. The stock alpha is also off by more than one in some situations when blending random noise.
Brick's SSE pipeline produces nearly bit-perfect results (occasionally alpha is off by one) compared to the correct scalar algorithm, even after many iterations of random blends. It should work flawlessly for the SSE4.1 case, and I will port the AVX2 case to have parity.
Once that's done I'll port my test suite for the blitters to the SDL testautomation framework and we should be good to go.
EDIT: AVX2 port done, tested benchmarks on three systems. Speedups remain consistent from the original post between scalar and SSE4.1/AVX2, with an overall 10-20% speedup compared to my original PR commit.
occasionally alpha is off by one
What algorithm are you using to judge correct alpha? AFAIK my simd version should produce bit-perfect results for alpha. (If you are using the one from ALPHA_BLEND_RGBA then it has the same issues as the color channels I mentioned)
I was using blend_mul2.
I hacked BlitNtoNPixelAlpha to handle all blits and made this change:
DISEMBLE_RGBA(src, srcbpp, srcfmt, Pixel, sR, sG, sB, sA);
if (sA) {
    DISEMBLE_RGBA(dst, dstbpp, dstfmt, Pixel, dR, dG, dB, dA);
    RGBA_FROM_PIXEL(Pixel, dstfmt, dR, dG, dB, dA);
    dR = blend_mul2(sR, dR, sA);
    dG = blend_mul2(sG, dG, sA);
    dB = blend_mul2(sB, dB, sA);
    dA = (Uint8)((int)sA + dA - ((int)sA * dA) / 255);
    ASSEMBLE_RGBA(dst, dstbpp, dstfmt, dR, dG, dB, dA);
}
I'm not 100% sure on the alpha calculation; it could be the source of my error. This scalar implementation produces results different from your SIMD over ~250k blits, which appear to be an accumulation of off-by-one errors. I think I need to adjust the alpha component blend somehow, but using blend_mul2 there gets me results that are even more off (perhaps I'm using it wrong?). ~~Setting source alpha to 255 as the pipeline does made no difference on what's emitted.~~
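One possible way to keep the alpha channel exact with the same multiply-shift trick (an illustration, not necessarily the right fix here):

/* "Over" compositing for straight alpha: dstA' = srcA + dstA*(255 - srcA)/255.
 * n is at most 255*255 = 65025, and over that range (n * 0x8081) >> 23
 * computes floor(n / 255) exactly, so nothing is lost to approximation. */
static uint8_t blend_over_alpha(uint8_t srcA, uint8_t dstA)
{
    uint32_t n = (uint32_t)dstA * (255u - srcA);
    return (uint8_t)(srcA + ((n * 0x8081u) >> 23));
}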
Example compare of the bitmaps:
They are nearly visually identical.
Never mind, I had an error in my test where I was sampling noise one pixel more than I should have. When I cleaned that up, I got bit-perfect results. I'm not sure why that caused it to vary or look like an off-by-one error.
Here is the preliminary SVGA random-noise blend test if you want to check yourself. ~~You should get the same hash check value as I have hardcoded.~~ I'm having trouble making sure it's portable; I think I have errors under macOS mingw (I've run into compiler bugs on that specific flavor with these blitters before, so I'm isolating that). macOS clang gave me this hash. This has been a useful test so far.
#include <SDL3/SDL.h>
#include <stdlib.h>
#include <stdio.h>

uint64_t rotl(uint64_t x, int k) { return (x << k) | (x >> (-k & 63)); }

uint64_t next(uint64_t state[2]) {
    uint64_t s0 = state[0], s1 = state[1];
    uint64_t result = rotl((s0 + s1) * 9, 29) + s0;
    state[0] = s0 ^ rotl(s1, 29);
    state[1] = s0 ^ s1 << 9;
    return result;
}

static uint64_t rngState[2] = {1, 2};

uint32_t get_random_uint32() {
    return (uint32_t)next(rngState);
}

int testRandomToRandomSVGAMultipleIterations();

SDL_Surface* getNextRandomSurface(SDL_PixelFormatEnum format) {
#define BUF_LENGTH (15 * 15)
    Uint32 pixels[BUF_LENGTH];
    for (int i = 0; i < BUF_LENGTH; i++) {
        pixels[i] = get_random_uint32();
    }
    return SDL_CreateSurfaceFrom(pixels, 15, 15, 15 * 4, format);
}

long hashSurfacePixels(SDL_Surface* surface) {
    const int length = surface->w * surface->h;
    long hash = 0;
    for (int i = 0; i < length; i++) {
        Uint32 pixel = ((Uint32*)(surface->pixels))[i];
        hash = (hash + (324723947 + pixel) ^ 93485734985);
    }
    return hash;
}

int main(int argc, char* argv[]) {
    // Initialize SDL
    if (SDL_Init(SDL_INIT_VIDEO) < 0) {
        printf("SDL could not initialize! SDL_Error: %s\n", SDL_GetError());
        return 1;
    }
    (testRandomToRandomSVGAMultipleIterations());
    return 0;
}

int testRandomToRandomSVGAMultipleIterations() {
    const int width = 800;
    const int height = 600;
    SDL_Surface* destSurface = SDL_CreateSurface(width, height, SDL_PIXELFORMAT_RGBA8888);
    for (int i = 0; i < 250000; i++) {
        SDL_Surface *sourceSurface = getNextRandomSurface(SDL_PIXELFORMAT_RGBA8888);
        SDL_Rect destRect;
        int location = (int)get_random_uint32();
        destRect.x = location % (width - 15 - 1);
        destRect.y = location % (height - 15 - 1);
        SDL_BlitSurface(sourceSurface, NULL, destSurface, &destRect);
        SDL_DestroySurface(sourceSurface);
    }
    long hash = hashSurfacePixels(destSurface);
    printf("Random to Random SVGA blit: %lx\n", hash);
    //SDL_SaveBMP(destSurface, "output.bmp");
    SDL_DestroySurface(destSurface);
    return hash == 0x402c4e703e682;
}
I've verified Brick's SIMD and mul2 algorithms produce comparable (visual) results to the classic algorithms.
These were my implementations:
uint8_t blend_classic(uint8_t sC, uint8_t dC, uint8_t sA)
{
    return ((((sC - dC) * (int)sA) / 255) + dC);
}

uint8_t blend_classic_alpha(uint8_t srcA, uint8_t dstA) {
    return srcA + ((dstA * (255 - srcA)) / 255);
}

// double implementations of classic
uint8_t blend_classic_double(uint8_t sC, uint8_t dC, uint8_t sA) {
    const double norm = 255.0;
    double result = (((double)(sC - dC) * sA) / norm) + dC;
    return (uint8_t)result;
}

uint8_t blend_classic_alpha_double(uint8_t srcA, uint8_t dstA) {
    const double norm = 255.0;
    double result = srcA + ((dstA * (norm - srcA)) / norm);
    return (uint8_t)result;
}

// From Brick
uint8_t blend_mul2(uint8_t sC, uint8_t dC, uint8_t sA)
{
    return ((uint32_t)(((sC - dC) * sA) + ((dC << 8) - dC)) * 0x8081u) >> 23;
}

uint8_t blend_mul2_alpha(uint8_t srcA, uint8_t dstA) {
    return ((uint32_t)(((srcA - dstA) * srcA) + ((dstA << 8) - dstA)) * 0x8081u) >> 23;
}

uint8_t blend_mul2_alpha_premultiplied(uint8_t srcA, uint8_t dstA) {
    return ((uint32_t)(srcA + ((dstA * (255 - srcA)) * 0x8081u)) >> 23);
}

// Used as "correct" reference for accuracy
uint8_t blend_scalar_double(uint8_t sC, uint8_t dC, uint8_t sA)
{
    const double norm = 255.0;
    return (uint32_t)(((double)sA * sC) / norm + (1.0 - sA / norm) * dC);
}
Classic integer blitting (hash: 0x41411a5693eff):
Classic double blitting (hash: 0x411361f841342):
SIMD blitting (hash: 0x40cb9d91960d8):
I am a bit mystified. main's blitters under Windows and Linux on x86_64 and ARM64 don't seem to be deterministic with respect to the pixel input, no matter what compiler I try. On macOS they are deterministic and I get reproducible hashes. This is true of any blitter I try as a replacement, SIMD or scalar.
Note that Cogmind renders fine, as it only performs at most ~4 stacked blits before clearing a surface entirely. EDIT: I tried to run the test on my n3ds and ran into an "undefined instruction" crash trying to init video. I will look around to see if this is a known issue and open one if not.
Perhaps I have an error in my test, but the random number stream I feed to SDL_CreateSurfaceFrom is identical between runs and platforms.
These are subsequent runs of the 250k-blit test against main with all default compile options.
The images seem corrupted compared to what I'd expect (and have seen on macOS) for uniformly distributed random noise.
Main blitter Linux x86_64 GCC 11.4:
SIMD blitter Linux x86_64 GCC 11.4:
Main blitter Linux ARM64 GCC 11.4:
SDL_Surface* getNextRandomSurface(SDL_PixelFormatEnum format) {
#define BUF_LENGTH (15 * 15)
    Uint32 pixels[BUF_LENGTH];
    for (int i = 0; i < BUF_LENGTH; i++) {
        pixels[i] = get_random_uint32();
    }
    return SDL_CreateSurfaceFrom(pixels, 15, 15, 15 * 4, format);
}
See the docs for SDL_CreateSurfaceFrom:
No copy is made of the pixel data. Pixel data is not managed automatically;
you must free the surface before you free the pixel data.
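A minimal fix, as a sketch: let the surface own its pixel storage so nothing dangles after the stack buffer goes out of scope.

SDL_Surface* getNextRandomSurface(SDL_PixelFormatEnum format) {
    SDL_Surface *surface = SDL_CreateSurface(15, 15, format);
    if (!surface) {
        return NULL;
    }
    /* Fill the surface's own pixels row by row, respecting the pitch. */
    for (int y = 0; y < 15; y++) {
        Uint32 *row = (Uint32 *)((Uint8 *)surface->pixels + y * surface->pitch);
        for (int x = 0; x < 15; x++) {
            row[x] = get_random_uint32();
        }
    }
    return surface;
}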