SDL Potentially outdated optimisations slowing SDL blitting?

Hello,

I was tinkering with blitting over in pygame again and had a look at SDL_BlitCopy(). It currently has two hardware-specific optimisation paths on x86, one for SSE1 and one for MMX; see:

https://github.com/libsdl-org/SDL/blob/120c76c84bbce4c1bfed4e9eb74e10678bd83120/src/video/SDL_blit_copy.c#L130-L153

In my very quick testing in pygame, going down either of these paths seemed slower than using the function's standard SDL_memcpy() path. I suspect this is because SDL_memcpy() (generally an alias for memcpy()) now uses SSE2 internally, which would leave these hand-written paths behind.
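For anyone following along, the plain fallback being compared here amounts to one memcpy() per row. The following is a minimal sketch of that idea, not SDL's actual code: the helper name blit_copy_rows and its parameters are hypothetical, and it assumes the source and destination buffers do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of SDL): copy `rows` rows of `row_bytes`
 * bytes each from src to dst. Each pitch is the distance in bytes between
 * the starts of consecutive rows and may be larger than row_bytes.
 * Assumes the two buffers do not overlap. */
static void blit_copy_rows(uint8_t *dst, size_t dst_pitch,
                           const uint8_t *src, size_t src_pitch,
                           size_t row_bytes, size_t rows)
{
    /* A row can never be wider than the pitch of its surface. */
    assert(row_bytes <= src_pitch && row_bytes <= dst_pitch);

    while (rows--) {
        memcpy(dst, src, row_bytes); /* the libc copy does the real work */
        src += src_pitch;
        dst += dst_pitch;
    }
}
```

For a 32-bit surface, row_bytes would be the blit width multiplied by 4; whether this beats the SSE/MMX paths depends on the memcpy() implementation the build links against.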

Caveat: I have not had the free time to rebuild SDL with these lines commented out. I only tested my quickly-hacked-into-pygame copy of the function, so I could be missing something in the quick port. Still, I thought somebody who regularly builds the latest SDL might want to give it a try and see whether it improves the speed of blits between surfaces without an alpha channel.

AFAIK these SSE and MMX paths are only relevant on x86 machines anyway, so if standard memcpy is now faster than these optimisations for 99% of SDL builds on those platforms, they should be safe to scrap, no?

Jul 14 '22 18:07 MyreMylar

From @Starbuck5's testing over in the pygame repo (linked above), removing these SSE/MMX lines wasn't the easy win I'd hoped for.

24-bit odd-width blits got a lot faster, but the ones people usually really care about, 24/32-bit even-width blits, got slower using regular memcpy, even after enabling AVX2.

Somewhat surprising, but those are the results we were seeing. Unless anyone over at SDL has any better ideas?

Windows only, but perhaps this is of interest:

https://github.com/microsoft/mimalloc/issues/201

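#include <intrin.h>  /* MSVC: declares __movsb */

static inline void memcpy_movsb (void *d, const void *s, size_t n) {
	/* Copies n bytes using the x86 "rep movsb" instruction. */
	__movsb ((unsigned char *)d, (const unsigned char *)s, n);
}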

Jul 19 '22 19:07 MyreMylar

24-bit odd-width blits got a lot faster, but the ones people usually really care about, 24/32-bit even-width blits, got slower using regular memcpy, even after enabling AVX2.

However, we couldn't replicate the performance numbers of the SDL prebuilts, so there must be a missing factor in our comparisons.

Jul 20 '22 00:07 Starbuck5

I'm going to move this out of the milestone for now. Let us know when you have full benchmarks and which milestone makes sense for testing any potential changes.

Jul 25 '22 23:07 slouken

I did a quick test on my quite old Linux/Intel laptop with a 3888 x 2592 surface. SSE still seems a little faster than, or equal to, using only SDL_memcpy (i.e. no SSE, no MMX). The MMX path seems slower though.
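A rough way to get a baseline number for the memcpy-only case outside of SDL is sketched below. This is not the test that was run above: the function name time_rowwise_memcpy is hypothetical, the pixel size is a parameter because the surface format was not stated, and clock() measures CPU time rather than wall time.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical benchmark helper (not part of SDL): time `iterations`
 * full-surface copies done one row at a time with memcpy().
 * Returns elapsed CPU seconds, or -1.0 if allocation or the copy failed. */
static double time_rowwise_memcpy(size_t width, size_t height,
                                  size_t bytes_per_pixel, int iterations)
{
    size_t pitch = width * bytes_per_pixel; /* tightly packed rows */
    size_t total = pitch * height;
    uint8_t *src = malloc(total);
    uint8_t *dst = malloc(total);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xAB, total); /* touch the pages before timing */
    memset(dst, 0, total);

    clock_t start = clock();
    for (int i = 0; i < iterations; i++) {
        for (size_t y = 0; y < height; y++) {
            memcpy(dst + y * pitch, src + y * pitch, pitch);
        }
    }
    clock_t end = clock();

    /* Checking the result also keeps the copies from being optimised away. */
    int ok = memcmp(src, dst, total) == 0;
    free(src);
    free(dst);
    return ok ? (double)(end - start) / CLOCKS_PER_SEC : -1.0;
}
```

Calling time_rowwise_memcpy(3888, 2592, 4, 100) would approximate a 32-bit surface of the size mentioned above; comparing against the SSE and MMX paths would still need a build of SDL with each path forced on or off.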

Jul 29 '22 20:07 1bsyl

We have a draft PR to remove MMX code in https://github.com/libsdl-org/SDL/pull/8300

Nov 07 '23 15:11 slouken
