Potentially outdated optimisations slowing SDL blitting?
Hello,
I was tinkering with blitting over in pygame again and had a look at SDL_BlitCopy(). It currently has two hardware-specific optimisation paths on x86, one for SSE1 and one for MMX; see:
https://github.com/libsdl-org/SDL/blob/120c76c84bbce4c1bfed4e9eb74e10678bd83120/src/video/SDL_blit_copy.c#L130-L153
In my very quick testing in pygame, going down either of these paths seemed slower than using the standard SDL_memcpy() path for the function. I suspect this is because SDL_memcpy() (generally an alias for memcpy()) now uses SSE2 internally, which would make the hand-written paths the slower option.
Caveat: I have not had the free time to rebuild SDL with these lines commented out. I only tested my quickly-hacked-into-pygame copy of the function, so there could be something I'm missing in the quick port. Still, I thought somebody who builds the latest SDL regularly might want to give it a try and see whether it improves the speed of non-alpha-channel surface blits.
AFAIK these SSE and MMX paths are only relevant on x86 machines anyway, so if standard memcpy is now faster than these optimisations for 99% of SDL builds on those platforms, they should be safe to scrap, no?
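For reference, the plain SDL_memcpy() path being compared here is roughly a row-by-row copy. Below is a minimal sketch in C, assuming the two surfaces do not overlap; `blit_copy_rows` and its parameter names are hypothetical and chosen for illustration, not taken from SDL's source:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not SDL's actual code: copy `h` rows of `row_bytes`
 * bytes each from src to dst. The pitches are the byte strides between the
 * starts of consecutive rows and may exceed row_bytes when rows are padded.
 * Assumes the buffers do not overlap (memcpy is undefined for overlap). */
static void blit_copy_rows(uint8_t *dst, size_t dst_pitch,
                           const uint8_t *src, size_t src_pitch,
                           size_t row_bytes, size_t h)
{
    while (h--) {
        memcpy(dst, src, row_bytes); /* the libc decides which SIMD path to use */
        src += src_pitch;
        dst += dst_pitch;
    }
}
```

For a 32-bit surface, `row_bytes` would be `width * 4`; the SSE and MMX paths in question replace only the per-row copy, so this is the baseline they have to beat.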
From @Starbuck5's testing over in the pygame repo (linked above), removing these SSE/MMX lines wasn't the easy win I'd hoped for.
24-bit odd-width blits got a lot faster, but the ones people usually really care about, 24/32-bit even-width blits, got slower using regular memcpy, even after enabling AVX2.
Somewhat surprising, but those are the results we were seeing. Unless anyone over at SDL has any better ideas?
Windows only, but perhaps this is of interest:
https://github.com/microsoft/mimalloc/issues/201
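#include <intrin.h>  /* MSVC header that declares the __movsb intrinsic */
#include <stddef.h>
/* Copy n bytes from s to d using the `rep movsb` instruction. */
static inline void memcpy_movsb(void *d, const void *s, size_t n) {
    /* __movsb takes unsigned char pointers, so cast explicitly */
    __movsb((unsigned char *)d, (const unsigned char *)s, n);
}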
24-bit odd-width blits got a lot faster, but the ones people usually really care about, 24/32-bit even-width blits, got slower using regular memcpy, even after enabling AVX2.
However, we couldn't replicate the performance numbers of the SDL prebuilts, so there must be a missing factor in our comparisons.
I'm going to move this out of the milestone for now. Let us know when you have full benchmarks and what milestone it makes sense to put in any potential changes for testing.
In a quick test on my quite old Linux/Intel laptop, with a 3888 x 2592 surface, SSE still seems a little faster than, or equal to, using only SDL_memcpy (i.e. no SSE, no MMX). The MMX path seems slower, though.
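For anyone wanting to reproduce this kind of comparison outside SDL, here is a rough timing sketch in C. It only measures the memcpy baseline; `time_memcpy_blit` is a hypothetical helper, and using `clock()` with a tightly packed pitch is an assumption made to keep the example short, not how the numbers above were produced:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical benchmark helper: copy a w x h surface with `bpp` bytes per
 * pixel `iterations` times, one memcpy per row. Returns the average CPU
 * seconds per full-surface copy, or -1.0 on bad input, allocation failure,
 * or a copy mismatch. */
static double time_memcpy_blit(size_t w, size_t h, size_t bpp, int iterations)
{
    if (iterations <= 0) return -1.0;

    size_t pitch = w * bpp;            /* assumes rows are tightly packed */
    size_t total = pitch * h;
    uint8_t *src = malloc(total);
    uint8_t *dst = malloc(total);
    if (!src || !dst) { free(src); free(dst); return -1.0; }

    memset(src, 0x5A, total);
    memset(dst, 0, total);             /* touch pages before timing starts */

    clock_t start = clock();
    for (int i = 0; i < iterations; i++) {
        for (size_t y = 0; y < h; y++) {
            memcpy(dst + y * pitch, src + y * pitch, pitch);
        }
    }
    clock_t end = clock();

    int ok = memcmp(src, dst, total) == 0;  /* also keeps the copies observable */
    free(src);
    free(dst);
    if (!ok) return -1.0;
    return (double)(end - start) / CLOCKS_PER_SEC / iterations;
}
```

Calling `time_memcpy_blit(3888, 2592, 4, 100)` would approximate the surface size mentioned above at 32 bits per pixel. To compare against an SSE or MMX path, the inner `memcpy` call would be swapped for the routine under test and both variants built with the same compiler flags.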
We have a draft PR to remove MMX code in https://github.com/libsdl-org/SDL/pull/8300