pygame-ce
pygame-ce copied to clipboard
Optimized SSE2 alpha blit algorithm
Remake of #2680. Sort of a continuation of #2601. This should solve some issues with CI and clear the mess that was the previous PR.
The old strategy of processing 2 pixels at once for even widths or 1 pixel at a time for odd widths has been replaced. Now, it processes 4 pixels at once for (width // 4) iterations, and 1, 2, or 3 pixels as needed.
This change speeds up the algorithm by approximately 55% for even blits and 170% for odd blits.
Hey, this is a great avenue for a PR, the SSE alpha blitters are some of the oldest SIMD code we have, and we've learned new strategies since they were written.
In the old algorithm, since it was going one pixel at a time in the odd path, it had a fast path where it would just copy over the pixel data if certain conditions were met. How does the performance now stack up against that fast path?
Also, tell me more about the changes. Did you do any algorithmic changes or is it just changes to increase the parallelism?
I can give a partial answer because I'm in a hurry but thanks for the comment, i didn't test that. Using the same test i showed in the previous PR with a surface set to 0 alpha or 255 alpha i get the folowing result:
Basically the new strategy is faster (60%) when the surface has odd width and a bit slower (25-30%) in the even case. This may suggest further changes are needed but since these are edge cases (i mean when does someone blit something to a 0 alpha destination or a full 255 alpha surface onto another using alpha?) I don't think this is too bad.
Random test fails aside (not caused by this PR) I've managed to improve performance by 15-20% by removing some unnecessary instructions and now the situation is much better compared to the old strategy in the dst_alpha=0 case as well as src=255:
Anyway about this:
Also, tell me more about the changes. Did you do any algorithmic changes or is it just changes to increase the parallelism?
The changes are mainly focusing on improving parallelism as I'm following the sdl formula exactly. I tried to save a few unpack here and there but it's really not much different from before.
Test program:
from sys import stdout
from pstats import Stats
from cProfile import Profile
import pygame
pygame.init()
def speed_test_blits():
# chosen because they contain edge cases.
nums = [0, 1, 65, 126, 127, 199, 254, 255]
results = {}
for iterations in range(0, 10000):
for dst_r, dst_b, dst_a in zip(nums, reversed(nums), reversed(nums)):
for src_r, src_b, src_a in zip(nums, reversed(nums), nums):
src_surf = pygame.Surface((66, 66), pygame.SRCALPHA, 32)
src_surf.fill((src_r, 255, src_b, src_a))
dest_surf = pygame.Surface((66, 66), pygame.SRCALPHA, 32)
dest_surf.fill((dst_r, 255, dst_b, dst_a))
dest_surf.blit(src_surf, (0, 0))
key = ((dst_r, dst_b, dst_a), (src_r, src_b, src_a))
results[key] = dest_surf.get_at((65, 33))
if __name__ == "__main__":
profiler = Profile()
profiler.runcall(speed_test_blits)
stats = Stats(profiler, stream=stdout)
stats.strip_dirs()
stats.sort_stats("cumulative")
stats.print_stats()
Results on my Raspberry Pi:
Current main branch:
ncalls tottime percall cumtime percall filename:lineno(function)
1 2.606 2.606 19.145 19.145 blitter_speed_test.py:11(speed_test_blits)
640000 13.319 0.000 13.319 0.000 {method 'blit' of 'pygame.surface.Surface' objects}
1280000 2.997 0.000 2.997 0.000 {method 'fill' of 'pygame.surface.Surface' objects}
640000 0.223 0.000 0.223 0.000 {method 'get_at' of 'pygame.surface.Surface' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
With PR:
ncalls tottime percall cumtime percall filename:lineno(function)
1 2.642 2.642 13.446 13.446 blitter_speed_test.py:11(speed_test_blits)
640000 7.507 0.000 7.507 0.000 {method 'blit' of 'pygame.surface.Surface' objects}
1280000 3.056 0.000 3.056 0.000 {method 'fill' of 'pygame.surface.Surface' objects}
640000 0.241 0.000 0.241 0.000 {method 'get_at' of 'pygame.surface.Surface' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Looks like a solid improvement here when the SSE2 code is converted to NEON code.
Thanks for the ping on discord, this had slipped off my radar.
Every time I write or review one of our blitter functions I write a fuzzer test to make sure the output is exactly the same before and after.
import random
import hashlib
import pygame
pygame.init()
random.seed(36)
def populate_surf(surf):
for y in range(surf.get_height()):
for x in range(surf.get_width()):
surf.set_at(
(x, y),
(
random.randint(0, 255),
random.randint(0, 255),
random.randint(0, 255),
random.randint(0, 255),
),
)
sizes = [(random.randint(1, 50), random.randint(1, 50)) for _ in range(30)]
offsets = [(random.randint(1, 20), random.randint(1, 20)) for _ in range(30)]
expected_hashes = [
"d35958573201ec5788755bc9e3e471bb5415203144671643ad9bfb908f7df646",
"4f5102ac56cf56462c1fc66d395d9c44777a3b756b82a28b9beb8342f0d0287c",
"f59a078f67688d66b65a542ed8faa7eacefc91a27db2a5dc85ebc736324100a5",
"94bc01747d7a182bfc8adf39b43b00ec46c4733a348181529c64cdfbea615c39",
"e14cffeaeef21879ed6ec6c29171e9645ecd89dd95449914994f1a7ec9b9ccfd",
"a6d5f758b65088347f896043a92c084d000c1d9bb66319d8dda4f1f2f5f7757f",
"9d3940ae03ed0f985625c351736ac5ee07dec7e170f76370fdbd8b9257541a1f",
"83985622f2f39aebe2efa7b8765e567e832e79d7b42d842afb8cc064596ed5eb",
"3eb6988b34dd22689bb5a0b898b789dc47fd34eb4fe113831564bec02d9f4eb6",
"66b3c4d65ecb300f12f603a778841c1fa87506ca9923bf72f51f5773566446f3",
"50c8c736cc2bcf6197ecb5929b044bfdbde393f86e5e0373538c87e4e71ba645",
"04d04558ad1efe6432c3940293ebe614c68935e94a7eed1577d2cd1246a7e043",
"240c1db632a0f3c7d0804aca4d2a0937bf428a25bc9517686d0b313035283d0a",
"a70d794fde1ccbec2addbccd87566b3dc6d89d22969156184e68cfca4b26639b",
"1f7c02681cbb4ca7c4d6e21bc05b28ad3011984d77e0928960dfc4454573d97f",
"c421f931cb6cbbeb13aa194967e2575d4f94a9dcfd9d3db197fc8cf7de1cd3f6",
"6942e0a43afba737ff20ad0418a6a6700a166566f347707dbd80c28c9383e720",
"93a4be6d2aa71396fe6c8fcb78fad62c8d4e72a22df76c3d1dd9273123dbb15d",
"867551157ab0c352a5b730664087322e5757c4c50c01ce5778c55e520852c325",
"15b6c9a4051dd8f2499cd8244c4f0d3effab225ce6eba254e2e6f83da3156a82",
"fc280abba3b0c05cf8f3a3d67c8bf946a9509f15d00538c41ce9191bb2849dce",
"22c14fb6f991056287b2aceab9e9a3630bbacc00529564b91a85183cb2559c3c",
"31b70c1c6da2a55036d0139da7ee1d0171bc87ca80f5e8d3f8b60ef3d714fb16",
"e2e87c6cc6b07b64e8d9367c28c19d6cd4b712db2fd0514ac941def98470bcf7",
"a67aa0f76b96ac26c5bfa2f558431b5244a84fe40843d17a85d1bfdb80991a6f",
"488d7f0cc883e498575f261f52df7cc142cb0f0244d7722f28018c6e87542b31",
"c6ad14ba201acaba673381093fc7f04764cae6ab1b4c724cd4336dd25341ed7c",
"47573d029dd21db29351074501033a5c029f47e34d625d39e1be166d5c4ed537",
"1c3e54146c2a0d4918f64047525dc79fd45c2c3db74d9dea5dd4a9c862e32a87",
"1940432ae8813a746d4201c14ccc6456d93fb65680c3a69861967579ab14a03e",
]
hashes = []
for size, offset in zip(sizes, offsets):
source = pygame.Surface(size, pygame.SRCALPHA)
populate_surf(source)
dest = pygame.Surface(
(random.randint(1, 50), (random.randint(1, 50))), pygame.SRCALPHA
)
populate_surf(dest)
dest.blit(source, offset)
sha256 = hashlib.sha256()
sha256.update(pygame.image.tobytes(dest, "RGBA"))
digest = sha256.hexdigest()
assert digest == expected_hashes.pop(0)
hashes.append(digest)
print(digest)
print(hashes)
print("Success")
This script passes successfully on both SSE2 and AVX2 implementations before this PR, but raises an exception in the SSE2 implementation in this PR. I have not tracked down exactly what is different, but this shows that the new implementation is not producing the same results as the old one. If I adjust the code to print instead of assert I can see that 22/30 cases are the same, it's just 8/30 doing something different.
Retested the correctness with my script and with an expanded version that checks more random cases, I'm now highly confident that this PR exhibits identical behavior to the preceding implementation.