Performance analysis of surface.fill

Open Starbuck5 opened this issue 1 year ago • 8 comments

Introduction

Last week I profiled a handful of random games, mostly from pygame community game jams. One thing I noticed is that fill was often one of the higher-ranked pygame-ce functions in terms of runtime. It's not as expensive as blits or display.flip, but it tends to be up there, which surprised me. It doesn't seem like it should be that expensive.

I wondered if pygame-ce's internal fill routines could do a better job than SDL's, so I came up with my own alternative to surface.fill():

# Normal
screen.fill("purple")

# Weird
screen.fill("white", special_flags=pygame.BLEND_SUB)
screen.fill("purple", special_flags=pygame.BLEND_ADD)

And I found that even these 2 function calls together were faster than a normal fill! Amazing! If going through the surface twice is faster, we could easily have our special_flags=0 routine take over fill(), do the job more efficiently, and show a noticeable speed improvement in a typical game.
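
For anyone wondering why the two-pass version ends up with the same pixels as a plain fill, here is a minimal sketch of the per-channel arithmetic in pure Python. It assumes BLEND_SUB and BLEND_ADD clamp each RGB channel to 0..255 and leave alpha alone, and it uses (160, 32, 240) as the RGB value of "purple". The helper names are made up for illustration and are not pygame-ce internals.

```python
WHITE = (255, 255, 255)
PURPLE = (160, 32, 240)  # assumed RGB value of "purple"


def blend_sub(dst, src):
    """Per-channel subtract, clamped at 0 (models BLEND_SUB on RGB)."""
    return tuple(max(d - s, 0) for d, s in zip(dst, src))


def blend_add(dst, src):
    """Per-channel add, clamped at 255 (models BLEND_ADD on RGB)."""
    return tuple(min(d + s, 255) for d, s in zip(dst, src))


def sub_add_fill(dst, color):
    """Subtracting white zeroes every channel; adding the color then sets it."""
    return blend_add(blend_sub(dst, WHITE), color)


# Whatever the pixel held before, the result is the fill color.
for start in [(0, 0, 0), (255, 255, 255), (12, 200, 99)]:
    print(start, "->", sub_add_fill(start, PURPLE))
```

The first pass always lands on (0, 0, 0) because no channel can exceed 255, so the second pass has nothing to clamp and the previous contents of the surface never matter.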

However, the story is not that simple. I tested a larger surface and SDL was now faster. What gives? Why is SDL better at large surfaces while we are better at small ones, and is there anything we can contribute to them or learn for ourselves from this?

Data

fill() seconds taken (20k repetitions), different strategies

fill() nanoseconds per pixel, different strategies

Raw data: https://docs.google.com/spreadsheets/d/1WBCVvzkL9HAZJ7Yo1N86-tAFhAP0d2J72Wcp8mveCl4/edit?gid=2144097095#gid=2144097095
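
For reference, a per-pixel figure can be derived from a total runtime roughly like this. This is only a sketch of the conversion, assuming each surface pixel is counted once per repetition no matter how many passes a strategy makes. The function name and the numbers in the example call are hypothetical, not taken from the spreadsheet.

```python
def ns_per_pixel(seconds, width, height, repetitions):
    """Convert a total runtime into nanoseconds per surface pixel per fill."""
    pixels_filled = width * height * repetitions
    return seconds * 1e9 / pixels_filled


# Hypothetical example: 2.0 s for 20,000 fills of a 1000x1000 surface.
print(ns_per_pixel(2.0, 1000, 1000, 20_000))
```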

Benchmarking script
import time
import pygame

pygame.init()
size = 1550
screen = pygame.Surface((size,size))

print(screen, screen.get_pitch())

def fill_purple_normal():
    print("Normal fill")

    screen.fill("purple")
    print(screen.get_at((0,0))) # Color(160, 32, 240, 255)

    # perf_counter() is a monotonic, high-resolution clock meant for timing
    start = time.perf_counter()
    for _ in range(10000):
        screen.fill("purple")
    print(time.perf_counter() - start) # Takes about 0.19 seconds on my system

def fill_purple_weird():
    print("SUB-ADD fill")

    screen.fill("white", special_flags=pygame.BLEND_SUB)
    screen.fill("purple", special_flags=pygame.BLEND_ADD)
    print(screen.get_at((0,0))) # Color(160, 32, 240, 255)

    # perf_counter() is a monotonic, high-resolution clock meant for timing
    start = time.perf_counter()
    for _ in range(10000):
        screen.fill("white", special_flags=pygame.BLEND_SUB)
        screen.fill("purple", special_flags=pygame.BLEND_ADD)
    print(time.perf_counter() - start) # Takes about 0.07 seconds on my system

fill_purple_normal()
fill_purple_weird()
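
To collect numbers across many surface sizes instead of editing `size` by hand, a small sweep harness along these lines could work. This is a sketch, not the script that produced the charts: `time_calls`, `sweep`, and `make_bytearray_fill` are hypothetical names, and the bytearray strategy is only a stand-in so the block runs without pygame installed.

```python
import time


def time_calls(func, repetitions):
    """Return the seconds taken to call func() `repetitions` times."""
    start = time.perf_counter()
    for _ in range(repetitions):
        func()
    return time.perf_counter() - start


def sweep(make_fill, sizes, repetitions):
    """Time one fill strategy at each size; returns {size: seconds}."""
    results = {}
    for size in sizes:
        fill_once = make_fill(size)  # build the target outside the timed region
        fill_once()                  # one untimed warm-up call
        results[size] = time_calls(fill_once, repetitions)
    return results


def make_bytearray_fill(size):
    """Stand-in strategy: overwrite a size*size 32 bpp buffer, no pygame needed."""
    buffer = bytearray(size * size * 4)
    pattern = bytes((160, 32, 240, 255)) * (size * size)

    def fill_once():
        buffer[:] = pattern

    return fill_once


results = sweep(make_bytearray_fill, sizes=(50, 100, 200), repetitions=100)
for size, seconds in results.items():
    print(f"{size}x{size}: {seconds:.6f} s")
```

With pygame-ce, the stand-in would be swapped for a factory that creates a `pygame.Surface((size, size))` and returns a function calling `fill("purple")` on it, or the SUB/ADD pair for the other strategy.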

Analysis

  • SDL's performance is bumpier; it seems to like extremely round widths. Width=1200 is significantly faster for SDL than widths 1050, 1100, 1150, 1250, and 1300. I don't think this is an outlier, since widths 400 and 800 also show unusually low runtimes in the same way.
  • I'm running this on an x86_64 system with AVX2, so our fills are using our AVX2 routines, while SDL's are using SSE.
  • Our performance gradient is smoother overall, but past 1000x1000 surfaces our time to process each pixel goes up 4-5x to a new stable threshold.
  • A 1000x1000 32 bpp surface takes up 4MB of memory. My L3 cache is 12MB. Would the amount of cache change the performance threshold?
  • SDL is using aligned, non-temporal stores (they use _mm_stream_ps)
  • We are using unaligned, normal stores (_mm_storeu_si128, _mm256_storeu_si256)
  • Source code of what I believe is the SDL routine used here: https://github.com/libsdl-org/SDL/blob/e924f12a7b33678bc71dc96acf3c44142c72c553/src/video/SDL_fillrect.c#L30-L95
  • They have accelerators for ARM as well, so I'm curious to see if on ARM they would be faster than us in every scenario. Like there could be a world where we make our own internal FillRect implementation for x86 but still call theirs on ARM.
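
Some of the numbers in the list above can be checked with plain arithmetic. The sketch below computes the pixel-buffer size of a 32 bpp surface, compares it to an L3 cache size, and reports how many bytes of each row are left over after splitting it into 64-byte pieces. It makes several assumptions: pitch equals width * 4, "12MB" means 12 MiB, and 64 bytes is the cache-line size on this machine. The function names are hypothetical.

```python
BYTES_PER_PIXEL = 4                # 32 bpp surface
CACHE_LINE_BYTES = 64              # assumed cache-line size on x86_64
L3_CACHE_BYTES = 12 * 1024 * 1024  # assumes the quoted 12MB means 12 MiB


def surface_bytes(width, height):
    """Pixel-buffer size, assuming pitch == width * BYTES_PER_PIXEL."""
    return width * height * BYTES_PER_PIXEL


def row_tail_bytes(width):
    """Bytes left in one row after splitting it into cache-line-sized pieces."""
    return (width * BYTES_PER_PIXEL) % CACHE_LINE_BYTES


for size in (400, 800, 1000, 1050, 1200, 1550):
    total = surface_bytes(size, size)
    share = total / L3_CACHE_BYTES
    print(f"{size:>5}: {total / 1e6:6.2f} MB, {share:6.1%} of L3, "
          f"row tail {row_tail_bytes(size):2d} bytes")
```

By this arithmetic the 1000x1000 surface is 4,000,000 bytes and the 1550x1550 surface from the script is 9,610,000 bytes, so both would still fit in a 12 MiB L3. Widths 400, 800, and 1200 come out with a zero row tail, while 1000 and 1050 do not. Whether either fact actually explains the timings is untested; this is only arithmetic on the sizes mentioned.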

This is a very open-ended issue. I mainly want to bring up what I've found for those who might also be interested. Potentially we can contribute something to SDL or learn something from their strategy to improve our own.

@MyreMylar @itzpr3d4t0r

Starbuck5, Nov 20 '24 05:11