SDL ARM assembly to address performance of blit and fill routines

ARM assembly to address performance of blit and fill routines

Open sezero opened this issue 3 years ago • 5 comments

Originally posted by @bavison in https://github.com/libsdl-org/SDL-1.2/issues/777#issuecomment-871726562:

I think I can see what's happened. In SDL 2.0, the way BlitRGBtoRGBPixelAlpha() handled the alpha channel changed in an incompatible way. To be precise, in this commit:

https://github.com/libsdl-org/SDL/commit/89bc80f1ae5bd99db448eccbc19ba2981722da1f

There is no equivalent commit in SDL 1.2. My ARM assembly optimisations - both the SIMD and NEON versions - faithfully copied the SDL 1.2 behaviour (this was, after all, my primary target at the time). It looks like nobody, including myself, noticed the difference between the two branches.

First conclusion: I think the code can safely be re-enabled on SDL 1.2.

Having looked more closely at the SDL 2.0 code, I think the new handling of alpha is actually incorrect, according to its own specification. It's desirable that every colour component is treated identically (not least because it makes SIMD processing easier). Look at the least-significant byte of d1/s1 (blue) - we can ignore the AND mask because the intermediate value can't overflow for any combination of input values - we can rearrange

d1 + ((s1 - d1) * alpha >> 8)

(s1 * alpha + d1 * 0x100 - d1 * alpha) >> 8

If you substitute s1 with alpha and d1 with dalpha, this gives

(alpha * alpha + dalpha * 0x100 - dalpha * alpha) >> 8

By contrast, we can rearrange

alpha + (dalpha * (alpha ^ 0xFF) >> 8)

(0x100 * alpha + dalpha * 0xFF - dalpha * alpha) >> 8

These are not equivalent, except (almost) when alpha == 0xFF - but that's already been special-cased a few lines above!

Some worked examples:

source 0x0f0f0f0f, destination 0x00000000, SDL1.2 result 0x00000000, SDL2.0 result 0x0f000000

source 0x0f0f0f0f, destination 0xffffffff, SDL1.2 result 0xfff0f0f0, SDL2.0 result 0xfef0f0f0

Ideally, I'd have said that we should be aiming for every colour component to be treated the same, which would result in 0x00000000 and 0xf0f0f0f0 respectively for these examples. (There's also an argument that some rounding should be done rather than simple truncation of the 16-bit intermediate product, but I won't go into that now.)

If I were to re-work the SDL2.0 assembly, would it be acceptable to treat all components the same in this way? I wouldn't have to revisit it yet again if the equation is changed a year or two down the line. I can change BlitRGBtoRGBPixelAlpha()to match this behaviour easily enough, but would need some assistance from someone familiar with MMX and 3dNOW to make those cases match.

Jun 30 '21 21:06 sezero

@slouken, @icculus: This was marked as target-2.0.16 in original bugzilla: Needs you guys' love & care.

Jun 30 '21 22:06 sezero

If I may correct my criticism a bit (sorry, I'm a bit rusty on compositing...)

Treating every colour component the same is only possible when you use what's called premultiplied alpha. This is so common in other libraries that I've worked on that I forgot the situation was different with straight (non-premultiplied) alpha, which from the way BlitRGBtoRGBPixelAlpha() handles the RGB components, is clearly the case here.

For those not familiar with the distinction, there's a good description on Wikipedia. It's the difference between this for premultiplied alpha:

$premultiplied alpha formula$ $premultiplied colour formaula$

and this for straight alpha:

$straight alpha/colour formula$

(Those formulae for α_o are actually equivalent.)

The old SDL 1.2 code is consistent with straight alpha formula for C_o, if you assume α_b = α_o = 1. In other words, they're correct if you take the source image to be ARGB with straight alpha, and the destination image to be XRGB (i.e. assumed opaque, with the most-significant 8 bits ignored). The problem is that SDL 2.0 appears to be applying this function to cases where the most-significant 8 bits of the destination image are also interpreted as alpha, and the only part of the function that has been changed is that for calculating the alpha channel, not the RGB channels.

To be fair, the function name BlitRGBtoRGBPixelAlpha is ambiguous about whether both images had per-pixel alpha or not.

So the questions to be answered are:

Are you open to switching to using premultiplied alpha? This allows correct compositing (let's ignore gamma for now) without an expensive division for every pixel.
If not, is correctness preferred over speed? (i.e. is it OK to include the per-pixel division, or should it be left at least roughly the same as it is at present?)
If we're not switching to premultiplied alpha, and we're also not happy to introduce a division, do you require bit-exact results to the C version? The fact that inverse alpha is calculated in some places by subtracting from 0x100 and in others from 0xFF is ugly, and (at least in NEON) it's possible to apply rounding to the normalisation step for no extra cycle overhead, so both of those could be corrected.

[edit: corrected link to Wikipedia]

Jul 01 '21 00:07 bavison

I think we're out of time for 2.0.18 on this one, so I'm bumping to 2.0.20.

Nov 07 '21 17:11 icculus

Moving to SDL 3.0, to review our alpha blending handling.

@sezero, feel free to look at this sooner, but software rendering on an ARM device isn't really a priority given most devices are set up for hardware acceleration at this point. (Feel free to point out important exceptions!)

Jul 25 '22 21:07 slouken

Moving to SDL 3.0, to review our alpha blending handling.

@sezero, feel free to look at this sooner

It's not my thing at all, it's @bavison's: I just opened the ticket here after his comments.

Jul 25 '22 21:07 sezero

SDL SDL copied to clipboard

ARM assembly to address performance of blit and fill routines

SDL
SDL copied to clipboard