Discuss: -ffast-math and other math optimization flags
This was brought up in https://github.com/libass/libass/pull/780#issuecomment-2148145960.
-ffast-math is shorthand for several flags:
-fno-math-errno (we decided in #780 that we definitely want this)
-funsafe-math-optimizations, which is shorthand for:
  -fno-signed-zeros (probably fine?)
  -fno-trapping-math (unclear what this would mean for us?)
  -fassociative-math (unsure)
  -freciprocal-math (unsure)
-ffinite-math-only (we have a couple isnans; not sure if this is safe wrt those?)
-fno-rounding-math (this is actually the default anyway?)
-fno-signaling-nans (this is also the default)
-fcx-limited-range (we don't use complex numbers)
-fexcess-precision=fast (this is the default)
So this leaves -fno-signed-zeros, -fno-trapping-math, -fassociative-math, -freciprocal-math, and -ffinite-math-only up for discussion. One simple test would be to build with and without each flag, and see what code changes and what perf impact they have.
I think if we have cases where -fassociative-math or -freciprocal-math help (and are safe), we should probably be explicitly adjusting our floating-point code to allow the compiler to produce optimal output even without those flags.
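For example, the main effect of -freciprocal-math can often be written out by hand where we know it's safe (illustrative sketch, not existing libass code):

#include <stddef.h>

// -freciprocal-math would itself rewrite x[i] / d as x[i] * (1.0 / d);
// doing it explicitly gives one division per call instead of one per
// element, without needing the flag.
static void scale_all(double *x, size_t n, double d)
{
    double inv = 1.0 / d;
    for (size_t i = 0; i < n; i++)
        x[i] *= inv;
}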
I think most if not all flags would be safe to use. For -ffinite-math-only it mostly means we have to be careful about dividing by zero, which is the most common way of producing NaNs. Either way, a library like libass hardly needs IEEE-compliant floats. Even -ffast-math will probably work in most cases, though from experience it is generally better to avoid it unless we can say with confidence that the libass code is good for it. Better to start little by little and enable only the safe(r) flags. -ffast-math can be deceptive: it gives a nice performance boost and things look fine at first sight, but after a while you start hitting corner cases.
-fno-signed-zeros is safe to enable
-fno-trapping-math is probably also safe; I'm not familiar with the libass codebase, but I imagine trapping math is not needed.
-fassociative-math and -freciprocal-math will increase error and start producing floats that are a bit off in some cases. The question is whether that is a problem. I would be careful with those, but probably nothing catastrophic will happen.
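As a tiny illustration of the kind of drift reassociation licenses (standalone demo, not libass code; the volatile only stops constant folding):

#include <stdio.h>

int main(void)
{
    volatile double big = 1e16;
    double strict  = (big + 1.0) - big; // 0.0: the 1.0 is rounded away first
    double reassoc = (big - big) + 1.0; // 1.0: what a reassociating compiler may compute instead
    printf("%g vs %g\n", strict, reassoc);
    return 0;
}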
-funsafe-math-optimizations looks safe to me. We don't use float hacks, only treat them as approximations of real numbers.
-ffinite-math-only: for the isnan() in dtoi32() it's possible to rearrange the condition so it works without it:
if (!(val > INT32_MIN && val < INT32_MAX + 1LL)) // comparisons with NaN are false, so this also rejects NaN
And I think that the isnan() in ass_outline_update_min_transformed_x() can be triggered only for a broken matrix with NaNs. In that case the exact value of min_x doesn't matter due to the subsequent failure in quantize_transform(), so we can replace FFMINMAX with some code that ensures there's no UB in the lrint() call.
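For illustration, such a replacement could look like this (hypothetical helper, untested): a clamp written so that a NaN still falls through to a definite value before reaching lrint():

// Illustrative only: unlike FFMINMAX, this returns "lo" for NaN inputs
// (all comparisons with NaN are false), so lrint() never sees NaN.
static inline double clamp_for_lrint(double v, double lo, double hi)
{
    return v > lo ? (v < hi ? v : hi) : lo;
}

Whether a compiler under -ffinite-math-only keeps the NaN path is a separate question, but as noted above the exact value wouldn't matter there anyway.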
I think both of our uses of isnan are impossible to hit in practice (ass_strtod can't return NaN); however, ass_strtod can probably return infinity, and we want to catch that.
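Combining that with the rearranged dtoi32 check above, a hedged sketch (the actual dtoi32 differs; the name and fallback parameter here are illustrative) of why the same range check also handles infinity:

#include <stdint.h>

// Sketch only, not the real libass dtoi32. Under strict IEEE semantics the
// condition rejects NaN (both comparisons false) and +/-infinity (one
// comparison false) along with ordinary out-of-range values, so neither
// isnan() nor isinf() is needed.
static inline int32_t dtoi32_checked(double val, int32_t fallback)
{
    if (!(val > INT32_MIN && val < INT32_MAX + 1LL))
        return fallback;
    return (int32_t) val;
}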
The docs on -ffinite-math-only are somewhat unclear, but I've just compared clang's output with it vs without when building libass for arm64:
The most common change is that fmax is replaced with fmaxnm (which is identical apart from NaN behavior); same with fmin/fminnm. Not sure why the compiler's so eager to make that change, but sure.
The largest difference is surprising: the compiler seems to be much less aggressive about inlining floating-point math routines, and inserts calls to isnan where they'd otherwise have been optimized out. This seems bad! For instance:
<_change_alpha>:
ucvtf d1, w1
ldr b2, [x0]
ucvtf d2, d2
fmov d3, #1.00000000
fsub d3, d3, d0
fmul d0, d1, d0
fmadd d0, d3, d2, d0
mov x8, #-0x3e20000000000000
fmov d1, x8
fcmp d0, d1
fccmp d0, d0, #0x1, hi
mov x8, #0x41e0000000000000
fmov d1, x8
fccmp d0, d1, #0x0, vc
fcvtzs w8, d0
csel w8, wzr, w8, ge
strb w8, [x0]
ret
->
<_change_alpha>:
stp d9, d8, [sp, #-0x30]!
stp x20, x19, [sp, #0x10]
stp x29, x30, [sp, #0x20]
add x29, sp, #0x20
mov x19, x0
ldr w8, [x0]
and w20, w8, #0xffffff00
and w8, w8, #0xff
ucvtf d1, w1
ucvtf d2, w8
fmov d3, #1.00000000
fsub d3, d3, d0
fmul d0, d1, d0
fmadd d8, d3, d2, d0
fmov d0, d8
bl 0x2a09c [rcombs note: symbol stub for isnand]
mov x8, #0x41e0000000000000
fmov d0, x8
fcmp d8, d0
mov x8, #-0x3e20000000000000
fmov d0, x8
fccmp d8, d0, #0x4, lt
ccmp w0, #0x0, #0x0, gt
fcvtzs w8, d8
and w8, w8, #0xff
csel w8, wzr, w8, ne
orr w8, w8, w20
str w8, [x19]
ldp x29, x30, [sp, #0x20]
ldp x20, x19, [sp, #0x10]
ldp d9, d8, [sp], #0x30
ret
I also saw a few changes like this, which seems like an actual positive improvement (on Firestorm, fcmp+fcsel is 7 cycles, vs 2 cycles for fmax[nm]):
- fcmp d0, d6
- fcsel d0, d0, d6, gt
+ fmaxnm d0, d0, d6
However, this same improvement can be had with a change like this:
diff --git a/libass/ass_utils.h b/libass/ass_utils.h
index a1beb769..ebc44e78 100644
--- a/libass/ass_utils.h
+++ b/libass/ass_utils.h
@@ -43,8 +43,17 @@
#define MSGL_V 6
#define MSGL_DBG2 7
-#define FFMAX(a,b) ((a) > (b) ? (a) : (b))
-#define FFMIN(a,b) ((a) > (b) ? (b) : (a))
+#define FFMAX(a, b) _Generic(((a)+(b)), \
+ double: fmax((a), (b)), \
+ float: fmaxf((a), (b)), \
+ default: ((a) > (b) ? (a) : (b)) \
+)
+#define FFMIN(a, b) _Generic(((a)+(b)), \
+ double: fmin((a), (b)), \
+ float: fminf((a), (b)), \
+ default: ((a) > (b) ? (b) : (a)) \
+)
+
#define FFMINMAX(c,a,b) FFMIN(FFMAX(c, a), b)
#define ASS_PI 3.14159265358979323846
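As a standalone sanity check of the _Generic dispatch (illustration only; it assumes <math.h> is included wherever ass_utils.h is used):

#include <math.h>
#include <stdio.h>

#define FFMAX(a, b) _Generic(((a)+(b)), \
    double: fmax((a), (b)), \
    float: fmaxf((a), (b)), \
    default: ((a) > (b) ? (a) : (b)) \
)

int main(void)
{
    printf("%f\n", FFMAX(1.5, 2.5)); // double: dispatches to fmax()
    printf("%d\n", FFMAX(3, 4));     // int: falls back to the plain conditional
    return 0;
}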
re -ffinite-math-only we have e.g. https://github.com/libass/libass/issues/447#issuecomment-719155503 where NaNs are apparently relied upon to detect bogus values, and there might have been more such cases (-fsanitize=float-divide-by-zero is intentionally omitted from CI at the moment due to this). This doesn't seem safe to enable right now.
Except for errno and traps, whether or not we rely on standard-compliant behaviour isn't that easy to search for. I think there was also some discussion on whether or not to rely on IEEE float behaviour to optimise something in #483.
The largest difference is surprising: the compiler seems to be much less aggressive about inlining floating-point math routines, and inserts calls to isnan where they'd otherwise have been optimized out. This seems bad! For instance:
Are you sure you didn't flip the results?
-fno-finite-math-only
change_alpha:
ucvtf d31, w1
ldr w1, [x0]
fmov d30, 1.0e+0
and w2, w1, 255
and w1, w1, -256
fsub d30, d30, d0
fmul d0, d31, d0
scvtf d31, w2
fmadd d0, d31, d30, d0
fcmp d0, d0
bvs .L2
mov x2, -4476578029606273024
fmov d31, x2
fcmpe d0, d31
bls .L2
mov x2, 4746794007248502784
fmov d31, x2
fcmpe d0, d31
bge .L2
fcvtzs w2, d0
and w2, w2, 255
orr w1, w1, w2
.L2:
str w1, [x0]
ret
-ffinite-math-only
change_alpha:
ucvtf d31, w1
mov x2, -4476578029606273024
ldr w1, [x0]
fmov d29, x2
mov x2, 4746794007248502784
fmov d28, 1.0e+0
fmov d30, x2
and w2, w1, 255
fsub d28, d28, d0
fmul d0, d31, d0
scvtf d31, w2
and w1, w1, -256
mov w3, w1
fmadd d0, d31, d28, d0
fcmpe d0, d29
fcvtzs w2, d0
fccmpe d0, d30, 0, hi
and w2, w2, 255
orr w1, w1, w2
csel w1, w3, w1, ge
str w1, [x0]
ret
However, this same improvement can be had with a change like this:
One thing is to show the intent to the compiler, and another is to let it do the job. Indeed, for floating-point values it is generally better to use fmax, but in practice the compiler should recognize the pattern.
Are you sure you didn't flip the results?
I thought I might've, but I double-checked!
in practice the compiler should recognize the pattern.
The pattern as written (((a) > (b) ? (a) : (b))) is not permitted to compile to the fmax instruction if a or b may be NaN: the existing macro must return b if a is NaN, but fmax returns NaN. It can't use fmaxnm either: if b is NaN, it must return b (i.e. NaN), but fmaxnm would return a if a is non-NaN.
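Spelled out, for the case where exactly one operand is a quiet NaN:

macro ((a) > (b) ? (a) : (b)):  a = NaN -> b      b = NaN -> b (NaN)
AArch64 fmax:                   a = NaN -> NaN    b = NaN -> NaN
AArch64 fmaxnm:                 a = NaN -> b      b = NaN -> a

fmax disagrees with the macro in the first column, fmaxnm in the second.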
On x86 at least it is compiled directly to a single max instruction:
double foo(double a, double b) {
return ((a) > (b) ? (a) : (b));
}
foo: # @foo
vmaxsd xmm0, xmm0, xmm1
ret
If this behavior is indeed not universal across platforms, then using fmax directly would help.
vmaxsd is more amenable to this optimization:
If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN of either source operand be returned, the action of MAXSD can be emulated using a sequence of instructions, such as, a comparison followed by AND, ANDN, and OR.
This is exactly the behavior of ((a) > (b) ? (a) : (b)), so we get the single instruction. ARM doesn't have a max instruction that prefers the second operand like that, so we're better off using fmax there. However, looking deeper, I see that fmax doesn't always get optimized down to maxsd on x86, so we're better off using the conditional there; if we add the generic, we should make the fmax paths platform-conditional.
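i.e. something along these lines (untested sketch; the __aarch64__ guard is just a first guess at the right condition, and FFMIN would be handled analogously):

#include <math.h>

// On AArch64, fmax() lowers to a cheap fmaxnm; on x86 the plain
// conditional already becomes a single (v)maxsd, so keep it there.
#if defined(__aarch64__)
#define FFMAX(a, b) _Generic(((a)+(b)), \
    double: fmax((a), (b)), \
    float: fmaxf((a), (b)), \
    default: ((a) > (b) ? (a) : (b)) \
)
#else
#define FFMAX(a, b) ((a) > (b) ? (a) : (b))
#endif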
Seeing as this has been brought up again in #806, FWIW I continue to think that even NaNs don’t actually matter to libass and we could/should just be using -ffast-math etc., as I remember MrSmile saying for years he’s been doing this on his own machine. Where we do have NaN checks in our code, they’re purely/mainly defensive. Where NaN semantics have been brought up in earlier discussions, the context has been exceptional situations that shouldn’t happen to begin with and NaNs merely didn’t make matters even worse.
Of course, an abundance of caution is always good and a careful audit of the relevant code before the switch is flipped won’t hurt. But I don’t think “libass depends on NaNs for correctness” is an accurate statement: I don’t think we ever intended to do this, and we should have no reason to do so.
But I don’t think “libass depends on NaNs for correctness” is an accurate statement
I agree. I only mentioned it before because there are isnan checks, which would be no-ops in "fast" mode, so for completeness this part would need changes, that's all.
I’ve been linked (in an unrelated context) to this article, and it claims/points out that -ffast-math apparently makes shared libraries enable subnormal truncation when loaded into a process even if that process itself has -fno-fast-math. Which, of course, is a very wrong thing to do. If this is true, then we certainly don’t want to use this flag.
If this is true
It is/was true, but since gcc 13 and clang 19 the startup code is omitted when linking the final binary with -shared. See: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522 and https://github.com/llvm/llvm-project/issues/57589
I suspect the compiled code will still assume the environment matches what the startup code would have created, which, if not true, might lead to new bugs in the -ffast-math code.
As I said before, I believe most of these options are not predictable enough to enable by default, and this seems to further confirm it. In particular, our existing code definitely cannot work with -ffinite-math-only. Potentially sufficiently safe options for us, other than compiler defaults, are:
-fno-math-errno (already enabled; surprisingly it also affects malloc's errno, but we're not checking this anyway)
-fno-trapping-math
-fassociative-math (unless our rasterization or blur algorithm relies on compensated arithmetic?)
OMG, whoever decided that disabling denormals is a good "optimization" was not in their right mind. Yes, it's fast (on badly designed hardware), but I can't call a potential loss of precision of up to 2^52 ULP an optimization. Especially if it's done by modifying global state. I should remove -Ofast from my other projects too. Hope somebody introduces a -fsane-math flag without such shenanigans...
unless our rasterization or blur algorithm relies on compensated arithmetic?
No, there shouldn't be bithacks for FP math anywhere, AFAIK.
Compensated arithmetic isn't about bithacks, but about ways to track and later compensate for arithmetic precision error. With associative math, computing the compensation will most likely fail (it gets optimised away to 0). A simple example: compensated summation.
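For reference, the textbook instance (not libass code) is Kahan compensated summation, where -fassociative-math may fold the compensation to zero:

#include <stddef.h>

// Kahan summation: "c" carries the low-order bits lost by each addition.
// Under -fassociative-math the compiler may reassociate ((t - sum) - y)
// to 0.0 and delete the compensation entirely.
static double kahan_sum(const double *x, size_t n)
{
    double sum = 0.0, c = 0.0;
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c;
        double t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    return sum;
}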