Discuss: -ffast-math and other math optimization flags
This was brought up in https://github.com/libass/libass/pull/780#issuecomment-2148145960.
-ffast-math is shorthand for several flags:
-fno-math-errno (we decided in #780 that we definitely want this)
-funsafe-math-optimizations, which is shorthand for:
  -fno-signed-zeros (probably fine?)
  -fno-trapping-math (unclear what this would mean for us?)
  -fassociative-math (unsure)
  -freciprocal-math (unsure)
-ffinite-math-only (we have a couple isnans; not sure if this is safe wrt those?)
-fno-rounding-math (this is actually the default anyway?)
-fno-signaling-nans (this is also the default)
-fcx-limited-range (we don't use complex numbers)
-fexcess-precision=fast (this is the default)
So this leaves -fno-signed-zeros, -fno-trapping-math, -fassociative-math, -freciprocal-math, and -ffinite-math-only up for discussion. One simple test would be to build with and without each flag, and see what code changes and what perf impact they have.
I think if we have cases where -fassociative-math or -freciprocal-math help (and are safe), we should probably be explicitly adjusting our floating-point code to allow the compiler to produce optimal output even without those flags.
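For example, the main effect of -freciprocal-math can often be written out by hand where we know it's safe (illustrative sketch, not existing libass code):

#include <stddef.h>

// -freciprocal-math would itself rewrite x[i] / d as x[i] * (1.0 / d);
// doing it explicitly gives one division per call instead of one per
// element, without needing the flag.
static void scale_all(double *x, size_t n, double d)
{
    double inv = 1.0 / d;
    for (size_t i = 0; i < n; i++)
        x[i] *= inv;
}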
I think most if not all flags would be safe to use. For -ffinite-math-only it mostly means we have to be careful about dividing by zero, which is the most common way of producing NaNs. Either way, a library like libass hardly needs IEEE-compliant floats. Even -ffast-math will probably work in most cases, though from experience it is generally better to avoid it unless we can say with confidence that the libass code is good for it. Better to start little by little and enable only the safe(r) flags. -ffast-math can be deceptive: it gives a nice performance boost and things look fine at first sight, but after a while you start hitting corner cases.
-fno-signed-zeros is safe to enable
-fno-trapping-math is probably also safe; I'm not familiar with the libass codebase, but I imagine trapping math is not needed.
-fassociative-math and -freciprocal-math will increase error and start producing floats that are a bit off in some cases. The question is whether that is a problem. I would be careful with those, but probably nothing catastrophic will happen.
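As a tiny illustration of the kind of drift reassociation licenses (standalone demo, not libass code; the volatile only stops constant folding):

#include <stdio.h>

int main(void)
{
    volatile double big = 1e16;
    double strict  = (big + 1.0) - big; // 0.0: the 1.0 is rounded away first
    double reassoc = (big - big) + 1.0; // 1.0: what a reassociating compiler may compute instead
    printf("%g vs %g\n", strict, reassoc);
    return 0;
}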
-funsafe-math-optimizations looks safe to me. We don't use float hacks, only treat them as approximations of real numbers.
-ffinite-math-only: for the isnan() in dtoi32() it's possible to rearrange the condition so it works without it:
if (!(val > INT32_MIN && val < INT32_MAX + 1LL)) // comparisons with NaN are false, so this also rejects NaN
And I think that the isnan() in ass_outline_update_min_transformed_x() can be triggered only for a broken matrix with NaNs. In that case the exact value of min_x doesn't matter due to the subsequent failure in quantize_transform(), so we can replace FFMINMAX with some code that ensures there's no UB in the lrint() call.
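For illustration, such a replacement could look like this (hypothetical helper, untested): a clamp written so that a NaN still falls through to a definite value before reaching lrint():

// Illustrative only: unlike FFMINMAX, this returns "lo" for NaN inputs
// (all comparisons with NaN are false), so lrint() never sees NaN.
static inline double clamp_for_lrint(double v, double lo, double hi)
{
    return v > lo ? (v < hi ? v : hi) : lo;
}

Whether a compiler under -ffinite-math-only keeps the NaN path is a separate question, but as noted above the exact value wouldn't matter there anyway.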
I think both of our uses of isnan are impossible to hit in practice (ass_strtod can't return NaN); however, ass_strtod can probably return infinity, and we want to catch that.
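Combining that with the rearranged dtoi32 check above, a hedged sketch (the actual dtoi32 differs; the name and fallback parameter here are illustrative) of why the same range check also handles infinity:

#include <stdint.h>

// Sketch only, not the real libass dtoi32. Under strict IEEE semantics the
// condition rejects NaN (both comparisons false) and +/-infinity (one
// comparison false) along with ordinary out-of-range values, so neither
// isnan() nor isinf() is needed.
static inline int32_t dtoi32_checked(double val, int32_t fallback)
{
    if (!(val > INT32_MIN && val < INT32_MAX + 1LL))
        return fallback;
    return (int32_t) val;
}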
The docs on -ffinite-math-only are somewhat unclear, but I've just compared clang's output with it vs without when building libass for arm64:
The most common change is that fmax is replaced with fmaxnm (which is identical apart from NaN behavior); same with fmin/fminnm. Not sure why the compiler's so eager to make that change, but sure.
The largest difference is surprising: the compiler seems to be much less aggressive about inlining floating-point math routines, and inserts calls to isnan where they'd otherwise have been optimized out. This seems bad! For instance:
<_change_alpha>:
ucvtf d1, w1
ldr b2, [x0]
ucvtf d2, d2
fmov d3, #1.00000000
fsub d3, d3, d0
fmul d0, d1, d0
fmadd d0, d3, d2, d0
mov x8, #-0x3e20000000000000
fmov d1, x8
fcmp d0, d1
fccmp d0, d0, #0x1, hi
mov x8, #0x41e0000000000000
fmov d1, x8
fccmp d0, d1, #0x0, vc
fcvtzs w8, d0
csel w8, wzr, w8, ge
strb w8, [x0]
ret
->
<_change_alpha>:
stp d9, d8, [sp, #-0x30]!
stp x20, x19, [sp, #0x10]
stp x29, x30, [sp, #0x20]
add x29, sp, #0x20
mov x19, x0
ldr w8, [x0]
and w20, w8, #0xffffff00
and w8, w8, #0xff
ucvtf d1, w1
ucvtf d2, w8
fmov d3, #1.00000000
fsub d3, d3, d0
fmul d0, d1, d0
fmadd d8, d3, d2, d0
fmov d0, d8
bl 0x2a09c [rcombs note: symbol stub for isnand]
mov x8, #0x41e0000000000000
fmov d0, x8
fcmp d8, d0
mov x8, #-0x3e20000000000000
fmov d0, x8
fccmp d8, d0, #0x4, lt
ccmp w0, #0x0, #0x0, gt
fcvtzs w8, d8
and w8, w8, #0xff
csel w8, wzr, w8, ne
orr w8, w8, w20
str w8, [x19]
ldp x29, x30, [sp, #0x20]
ldp x20, x19, [sp, #0x10]
ldp d9, d8, [sp], #0x30
ret
I also saw a few changes like this, which seems like an actual positive improvement (on Firestorm, fcmp+fcsel is 7 cycles, vs 2 cycles for fmax[nm]):
- fcmp d0, d6
- fcsel d0, d0, d6, gt
+ fmaxnm d0, d0, d6
However, this same improvement can be had with a change like this:
diff --git a/libass/ass_utils.h b/libass/ass_utils.h
index a1beb769..ebc44e78 100644
--- a/libass/ass_utils.h
+++ b/libass/ass_utils.h
@@ -43,8 +43,17 @@
#define MSGL_V 6
#define MSGL_DBG2 7
-#define FFMAX(a,b) ((a) > (b) ? (a) : (b))
-#define FFMIN(a,b) ((a) > (b) ? (b) : (a))
+#define FFMAX(a, b) _Generic(((a)+(b)), \
+ double: fmax((a), (b)), \
+ float: fmaxf((a), (b)), \
+ default: ((a) > (b) ? (a) : (b)) \
+)
+#define FFMIN(a, b) _Generic(((a)+(b)), \
+ double: fmin((a), (b)), \
+ float: fminf((a), (b)), \
+ default: ((a) > (b) ? (b) : (a)) \
+)
+
#define FFMINMAX(c,a,b) FFMIN(FFMAX(c, a), b)
#define ASS_PI 3.14159265358979323846
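As a standalone sanity check of the _Generic dispatch (illustration only; it assumes <math.h> is included wherever ass_utils.h is used):

#include <math.h>
#include <stdio.h>

#define FFMAX(a, b) _Generic(((a)+(b)), \
    double: fmax((a), (b)), \
    float: fmaxf((a), (b)), \
    default: ((a) > (b) ? (a) : (b)) \
)

int main(void)
{
    printf("%f\n", FFMAX(1.5, 2.5)); // double: dispatches to fmax()
    printf("%d\n", FFMAX(3, 4));     // int: falls back to the plain conditional
    return 0;
}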
re -ffinite-math-only we have e.g. https://github.com/libass/libass/issues/447#issuecomment-719155503 where NaNs are apparently relied upon to detect bogus values, and there might have been more such cases (-fsanitize=float-divide-by-zero is intentionally omitted from CI at the moment due to this). This doesn't seem safe to enable right now.
Except for errno and traps, whether or not we rely on standard-compliant behaviour isn't that easy to search for. I think there was also some discussion on whether or not to rely on IEEE float behaviour to optimise something in #483.
The largest difference is surprising: the compiler seems to be much less aggressive about inlining floating-point math routines, and inserts calls to isnan where they'd otherwise have been optimized out. This seems bad! For instance:
Are you sure you didn't flip the results?
-fno-finite-math-only
change_alpha:
ucvtf d31, w1
ldr w1, [x0]
fmov d30, 1.0e+0
and w2, w1, 255
and w1, w1, -256
fsub d30, d30, d0
fmul d0, d31, d0
scvtf d31, w2
fmadd d0, d31, d30, d0
fcmp d0, d0
bvs .L2
mov x2, -4476578029606273024
fmov d31, x2
fcmpe d0, d31
bls .L2
mov x2, 4746794007248502784
fmov d31, x2
fcmpe d0, d31
bge .L2
fcvtzs w2, d0
and w2, w2, 255
orr w1, w1, w2
.L2:
str w1, [x0]
ret
-ffinite-math-only
change_alpha:
ucvtf d31, w1
mov x2, -4476578029606273024
ldr w1, [x0]
fmov d29, x2
mov x2, 4746794007248502784
fmov d28, 1.0e+0
fmov d30, x2
and w2, w1, 255
fsub d28, d28, d0
fmul d0, d31, d0
scvtf d31, w2
and w1, w1, -256
mov w3, w1
fmadd d0, d31, d28, d0
fcmpe d0, d29
fcvtzs w2, d0
fccmpe d0, d30, 0, hi
and w2, w2, 255
orr w1, w1, w2
csel w1, w3, w1, ge
str w1, [x0]
ret
However, this same improvement can be had with a change like this:
One thing is to show the intent to the compiler, and another is to let it do the job. Indeed, for floating-point values it is generally better to use fmax, but in practice the compiler should recognize the pattern.
Are you sure you didn't flip the results?
I thought I might've, but I double-checked!
in practice the compiler should recognize the pattern.
The pattern as written (((a) > (b) ? (a) : (b))) is not permitted to compile to the fmax instruction if a or b may be NaN: the existing macro must return b if a is NaN, but fmax returns NaN. It can't use fmaxnm either: if b is NaN, it must return b (i.e. NaN), but fmaxnm would return a if a is non-NaN.
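Spelled out, for the case where exactly one operand is a quiet NaN:

macro ((a) > (b) ? (a) : (b)):  a = NaN -> b      b = NaN -> b (NaN)
AArch64 fmax:                   a = NaN -> NaN    b = NaN -> NaN
AArch64 fmaxnm:                 a = NaN -> b      b = NaN -> a

fmax disagrees with the macro in the first column, fmaxnm in the second.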
On x86 at least it is compiled directly to a single max instruction:
double foo(double a, double b) {
return ((a) > (b) ? (a) : (b));
}
foo: # @foo
vmaxsd xmm0, xmm0, xmm1
ret
If this behavior is indeed not universal across platforms, then using fmax directly would help.
vmaxsd is more amenable to this optimization:
If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN of either source operand be returned, the action of MAXSD can be emulated using a sequence of instructions, such as, a comparison followed by AND, ANDN, and OR.
This is exactly the behavior of ((a) > (b) ? (a) : (b)), so we get the single instruction. ARM doesn't have a max instruction that prefers the second operand like that, so we're better off using fmax there. However, looking deeper, I see that fmax doesn't always get optimized down to maxsd on x86, so we're better off using the conditional there; if we add the generic, we should make the fmax paths platform-conditional.
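i.e. something along these lines (untested sketch; the __aarch64__ guard is just a first guess at the right condition, and FFMIN would be handled analogously):

#include <math.h>

// On AArch64, fmax() lowers to a cheap fmaxnm; on x86 the plain
// conditional already becomes a single (v)maxsd, so keep it there.
#if defined(__aarch64__)
#define FFMAX(a, b) _Generic(((a)+(b)), \
    double: fmax((a), (b)), \
    float: fmaxf((a), (b)), \
    default: ((a) > (b) ? (a) : (b)) \
)
#else
#define FFMAX(a, b) ((a) > (b) ? (a) : (b))
#endif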
Seeing as this has been brought up again in #806, FWIW I continue to think that even NaNs don’t actually matter to libass and we could/should just be using -ffast-math etc., as I remember MrSmile saying for years he’s been doing this on his own machine. Where we do have NaN checks in our code, they’re purely/mainly defensive. Where NaN semantics have been brought up in earlier discussions, the context has been exceptional situations that shouldn’t happen to begin with and NaNs merely didn’t make matters even worse.
Of course, an abundance of caution is always good and a careful audit of the relevant code before the switch is flipped won’t hurt. But I don’t think “libass depends on NaNs for correctness” is an accurate statement: I don’t think we ever intended to do this, and we should have no reason to do so.
But I don’t think “libass depends on NaNs for correctness” is an accurate statement
I agree. I only mentioned it before because there are isnan checks, which would be no-ops in "fast" mode, so for completeness this part would need changes, that's all.
I’ve been linked (in an unrelated context) to this article, and it claims/points out that -ffast-math apparently makes shared libraries enable subnormal truncation when loaded into a process even if that process itself has -fno-fast-math. Which, of course, is a very wrong thing to do. If this is true, then we certainly don’t want to use this flag.
If this is true
It is/was true, but since gcc 13 and clang 19 the startup code is omitted when linking the final binary with -shared. See: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522 and https://github.com/llvm/llvm-project/issues/57589
I suspect the compiled code will still assume the environment matches what the startup code would have created, which, if not true, might lead to new bugs in the -ffast-math code.
As I said before, I believe most of these options are not predictable enough to enable by default, and this seems to further confirm it. In particular, our existing code definitely cannot work with -ffinite-math-only. Potentially sufficiently safe options for us, other than compiler defaults, are:
-fno-math-errno (already enabled; surprisingly it also affects malloc's errno, but we're not checking this anyway)
-fno-trapping-math
-fassociative-math (unless our rasterization or blur algorithm relies on compensated arithmetic?)
OMG, whoever decided that disabling denormals is a good "optimization" was not in their right mind. Yes, it's fast (on badly designed hardware), but I can't call a potential loss of precision of up to 2^52 ULP an optimization. Especially if it's done by modifying global state. I should remove -Ofast from my other projects too. Hope somebody introduces a -fsane-math flag without such shenanigans...
unless our rasterization or blur algorithm relies on compensated arithmetic?
No, there shouldn't be bithacks for FP math anywhere, AFAIK.
Compensated arithmetic isn't about bithacks, but about ways to track and later compensate for arithmetic precision error. With associative math, computing the compensation will most likely fail (it gets optimised away to 0). A simple example: compensated summation.
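For reference, the textbook instance (not libass code) is Kahan compensated summation, where -fassociative-math may fold the compensation to zero:

#include <stddef.h>

// Kahan summation: "c" carries the low-order bits lost by each addition.
// Under -fassociative-math the compiler may reassociate ((t - sum) - y)
// to 0.0 and delete the compensation entirely.
static double kahan_sum(const double *x, size_t n)
{
    double sum = 0.0, c = 0.0;
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c;
        double t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    return sum;
}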