libm Add assembly version of simple operations on aarch64

For aarch64 and arm64ec with Neon, add assembly versions of the following:

ceil
ceilf
fabs
fabsf
floor
floorf
fma
fmaf
round
roundf
sqrt
sqrtf
trunc
truncf

If the fp16 target feature is available, which implies neon, also include the following:

ceilf16
fabsf16
floorf16
rintf16
sqrtf16
truncf16

Additionally, replace core::arch versions of the following with handwritten assembly (which avoids issues with aarch64be):

rint
rintf

Instructions for fmax and fmin are also available, but they seem to produce different results depending on whether NaN inputs are signaling or quiet. Our current implementation does not make that distinction, so omit these for now.
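For readers unfamiliar with the signaling/quiet distinction, here is a minimal sketch of how the two kinds of NaN can be told apart by bit pattern. It assumes the IEEE 754-2008 binary64 encoding used on aarch64 and x86-64, where the most significant mantissa bit is the "quiet" bit. The helper name is_signaling_nan is hypothetical and not part of libm.

```rust
/// Hypothetical helper (not part of libm): reports whether `x` is a
/// signaling NaN, assuming the IEEE 754-2008 binary64 encoding in which
/// the top mantissa bit (bit 51) is set for quiet NaNs and clear for
/// signaling NaNs.
pub fn is_signaling_nan(x: f64) -> bool {
    const QUIET_BIT: u64 = 1 << 51;
    x.is_nan() && (x.to_bits() & QUIET_BIT) == 0
}

/// Builds a signaling NaN for experiments: exponent all ones, quiet bit
/// clear, and a nonzero payload so the value is a NaN rather than infinity.
pub fn example_signaling_nan() -> f64 {
    f64::from_bits(0x7ff0_0000_0000_0001)
}
```

A helper like this could be used in tests to check whether a candidate fmax or fmin implementation treats the two kinds of NaN differently.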

Jan 23 '25 01:01 tgross35

@Amanieu would you mind double checking the assembly in src/math/arch/aarch64.rs? I am unsure whether preserves_flags should be set; I believe some of these operations may set flags based on the exception control register.
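As a rough illustration of the kind of wrapper in question, here is a sketch of sqrt written with inline assembly. The exact option set is an assumption, not necessarily what the PR uses. preserves_flags is left out on the assumption that fsqrt may update floating-point status flags. The non-aarch64 fallback exists only so the sketch compiles on other hosts.

```rust
/// Sketch of a single-instruction wrapper; assumes Neon is enabled so the
/// `vreg` register class is usable.
#[cfg(target_arch = "aarch64")]
pub fn sqrt(mut x: f64) -> f64 {
    // SAFETY: `fsqrt` reads and writes only the named register and does
    // not touch memory or the stack.
    unsafe {
        core::arch::asm!(
            "fsqrt {x:d}, {x:d}",
            x = inout(vreg) x,
            // `preserves_flags` is deliberately omitted: the instruction
            // may set floating-point exception status bits.
            options(nomem, nostack, pure),
        );
    }
    x
}

/// Fallback for other architectures, only so this sketch builds anywhere.
#[cfg(not(target_arch = "aarch64"))]
pub fn sqrt(x: f64) -> f64 {
    x.sqrt()
}
```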

Cc @hanna-kruppe, while I was working on the others I also replaced the rint vector implementation.

Jan 23 '25 02:01 tgross35

However, I'm then questioning how useful these are on hard-float targets: the standard library will invoke the LLVM intrinsic, which will lower to the instruction, so the libm function will never be called. If this is only for compiler-builtins then it might be better to keep libm soft-float only.

At least some of these are used internally within libm by functions that still need to exist on hard-float targets. For example, floor is used by rem_pio2_large which is needed by many trigonometric functions.

(Plus there are benefits for non-compiler-builtins consumers; they are not the main focus of this crate, but it's still nice to have.)

Jan 23 '25 11:01 hanna-kruppe

For the operations that are used internally, the ideal end state is for libm to use the float methods from core, which LLVM will then lower to the appropriate instructions.

Jan 23 '25 15:01 Amanieu

Is there any harm in taking the improvement now and revisiting once those methods are actually available in core?

Jan 23 '25 16:01 hanna-kruppe

My only motivation here is fma: some of the incoming CORE-math routines rely on it, and I wanted a more accurate icount comparison without soft fma before mul_add is available in core. Nothing else is important; I just included the other simple ops since they are reasonably trivial.
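For context on why routines depend on fma specifically, here is a small sketch of the single-rounding property, using the standard library's f64::mul_add as a stand-in. The function names and inputs are illustrative only and do not come from the CORE-math routines.

```rust
/// Fused multiply-add: computes a * b + c with a single rounding.
pub fn fused(a: f64, b: f64, c: f64) -> f64 {
    a.mul_add(b, c)
}

/// Separate multiply and add: the product is rounded before the addition.
pub fn unfused(a: f64, b: f64, c: f64) -> f64 {
    a * b + c
}

/// Illustrative inputs: (1 + 2^-30) and (1 - 2^-30). Their exact product
/// is 1 - 2^-60, which is not representable as an f64 and rounds to 1.0.
pub fn demo_inputs() -> (f64, f64) {
    let eps = 1.0 / ((1u64 << 30) as f64);
    (1.0 + eps, 1.0 - eps)
}
```

With these inputs and c = -1.0, the unfused form loses the low-order term entirely and returns zero, while the fused form returns the exact residual of -2^-60. Error-compensated algorithms rely on recovering exactly this kind of residual.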

Jan 24 '25 13:01 tgross35

I dropped most of this change but kept:

rint, because the SIMD calls are preexisting and are causing issues with cg_gcc
sqrt and fma, because they are used by a lot of other routines. This is mostly for direct users of libm until math in core is stable.

Apr 09 '25 02:04 tgross35
