libm icon indicating copy to clipboard operation
libm copied to clipboard

Add `fmaf16`

Open tgross35 opened this issue 11 months ago • 7 comments

Split from https://github.com/rust-lang/libm/pull/390 since I think the f128 version will be trickier.

tgross35 avatar Jan 11 '25 21:01 tgross35

@beetrees or @tspiteri would you be able to help me figure this out? The version of fmaf16 in this PR says fma(0x0001, 0x3bf7, 0x0000) = 0x0001 (6E-8 * 0.9956 + 0 = 6E-8, same result as x * y), which agrees with what LLVM does via fmaf. However, Rug is saying the result should be 0x0000. Is this usage incorrect?

#![feature(f16)]

use az::Az;
use rug::float::Round;

fn main() {
    let x = f16::from_bits(0x0001);
    let y = f16::from_bits(0x3bf7);
    let z = f16::from_bits(0x0000);

    let mut xf = rug::Float::with_val(11, x);
    let yf = rug::Float::with_val(11, y);
    let zf = rug::Float::with_val(11, z);

    let ordering = xf.mul_add_round(&yf, &zf, Round::Nearest);
    xf.subnormalize_ieee_round(ordering, Round::Nearest);
    let rug_res = (&xf).az::<f16>();

    println!("rug: {xf} {rug_res:?}");
    println!("std: {:?}", x.mul_add(y, z));
}

Prints:

rug: 5.9343e-8 0x0000
std: 0x0001

Also checked against Julia to get a third opinion, which agrees with 0x0001.

> x = fma(reinterpret(Float16, 0x0001), reinterpret(Float16, 0x3bf7), reinterpret(Float16, 0x0000))
Float16(6.0e-8)
> reinterpret(UInt16, x)
0x0001

tgross35 avatar Jan 22 '25 23:01 tgross35

It looks like what is happening there is that subnormalize{_ieee}{_round} propagate values smaller than the smallest subnormal unmodified, which is very unintuitive, but is also more or less behaving as documented. The docs of the _ieee -variants aren't explicit about what the subnormal range is, and I would have expected that to be just any exponent small enough.

I created an issue: https://gitlab.com/tspiteri/rug/-/issues/78

quaternic avatar Jan 23 '25 03:01 quaternic

Great find, thanks for looking into it. This is a problem with subnormalize then right, not specifically the IEEE versions?

From a quick check it seems like this could be worked around by reducing and reexpanding the precision, which shouldn't reallocate. Any idea if there is a better way?

let mut xf = rug::Float::with_val(24, f32::from_bits(1));
xf *= 0.75;
subnormalize(&mut xf);
xf *= 16.0;
subnormalize(&mut xf);
assert_eq!(xf.to_f32(), f32::from_bits(1) * 0.75 * 16.0);

fn subnormalize(xf: &mut rug::Float) {
    let e = xf.get_exp().unwrap();
    if e < -126 {
        xf.set_prec(24_u32.saturating_sub(-e as u32 - 126));
        xf.set_prec(24);
    }
}

tgross35 avatar Jan 23 '25 03:01 tgross35

The subnormalize_round was needed in the first place because it avoids double rounding by taking as input the direction that the Float was previously rounded. Any workaround will need to handle that as well. I'll add a specific test case to the rug issue.

Yes, all the variants of subnormalize have the same quirk, but the general ones (without _ieee) document it more explicitly.

quaternic avatar Jan 23 '25 05:01 quaternic

After the change mentioned in Rug issue 78 which is now in the master branch, the code above prints:

rug: 5.9605e-8 0x0001
std: 0x0001

tspiteri avatar Jan 23 '25 16:01 tspiteri

Thank you for the quick fix, it seems to work great! The failure in the extensive tests looks like a real failure (interestingly, Julia returns the same result though std's implementation returns the MPFR result).

tgross35 avatar Jan 23 '25 20:01 tgross35

---- mp_extensive_fmaf16 ----

    input:    (0x8f10, 0x0488, 0x83e8) (0x8f10, 0x0488, 0x83e8)
    expected: 0x83e9 0x83e9
    actual:   0x83e8 0x83e8

Caused by:
    ulp 1 > 0

Indeed the expected result is correct. This is a case of a * b + c where:

  • c is subnormal
  • a * b is just slightly larger than half the least subnormal
  • a * b + c, when computed in f32, rounds to the halfway point between c = 0x83e8 and the next representable number 0x83e9

I'm not entirely sure where exactly the algorithm goes wrong, but I think it is because the test for halfway cases doesn't account for the sum being subnormal, so the f32-result (which is not subnormal) has even more excess precision.

quaternic avatar Jan 23 '25 23:01 quaternic

Recreated as https://github.com/rust-lang/compiler-builtins/pull/861

tgross35 avatar Apr 20 '25 04:04 tgross35