num-traits

FR: Implement Saturating ops for floating point numbers

Open Dietr1ch opened this issue 6 months ago • 3 comments

This threw me off because floats saturate at ±INF, but I'm not sure if subtleties around NaNs are what's keeping Saturation Operations from being implemented for f32 and f64.

Dietr1ch avatar May 28 '25 19:05 Dietr1ch

because floats saturate at ±INF

I'm not sure I agree -- infinity is more like an overflow in my mind. The saturated integer values are still finite (of course), but once you hit a float infinity you're kind of stuck there.
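
To make that distinction concrete, here is a minimal sketch using only the standard library. The helper names are made up for illustration. A saturated integer can still move back down, while a float that has overflowed to infinity cannot.

```rust
/// Hypothetical helper: push a u8 past its limit, then step back down.
fn int_round_trip() -> u8 {
    let saturated = 250u8.saturating_add(10); // clamps to u8::MAX (255)
    saturated.saturating_sub(10) // still finite, so we can move away: 245
}

/// Hypothetical helper: overflow an f32, then try to step back down.
fn float_round_trip() -> f32 {
    let overflowed = f32::MAX + f32::MAX; // rounds to +infinity
    overflowed - f32::MAX // infinity minus a finite value is still infinity
}
```

The integer ends up at 245, while the float stays infinite no matter which finite value is subtracted.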

cuviper avatar May 29 '25 17:05 cuviper

because floats saturate at ±INF

I'm not sure I agree -- infinity is more like an overflow in my mind. The saturated integer values are still finite (of course), but once you hit a float infinity you're kind of stuck there.

Yes, this is true, but saturating at f32::MAX (3.4028235e38) would be way more surprising IMO. I think that if I told someone that floats were saturating they'd think of infinities as the sinks.

The problem I'm coming from is summing costs while defending myself from overflows. That makes me prefer saturating_add and treat T::MAX as infinity, since I can't really tell whether the sum was exactly that value or whether the real value was lost.
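
A rough sketch of that cost-summing pattern, assuming u32 and f64 costs (the function names are hypothetical): with integers, T::MAX has to double as an "unknown/infinite" marker, whereas with floats the overflow already lands on a separate +∞ value.

```rust
/// Hypothetical helper: sum integer costs, treating u32::MAX as "infinite".
fn total_cost(costs: &[u32]) -> Option<u32> {
    let sum = costs.iter().fold(0u32, |acc, &c| acc.saturating_add(c));
    // u32::MAX is ambiguous: it may be the exact sum or a clamped overflow,
    // so this sketch reports it as unknown.
    if sum == u32::MAX {
        None
    } else {
        Some(sum)
    }
}

/// Hypothetical helper: the same sum over f64 costs. Overflow produces
/// +infinity, which is distinct from every finite total.
fn total_cost_f64(costs: &[f64]) -> f64 {
    costs.iter().sum()
}
```

Note that total_cost gives up the single case where the true sum is exactly u32::MAX. That trade-off is an assumption of this sketch, not something the crate prescribes.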

I guess the perspective is different if you just want saturation for things like colours, where u8::MAX simply maxes out the channel and you still get pure red/green/blue/white, since you physically can't go any further. Even there you might find artifacts after hitting the limit (I know this happens in audio, and it's why clipping is bad).


I thought of adding these tests that reflect on why floats are kind of saturating, but also kind of weird.

    #[test]
    fn floats_are_kind_of_saturating() {
        let f = 100.0f32;
        // You get capped at +∞ when adding
        assert_eq!(saturating_add(f, INF), INF);
        // You get capped at -∞ when subtracting
        assert_eq!(saturating_sub(f, INF), -INF);

        // You get capped at ±∞ when multiplying.
        // And you still get the appropriate sign.
        assert_eq!(saturating_mul(f, INF), INF);
        assert_eq!(saturating_mul(-f, INF), -INF);
    }

    #[test]
    fn float_are_kind_of_weird_saturation_is_sticky() {
        let f = 100.0f32;

        // You stay "stuck" at ±∞ adding
        assert_eq!(saturating_add(INF, f), INF);
        assert_eq!(saturating_add(INF, -f), INF);
        assert_eq!(saturating_add(-INF, f), -INF);
        assert_eq!(saturating_add(-INF, -f), -INF);
        // You stay "stuck" at ±∞ when subtracting
        assert_eq!(saturating_sub(INF, f), INF);
        assert_eq!(saturating_sub(INF, -f), INF);
        assert_eq!(saturating_sub(-INF, f), -INF);
        assert_eq!(saturating_sub(-INF, -f), -INF);
        // You stay "stuck" at ±∞ when multiplying.
        assert_eq!(saturating_mul(INF, f), INF);
        assert_eq!(saturating_mul(INF, INF), INF);
        // But keep the right sign
        assert_eq!(saturating_mul(INF, -f), -INF);
        assert_eq!(saturating_mul(-INF, f), -INF);
        assert_eq!(saturating_mul(-INF, -f), INF);
        assert_eq!(saturating_mul(INF, -INF), -INF);
        assert_eq!(saturating_mul(-INF, -INF), INF);
    }

    #[test]
    fn float_are_kind_of_weird_nan() {
        // They bail out when opposing infinities meet (∞ - ∞, ∞ * 0) and give you a NaN.
        assert!(saturating_add(INF, -INF).is_nan());
        assert!(saturating_add(-INF, INF).is_nan());

        assert!(saturating_sub(INF, INF).is_nan());
        assert!(saturating_sub(-INF, -INF).is_nan());

        assert!(saturating_mul(INF, 0.0).is_nan());
        assert!(saturating_mul(-INF, 0.0).is_nan());
    }

    #[test]
    fn float_are_kind_of_weird_max() {
        assert!(f32::MAX < f32::INFINITY);
    }
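
The tests above call free functions saturating_add, saturating_sub and saturating_mul, plus a constant INF, none of which are defined in the thread. A minimal sketch of what they could look like, assuming they simply defer to the native f32 operators (this is a guess at the intent, not an existing num-traits API):

```rust
// Assumed definitions; the thread does not show them.
const INF: f32 = f32::INFINITY;

/// Hypothetical: "saturating" float addition is just native addition,
/// which already overflows to ±infinity instead of wrapping.
fn saturating_add(a: f32, b: f32) -> f32 {
    a + b
}

/// Hypothetical: same idea for subtraction.
fn saturating_sub(a: f32, b: f32) -> f32 {
    a - b
}

/// Hypothetical: same idea for multiplication.
fn saturating_mul(a: f32, b: f32) -> f32 {
    a * b
}
```

With these definitions the infinities act as the sinks, and the NaN cases come from the native operators themselves.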

Dietr1ch avatar May 29 '25 18:05 Dietr1ch

Maybe looking at it from a different perspective, would it make sense to derive Saturating operations for all Bounded + Add/Sub/Mul types instead of only for u8-u128, usize and i8-i128, isize?

pub trait SaturatingAdd: Sized + Add<Self, Output = Self> {
    /// Saturating addition. Computes self + other, saturating at the relevant high or low boundary of the type.
    fn saturating_add(&self, v: &Self) -> Self;
}
pub trait Bounded {
    /// Returns the smallest --finite-- number this type can represent
    fn min_value() -> Self;
    /// Returns the largest --finite-- number this type can represent
    fn max_value() -> Self;
}
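
As a rough illustration of what deriving saturation from Bounded could mean, here is a self-contained sketch. The Bounded trait below is a local stand-in that mirrors the shape quoted above, and saturating_add_bounded is a hypothetical helper, not part of num-traits. It adds first and then clamps, so it only makes sense for types whose `+` cannot panic or wrap (floats); integers would need a checked addition instead.

```rust
use std::ops::Add;

/// Local stand-in mirroring the shape of the `Bounded` trait quoted above.
trait Bounded {
    fn min_value() -> Self;
    fn max_value() -> Self;
}

impl Bounded for f32 {
    fn min_value() -> Self {
        f32::MIN
    }
    fn max_value() -> Self {
        f32::MAX
    }
}

/// Hypothetical helper: add, then clamp into [min_value, max_value].
fn saturating_add_bounded<T>(a: T, b: T) -> T
where
    T: Bounded + PartialOrd + Add<Output = T>,
{
    let sum = a + b;
    if sum > T::max_value() {
        T::max_value()
    } else if sum < T::min_value() {
        T::min_value()
    } else {
        // NaN fails both comparisons and passes through unchanged.
        sum
    }
}
```

Under this definition, saturating_add_bounded(f32::INFINITY, 1.0) returns f32::MAX, so an infinite input becomes finite. That is the surprising behaviour of saturating at the finite bounds rather than at the infinities.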

Now the question becomes: why are f32 and f64 Bounded when, in the mathematical sense, they are unbounded? It seems it's only because of the additional constraint of being finite, but that means some_f32 > f32::max_value() can be true (for +∞), which makes it awkward to relate the Bounded trait to bounded numbers.

I also find that f32::MAX < f32::INFINITY shows that Rust 1.43 already chose to make things weird, and there's no way to fix that in any library. I guess that MAX_FINITE seemed like too much typing at the time and that aliasing f32::MAX to f32::INFINITY seemed silly.

Dietr1ch avatar Jun 02 '25 03:06 Dietr1ch