icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

Develop Sign, Unsign, Nan & Inf for FixedDecimal

Open younies opened this issue 1 year ago • 4 comments

In many application for units (especially, mixed units), There are a need to represent the numbers as Inf, Neg, Pos ... etc.

Examples:

  1. 3 feet, and 11 inches --> could not be 3 feet and -11 inches
  2. 5:03:04 --> could not be 5:-03:+04
  3. Inf L/100-km
  4. ... etc.

Therefore, we need a way to represent Inf, Neg, Nan ... etc in FixedDecimal

Options for expressing mixed units:

  1. List of FixedUnsignedDecimal
    1. New type; FixedDecimal contains it as an inner field
    2. FixedDecimal gets a generic parameter
    3. Compositional fixed decimal types, like Signed<T>
  2. List of FixedDecimal, take absolute value of each before formatting
  3. List of FixedDecimal, show sign when formatting: 3 feet, -11 inches

younies avatar Jun 15 '24 15:06 younies

  • @younies - Preventing surprising behavior is important. I don't like to mix semantics. If we use FixedDecimal, we have to document this and it can be surpring for the user.
  • @echeran - Mixed units clearly want to be able to use unsigned decimals.
  • @robertbastian - I think FixedDecimal should be unsigned, and we can have a SignedFixedDecimal. In most instances, it makes sense to be unsigned.
  • @Manishearth - We could have Signed<T>
  • @robertbastian - Signed formatting is "extra"; it is always a variant on a positive number format. We could even split the negative patterns into a separate marker that doesn't need to be loaded if you're only doing positive
  • @sffc - FixedDecimal still needs to support NaN and plus and minus infinity. That's an additional superset of the Unsigned fixed decimal. You need that for floating point and u128
  • @echeran - I think the naming and relation aspect makes sense. Whatever is the base is the simpler name. On the other hand, it seems strange that if you do arithmetic, you would not get a negative number.
  • @Manishearth - What does ECMA-402 do?
  • @Manishearth - ECMA-402 doesn't support unit formatting. The closest it gets to is duration formatting. The proposal for duration formatting is at https://tc39.es/proposal-intl-duration-format. It throws an error if the numbers are mixed signs.
  • @younies - About NaN and Inf: in the unit converter, it's possible that we could produce these values.
  • @sffc - Arithmetic is out of scope of FixedDecimal and we should keep it that way
  • @echeran - Between signed and unsigned, if one of those things is a lot more common than the other, then that one should probably get the short name.
  • @Manishearth - In computers, unsigned tends to be more common.
  • @sffc - I can see arguments for various naming conventions. FixedDecimal + SignedFixedDecimal seems okay; also UnsignedFixedDecimal + SignedFixedDecimal. We can bikeshed later.
  • @sffc - What can we do to unblock Kartavya?
  • @robertbastian - Just use FixedDecimal for now, don't look at the sign, pass a sign separately
  • @younies - Agree
  • @sffc - How do we do this composition stuff over FFI?
  • @Manishearth - Probably RefCell. It's unsafe to return &mut of inner fields or to return a mix of & and &mut
  • @sffc - Could we use FFI structs? Where one of the fields is an opaque
  • @Manishearth - The structs can have references. We couldn't have APIs that like parse a string into a Signed<FixedDecimal>.
  • @sffc - Could we have FFI SignedFixedDecimal::into_parts, then you mutate, then you ::from_parts?
  • @Manishearth - If you do that, just do RefCell
  • @sffc - Going all the way over FFI, one extra branch for RefCell is fine.
pub struct Signed<T> {
    pub sign: Sign,
    pub value: T,
}

pub enum WithInfinity<T> {
    Infinity,
    Finite(T),
}

pub enum WithNaN<T> {
    NaN,
    N(T),
}

/// Maybe:
pub struct WithCompactExponent<T> {
    pub exponent: u8,
    pub significand: T,
}

pub struct WithScientificExponent<T> {
    pub exponent: i16,
    pub significand: T,
}

pub type SignedFixedDecimal = Signed<FixedDecimal>;

pub type FixedDecimalOrInfinity = WithInfinity<FixedDecimal>;

pub type SignedFixedDecimalOrInfinity = Signed<FixedDecimalOrInfinity>;

Signed { sign: Neg, value: WithInfinity::Infinity } // \neg\inf

pub type SignedFixedDecimalOrInfinityOrNan = WithNaN<SignedFixedDecimalOrInfinity>;

Over FFI we would have to Concrete proposal:

  • FixedDecimal becomes unsigned
  • Add composition structs and enums as shown above
  • Exact names can be bikeshed later
  • Use RefCell for FFI

LGTM: @younies @echeran @sffc @robertbastian @Manishearth

younies avatar Jun 15 '24 15:06 younies

Probably a duplicate of #862? Though, this has a more fleshed out design, so I'd say it superseeds it in a way.

jedel1043 avatar Jun 28 '24 15:06 jedel1043

Considering the design above, what would be the story around the rounding functions if FixedDecimal becomes unsigned? Do we split the rounding functions into unsigned and signed variants? Some of them are duplicates of each other for unsigned numbers; floor <-> truncate, ceil <-> expand, and all the half-variants of those.

Also, would we add rounding methods to FixedDecimalOrInfinity and SignedFixedDecimalOrInfinityOrNan? Otherwise, it would be really painful having to deconstruct FixedDecimalOrInfinity into its parts to get the sign and unsigned decimal, then construct a SignedFixedDecimal just for rounding purposes, then deconstruct again and reconstruct the SignedFixedDecimalOrInfinityOrNan.

jedel1043 avatar Jun 28 '24 15:06 jedel1043

@younies Would be good to get this into 2.0, but we won't block on it. Please prioritize if you need it

sffc avatar Sep 17 '24 18:09 sffc

2024-10-29 discussion:

  // Option 1
  pub struct FixedInteger(SignedFixedDecimal);
  // this is just:
  pub struct FixedInteger(Signed<UnsignedFixedDecimal>);
  // so we should probably have the `Signed` on the outside?
pub struct Signed<T> {
    decimal: T,
    sign: Sign,
}

// Option 1:
pub struct FixedDecimal {
    integer: Integer,
    upper_magnitude: i16,
    lower_magnitude: i16,
}
pub struct Integer {
    digits: SmallVec<[u8; 8]>,
    // keep this an i16 with the current invariants (largest significant digit)
    magnitude: i16,
}

// Option 2:
pub struct FixedDecimal {
    integer: Integer,
    // number of digits to shift down
    shift: u16,
}
pub struct Integer {
    digits: SmallVec<[u8; 8]>,
    // change the definition of `magnitude` to be u16 of the lowest significant digit
    magnitude: u16,
    upper_magnitude: u16,
}

Example 1

123,000,000,000

Current i16 model (magnitude is most significant digit): digits: [1, 2, 3] magnitude: 11 upper_magnitude: 11

Proposed u16 model (magnitude is least significant digit): digits: [1, 2, 3] magnitude: 9 upper_magnitude: 0

Example 2

0012.3400

Current i16 model: digits: [1, 2, 3, 4] magnitude: 1 upper_magnitude: 3 lower_magnitude: -4

Proposed u16 model: digits: [1, 2, 3, 4] magnitude: 2 upper_magnitude: 6 shift: 4

Game Plan

Current PR: Signed

sffc avatar Oct 29 '24 15:10 sffc

Rounding modes:

UnsignedRoundingMode {
    Expand,
    Trunc,
    HalfExpand,
    HalfTrunc,
    HalfEven
}

SignedRoundingMode {
    Unsigned(UnsignedRoundingMode),
    Ceil,
    Floor,
    HalfCeil,
    HalfFloor
}

sffc avatar Oct 29 '24 15:10 sffc

Another composition idea: make the base be called Natural, a natural number including zero

Then we can eventually end up with:

pub struct Natural {
    digits: SmallVec<[u8; 8]>,
    // magnitude of the lowest significant digit
    magnitude: u16,
    // number of leading zeros (TODO: u8 or u16?)
    leading_zeros: u8,
}

pub struct Decimal {
    natural: Natural,
    // rightward shift of the integer
    shift: u16,
}

Just to note, the equations I think end up being:

  • Most significant displayed digit magnitude: magnitude + digits.len() + leading_zeros - shift - 1
  • Least significant displayed digit magnitude: -shift
  • Most significant nonzero digit magnitude: magnitude + digits.len() - shift - 1
  • Least significant nonzero digit magnitude: magnitude - shift

sffc avatar Oct 30 '24 18:10 sffc

Rounding modes:

UnsignedRoundingMode {
    Expand,
    Trunc,
    HalfExpand,
    HalfTrunc,
    HalfEven
}

SignedRoundingMode {
    Unsigned(UnsignedRoundingMode),
    Ceil,
    Floor,
    HalfCeil,
    HalfFloor
}

I think it's fine to use the same rounding mode enum for both signed and unsigned modes, and the reason is because applying floor or ceil on an unsigned number doesn't do anything unexpected or wrong, it's just a bit superfluous.

jedel1043 avatar Nov 02 '24 13:11 jedel1043

Three reasons why I think we should split the rounding mode enum:

  1. Floor and ceiling are defined to work on real numbers (https://en.wikipedia.org/wiki/Floor_and_ceiling_functions), but an unsigned decimal is not a real number: it represents only half of real numbers, with a cliff on the lower end at 0.
  2. It could be a footgun for people to apply floor or ceiling to an unsigned decimal, especially if they have a sign that they just haven't applied to the number yet. I can see this happening in a builder-type pattern, where maybe you have a decimal, a sign, and a rounding mode coming from three different places. You want to apply the sign before the rounding mode.
  3. The signed rounding modes are directly derivable from the unsigned rounding modes. Their implementations are extremely thin: if the sign is positive, call expand() or halfExpand(), and otherwise call trunc() or halfTrunc(), for example. It is surprising if UnsignedFixedDecimal's floor() or ceil() function was not invoked by the wrapper.

sffc avatar Nov 06 '24 23:11 sffc

an unsigned decimal is not a real number

That's just incorrect. Unsigned decimals are a subset of real numbers.

robertbastian avatar Nov 07 '24 06:11 robertbastian

What I meant was, the domain of real numbers is approximated by a signed decimal, but an unsigned decimal approximates only half of the domain.

sffc avatar Nov 07 '24 07:11 sffc

Floor and ceiling are defined to work on real numbers

Yes, but my original argument is that any function defined in ℝ is also defined in ℝ⁺.

The signed rounding modes are directly derivable from the unsigned rounding modes. Their implementations are extremely thin: if the sign is positive, call expand() or halfExpand(), and otherwise call trunc() or halfTrunc(), for example. It is surprising if UnsignedFixedDecimal's floor() or ceil() function was not invoked by the wrapper.

Implementation details shouldn't really matter at all. If we cared about internal readability, we wouldn't have used the IncrementLike trait — which makes the code harder to understand — to improve binary sizes.

It could be a footgun for people to apply floor or ceiling to an unsigned decimal, especially if they have a sign that they just haven't applied to the number yet. I can see this happening in a builder-type pattern, where maybe you have a decimal, a sign, and a rounding mode coming from three different places. You want to apply the sign before the rounding mode.

This is fair, but I think the same footgun can occur in so many different ways (e.g. round first then call multiplied_pow10) that we should try to make the API easier to use instead of trying to protect the user from small footguns.

jedel1043 avatar Nov 09 '24 13:11 jedel1043

For now, we have implemented two enums: UnsignedRoundingMode and SignedRoundingMode. Should I bring this up in the ICU4X meeting, or should we proceed with using two enums as previously discussed?

younies avatar Nov 13 '24 23:11 younies

Yes, but my original argument is that any function defined in ℝ is also defined in ℝ⁺.

I can see your perspective on that point

This is fair, but I think the same footgun can occur in so many different ways (e.g. round first then call multiplied_pow10) that we should try to make the API easier to use instead of trying to protect the user from small footguns.

Whether or not a function exists on a certain type is the easiest, most effective, and cleanest way to nudge developers to do the right thing. I don't see how missing a function makes the API easier or harder to use. If you try calling .ceil() on an UnsignedFixedDecimal and you get a "function not found" error, then you have an opportunity to learn about whether you should be using either .expand() or whether you should convert to a SignedFixedDecimal first. That seems all good and not bad.

Implementation details shouldn't really matter at all. If we cared about internal readability, we wouldn't have used the IncrementLike trait — which makes the code harder to understand — to improve binary sizes.

I wasn't making an argument about internal readability; I was making an argument about being surprising that UnsignedFixedDecimal::ceil is a leaf function that is not used by SignedFixedDecimal::ceil.

sffc avatar Nov 14 '24 01:11 sffc

Whether or not a function exists on a certain type is the easiest, most effective, and cleanest way to nudge developers to do the right thing.

Yes, I think we shouldn't have a ceil function for UnsignedFixedDecimal, nudging users to other functions if possible, but what I'm actually pushing for is just having a simple flat RoundingMode, because having wrapped enums is a bit clunky to use and a bit harder to abstract around than just having a plain enum.

jedel1043 avatar Nov 14 '24 14:11 jedel1043

One more thing: ECMA-402 actually defines an Unsigned Rounding Mode enumeration

https://tc39.es/ecma402/#sec-getunsignedroundingmode

It uses different names:

  • Infinity
  • Zero
  • Half-Infinity
  • Half-Zero
  • Half-Even

So one option is we could have RoundingMode with the 9 familiar names and UnsignedRoundingMode with these 5 names derived from ECMA-402?

sffc avatar Nov 15 '24 07:11 sffc

@sffc Ohh, that could be nice to use actually! That way, the signed version is not clunky to use but you still have type safety for the unsigned version.

EDIT: Actually, I can envision something nice for signed rounding using this enum: a round_positive_negative_bikeshed method where you can mix and match the rounding type that you want to do to positive numbers and negative numbers, which could be a lot easier to understand than having to learn what floor, ceil or trunc means.

jedel1043 avatar Nov 19 '24 20:11 jedel1043

When we figure this out we should also decide if we wish to rename FixedDecimal over FFI.

Manishearth avatar Jan 02 '25 22:01 Manishearth

  • @sffc - Could we have FFI SignedFixedDecimal::into_parts, then you mutate, then you ::from_parts?
  • @Manishearth - If you do that, just do RefCell

I was wrong, RefCell doesn't solve this problem since you need to have mutation methods on the guard object.

We should either:

  • have SignedFixedDecimal be opaque with a .take() method
  • have SignedFixedDecimal be a diplomat struct with public fields

No strong opinion on which one we should pick

Manishearth avatar Jan 02 '25 22:01 Manishearth

Should we just export a single FixedDecimal type on FFI that does everything, and formatters that want unsigned can return an error if the value is negative?

The composed types are primarily for strictness in Rust. We can do runtime checking when we need to.

sffc avatar Jan 02 '25 22:01 sffc

I think that would be fine! Simplifies our life, and the typedness doesn't get us much.

Manishearth avatar Jan 02 '25 22:01 Manishearth

We also need to decide which formatters correspond to which composite types.

sffc avatar Jan 09 '25 19:01 sffc

A simple optimization is that we can store empty strings in DecimalSymbolStrs for "∞" and "NaN", which means the only cost is the two index bytes that get added to the DecimalSymbolStrs header.

(An alternate solution is to move DecimalSymbolStrs to a VarZeroVec that has two optional tail elements, but we still end up adding a length byte so it's not much of a saving)

Manishearth avatar Jan 09 '25 23:01 Manishearth

@younies I'm going to work on adding data for these strings, and potentially adding formatter types

Manishearth avatar Jan 10 '25 00:01 Manishearth

Some discussion between @sffc and I:

All locales except en-US-posix use ∞. Some locales translate NaN to "not a number" in that language.

My argument is that displaying NaN as "not a number" is actually not what people need: NaN is computer jargon and if NaNs are being surfaced to users, they should be seeing that computer jargon, not some localized, potentially ungoogleable "not a number" string. ICU4X users really can't complain if we don't "localize" NaN: they shouldn't be surfacing them to their users anyway.

CLDR data around this is inconsistent: a random handful of languages have a localized NaN, and it severely impacts DecimalSymbolsV2 deduplication:

-decimal/symbols@2, <lookup>, 1316B, 252 identifiers
-decimal/symbols@2, <total>, 2363B, 991B, 49 unique payloads
+decimal/symbols@2, <lookup>, 1358B, 259 identifiers
+decimal/symbols@2, <total>, 4224B, 2152B, 74 unique payloads

I don't think that inconsistent data is particularly useful for most people: A formatter that can localize NaN for ~24 locales and outputs "NaN" for everyone else seems strange.

So we should hardcode "NaN" and "∞" in the formatter, document this behavior, and if people ask for it, we can produce specific InfinityFormatter/NaNFormatter that uses CLDR data.

Shane overall agrees with this approach.

Manishearth avatar Jan 10 '25 00:01 Manishearth

More discussion about the signature of the format function: We shouldn't have a trait for this because fdf.format(decimal.into()) is common.

We can do one of:

  • format(WithNan<WithInfinity<SignedFixedDecimal>>) and have .into() impls for everything
  • multiple format functions for each level of the stack

Shane is okay with either, has a slight preference for the former.

Manishearth avatar Jan 10 '25 00:01 Manishearth

A few more notes of rationalization:

  • FixedDecimalFormatter::format has long supported signed values. Aside from the fixed_decimal composition work, we would have likely agreed that it was fine for FixedDecimal to gain the ability to represent NaN and Infinity, and then FixedDecimalFormatter::format to gain the ability to format them. So, I think it is the right call to jump to that world and keep the format function the way it is (plus adding Infinity and NaN).
  • I would be okay with a future that included the additional format functions, but I would rather start with a single function, and we can add them if we see call sites that would benefit from them.
  • About localizing NaN, the argument @Manishearth made that convinced me was "imagine you go to a web site and it says, 'this item costs not a number.'" That is less useful than "this item costs NaN", since the latter more clearly indicates that there is a bug in the program. Even if a layperson doesn't know what NaN is, it is identifiable with a quick web search.
  • And even if we wanted to localize NaN, the fact that it doubles the data size of FixedDecimalFormatter is enough for me to say that it should be a standalone formatter. I would even support representing it as a "display name" rather than a number format.

sffc avatar Jan 10 '25 04:01 sffc

idea...

pub struct WithNaN<T>(Option<T>);

impl<T: Writeable> Writeable for WithNaN<T> {
    pub fn write_to(&self, sink) {
        match self {
            Some(x) => x.write_to(),
            None => sink.write_str("NaN"),
        }
    }
}

And then the signature of FixedDecimalFormatter

impl FixedDecimalFormatter {
    pub fn format(input: WithNaN<SignedFixedDecimal>) -> WithNaN<FormattedFixedDecimal> { ... }
}

Or maybe keep format the way it is now and make the above function named format_with_nan.

Need to add Infinity support, too.

sffc avatar Jan 10 '25 23:01 sffc

I've seen the type WithNan<WithInfinity<SignedFixedDecimal>> mentioned a couple of times, but if we want to support negative infinity, it would have to be WithNan<Signed<WithInfinity<UnsignedFixedDecimal>>>, right?

jedel1043 avatar Jan 11 '25 00:01 jedel1043

@sffc not really in favor of reusing the type as a writeable.

@jedel1043 yes, it would

Manishearth avatar Jan 11 '25 00:01 Manishearth