icu4x Division of concerns and scientific notation in FixedDecimal

Open sffc opened this issue 2 years ago • 14 comments

FixedDecimal is defined as "a core API for representing numbers in a human-readable form appropriate for formatting and plural rule selection". Currently, we support decimal numbers with leading zeros and trailing zeros.

The following two numbers are the same quantity, but they differ in the way they are presented to humans: "1200" and "1.2E3". #1265 added scientific notation parsing support in FixedDecimal::from_str, but without storing that information in the data model of FixedDecimal. The subject of this issue is my assertion that we need to also add "visible exponent" to the data model of FixedDecimal in order to express this difference.

Isn't this a formatting concern? This is an important question. Here is how I draw the line: formatting concerns should be constrained to locale-specific rendering options that do not affect the meaning of the value in context.

Let's look at the other knobs we currently have:

Symbols: This is clearly a formatting concern.
Grouping strategy: This is still a formatting concern, because the eventual grouping separator positions are locale-dependent.
Sign display: Perhaps this should be a data model concern instead of a formatting concern, because the decision on whether to display the sign is a developer decision that affects the rendering equally in all locales, and a number with a forced sign has different meaning than one without ("100" and "+100" mean different things).

Why do we need more info in the FixedDecimal data model? The primary reason is that this information can affect the plural rules and inflections of surrounding words in a sentence. For example, we retain trailing zeros because the plural forms of "1" and "1.0" are different. Likewise, the plural forms of "1200" and "1.2E3" could also be different. For sign display, it's probably not the case that the plural forms of "100" and "+100" are different, but they might result in different vowel sounds / inflections in a sentence, for example.

How about compact notation? We need to be able to express compact notation in the data model as well. Compact notation is orthogonal to scientific notation, so we may be able to store this information in the same field. However, I would like to figure out how to extend to compact notation in a separate issue. We'll also need to think about the impact on currencies, units, etc., which I hope to do when we tackle kitchen sink number format (#275).

Will this make FixedDecimal or FixedDecimalFormat heavier? It's crucial to keep FixedDecimal and FixedDecimalFormat as lightweight as possible. What I've described here is consistent with that goal. We are adding very little business logic, and perhaps a few more symbols to the data file.

Concretely, I would like to do the following:

Add a new field to FixedDecimal called visible_exponent, along with lots of documentation, APIs, etc.
Support the formatting of this field in FixedDecimalFormat, which may involve new locale data
Change FixedDecimal::from_str to retain the visible exponent
Move sign display to be a field in FixedDecimal instead of an option on FixedDecimalFormat. This requires a bit of extra design which I will post in a reply to this issue.
Open a follow-up issue to deal with compact notation and other knobs
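To make the "retain the visible exponent" step concrete, here is a minimal sketch using only the Rust standard library. `ToyParsed`, `is_plain_decimal`, and `parse_retaining_exponent` are hypothetical names invented for illustration; they are not the `FixedDecimal` API, and the accepted syntax (an optional `E`/`e` followed by an integer) is an assumption.

```rust
/// Toy stand-in for a parsed decimal: the digits as written, plus the
/// exponent the author chose to show (hypothetical, not the icu4x type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToyParsed {
    pub mantissa: String,
    pub visible_exponent: Option<i16>,
}

/// Accepts an optional sign, at least one ASCII digit, and at most one '.'.
fn is_plain_decimal(s: &str) -> bool {
    let body = s
        .strip_prefix(|c: char| c == '+' || c == '-')
        .unwrap_or(s);
    let digits = body.chars().filter(|c| c.is_ascii_digit()).count();
    let dots = body.chars().filter(|&c| c == '.').count();
    digits > 0 && dots <= 1 && digits + dots == body.chars().count()
}

/// Parses "1200" or "1.2E3" and keeps the exponent instead of folding it
/// into the digits, so the two inputs stay distinguishable afterwards.
pub fn parse_retaining_exponent(s: &str) -> Option<ToyParsed> {
    let (mantissa, visible_exponent) =
        match s.split_once(|c: char| c == 'E' || c == 'e') {
            Some((m, e)) => (m, Some(e.parse::<i16>().ok()?)),
            None => (s, None),
        };
    if !is_plain_decimal(mantissa) {
        return None;
    }
    Some(ToyParsed {
        mantissa: mantissa.to_string(),
        visible_exponent,
    })
}
```

With this shape, "1200" and "1.2E3" parse to different values even though they denote the same quantity, which is the distinction the data model change is meant to preserve.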

Needs feedback from:

[x] @zbraniecki
[x] @Manishearth
[ ] @younies
[ ] @robertbastian
[x] @echeran

Nov 06 '21 00:11 sffc

How to represent sign display in FixedDecimal?

Currently, FixedDecimal has a boolean field "is_negative". This should change to an enum Sign with three possible values: Positive, Negative, and None. Note that this is different than Signum.

Examples:

FixedDecimal	Sign (stateful)	Signum (computed)
-1	Negative	BelowZero
-0	Negative	NegativeZero
0	None	PositiveZero
+0	Positive	PositiveZero
1	None	AboveZero
+1	Positive	AboveZero

Meanwhile, the SignDisplay enum can still be present, but only as a setter on FixedDecimal. It is not possible to persist the SignDisplay setting, and doing so is not important.
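The table above can be sketched in Rust as follows. `ToyDecimal` is a hypothetical stand-in that stores only a sign and an integer magnitude, and the three `SignDisplay` variants and their behavior are assumptions made for illustration; the point is only that `Sign` is stored while `Signum` is computed, and that `SignDisplay` acts as a one-shot setter.

```rust
/// Stateful sign stored on the number (sketch of the proposal above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Sign {
    None,
    Negative,
    Positive,
}

/// Classification computed from the sign and the magnitude.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Signum {
    BelowZero,
    NegativeZero,
    PositiveZero,
    AboveZero,
}

/// Hypothetical setter options; not persisted on the number.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SignDisplay {
    Auto,
    Always,
    Never,
}

/// Toy stand-in for FixedDecimal: a sign and an integer magnitude.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToyDecimal {
    pub sign: Sign,
    pub magnitude: u64,
}

impl ToyDecimal {
    /// Derives the signum; Sign::None and Sign::Positive behave alike here.
    pub fn signum(&self) -> Signum {
        match (self.sign, self.magnitude == 0) {
            (Sign::Negative, true) => Signum::NegativeZero,
            (Sign::Negative, false) => Signum::BelowZero,
            (_, true) => Signum::PositiveZero,
            (_, false) => Signum::AboveZero,
        }
    }

    /// Rewrites the stored sign once; the SignDisplay value is then dropped.
    pub fn apply_sign_display(&mut self, display: SignDisplay) {
        let negative = self.sign == Sign::Negative;
        self.sign = match display {
            SignDisplay::Auto => if negative { Sign::Negative } else { Sign::None },
            SignDisplay::Always => if negative { Sign::Negative } else { Sign::Positive },
            SignDisplay::Never => Sign::None,
        };
    }
}
```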

Nov 06 '21 04:11 sffc

I'm in favor of moving visible_exponent and signdisplay into FD itself; the above plan looks good to me.

A thing that is important to me is that we have a clear distinction between which properties should belong on FD and which should be a part of FDF's options bag: this does move us closer to that world, which is really nice.

Nov 08 '21 21:11 Manishearth

I'm not clear on the solution to the problem space. I can imagine a future where all three of the following statements are true:

The developer wants to display a given Decimal using scientific notation
The locale used by the user prefers displaying a given Decimal in the given context using scientific notation (while another locale has a different preference)
The user wants to be shown the Decimal using scientific notation

Given that, I draw the line in a different place - the notation has no impact on the objective value of the Decimal. It's a strong position weakly held, because I recognize that a similar argument can be made about "1.50" vs "1.5" having the same value.

My argument is that FixedDecimal differs from Number in exactly this one regard - how many trailing zeros it contains. We can of course have DisplayDecimal that differs in what notation is used, and the problem space can escalate quickly, or we can have mixins, but the crux in my thinking is this:

There is a number. Let's say 1.5. This number is an objective value.
There is a set of information about how to format it. Should it be displayed as 1.5 or 1.500 or 1.5e0, should it be displayed as +1.5, etc. All of those are formatting toggles on a static, unchangeable value.
There is a set of information per-locale that may provide defaults for a given context for how to display such value, or that information may be derived from parsing of the string with that number.
There is a set of user preferences that may impact how the value is to be formatted.

From that thinking come two components - Value and FormattingOptions - required to format that value.

The FormattingOptions can be derived from an input string (parse "1.500" as Value 1.5 and formatting option precision: 3), can come from locale+(context), or can come from user+(context).

That's not unlike... DateTimeFormat! Where the date is absolute, and then there are formatting options. Let's take an example of - display of Month - should Month be displayed as 8, 08, August, Aug, A? We could retrieve that by parsing an input month string. Or we can look at what is the default for a given locale and context (by context we mean here pattern), or we can check what the user prefers.

Encoding MonthWithStyle, akin to FixedDecimal, that stores both the value and its display is an option, but we can just pass the formatting options to DateTimeFormat or FixedDecimalFormat.

The counterargument brought up by Shane is that scientific notation, or compact notation, is not locale-specific, and neither is the choice between a spelled-out month name and a month number. The user may prefer one irrespective of the locale, just like they may prefer scientific or non-scientific notation.

But because the locale may be involved in deciding, we need those formatting options to be available in the Intl context, and the Intl context may provide some defaults.

Another argument is that we need precision to select PluralRule. That problem seems similar to some hypothetical date formatter that needs to know the gender of the month to format the date properly, and the gender of the month depends on whether it is textual or numerical (hypothetical).

Will we then have MonthFormatter that takes MonthWithStyle(Month, Style)? Will it scale?

I'm torn, but I think based on this I'm leaning toward recognizing that Shane is right that precision is not unique and we need more formatting toggles, maybe even locale-independent ones, in the Decimal formatting. And maybe Decimal is a snowflake that is just dominant enough as a value to justify a DecimalWithFormattingOptions struct, one that we likely will not want to replicate with other new types.

If that's the case then I'd approve that model, but not without some hesitation about consistency of the architecture.

Nov 08 '21 22:11 zbraniecki

I think what I'm trying to achieve with FixedDecimal and FixedDecimalFormat is to make FixedDecimalFormat as thin as possible, and focused on exclusively display concerns. The definition of "display concerns" is not clear-cut, but I attempted to draw some lines in the sand in the OP.

The locale used by the user prefers displaying a given Decimal in the given context using scientific notation (while another locale has a different preference)

It could, perhaps. In much the same way that the locale+currency combination can affect the number of trailing zeros.

I anticipate that when we add Full Number Format, there will be mutations applied to the FixedDecimal. So, in some sense, Full Number Format is really two steps:

Mutate the FixedDecimal with changes that concretely affect the meaning of the value
Format the FixedDecimal, applying locale-specific symbols

One of the issues I've been grappling with is that we should really consider splitting this into two or even three different types of FixedDecimal: a "raw" number input, an "intermediate" that has been processed but not formatted, and an "output" that has been fully formatted. However, I've struggled to express that cleanly in either an API or a mental model. So my current proposed approach takes the position that FixedDecimal ought to do its best to serve all three of those use cases.

To be clear, the complete, comprehensive list of what I currently foresee FixedDecimal doing is:

1. Leading and trailing zeros (it already does this)
2. Visible plus or minus sign (proposed above)
3. Scientific notation
4. Compact notation*

Note that all of these except perhaps (4) are generally universally accepted in a decimal number string, like "+1.23E4".

I see the following things as out-of-scope, to be implemented perhaps as a wrapper over FixedDecimal:

Currencies
Measurement units
Percentages

* For compact notation, I would like to do what ECMA-402 and ICU 60+ do here, which is to consider compact notation as a "human readable scientific notation". It could be the case that compact notation is not directly expressed on the FixedDecimal, but is instead a display option for scientific notation, such that formatting "1.2E3" in compact notation produces "1.2 thousand", for example.

Nov 08 '21 23:11 sffc

CC @echeran. I added you to the approvers list.

Feb 26 '22 00:02 sffc

We need to define exactly what is a FixedDecimal and what is its lifecycle.

My mental model has been that a FixedDecimal is a locale-agnostic, structured representation of the human-readable form of a decimal number.

Locale-agnostic: Locale-specific symbols, including grouping separator positions, are not represented in FixedDecimal.
Structured: A FixedDecimal is more than a string; it supports programmatic operations. Plural selection is one of the operations that a FixedDecimal is designed to support.
Human-readable form: In-scope for FixedDecimal are leading/trailing zeros, scientific notation, and visible sign.
Decimal number: FixedDecimal does not support things like hexadecimal or spellout.

With that in mind, I have long seen it as a goal of FixedDecimal to guarantee that it is well-specified as soon as it is constructed, as discussed in #166. Put another way, we should avoid having a "partially-constructed" FixedDecimal.

FromStr for FixedDecimal is well-defined because our string syntax is capable of representing both leading and trailing zeros.

From<u32> for FixedDecimal is well-defined because integer-valued numbers can never have trailing zeros after the decimal separator. A small hiccup, which we have been ignoring, is that integer types cannot represent leading zeros (05u8 == 5u8).

From<f64> for FixedDecimal is not well-defined because:

f64 has values that FixedDecimal cannot represent right now: NaN, Inf, and -Inf. Discussion: #862
f64 cannot represent trailing zeros.

If we further add sign display and scientific notation to FixedDecimal, we make FixedDecimal diverge further from what the core numeric types are able to represent. We need to consider what this means for the lifecycle of a FixedDecimal.
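The divergence from the core numeric types can be observed with the standard library alone. The sketch below uses two hypothetical helper functions that parse a string into a primitive and print it back; whatever the primitive cannot represent is lost on the way.

```rust
/// Parses a decimal string into f64 and prints it back with the default
/// Display formatting. Trailing zeros in the input do not survive.
pub fn round_trip_f64(s: &str) -> Option<String> {
    s.parse::<f64>().ok().map(|x| x.to_string())
}

/// Same idea for u8: leading zeros in the input do not survive.
pub fn round_trip_u8(s: &str) -> Option<String> {
    s.parse::<u8>().ok().map(|x| x.to_string())
}

/// f64 values that have no decimal digit string at all (NaN, Inf, -Inf).
pub fn lacks_decimal_form(x: f64) -> bool {
    !x.is_finite()
}
```

For example, "1.50" comes back as "1.5" and "05" comes back as "5", so a conversion from these types has to choose the missing presentation details on the caller's behalf.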

Mar 23 '22 00:03 sffc

I agree with @sffc's comment above, but I wanted to elaborate on the point about Human-readable form. The question from @zbraniecki is a good one -- what data is essential vs. what is derivative?

When it comes to the question of dates and times, I think about how Joda Time compares to ICU. Joda Time has been the go-to library in Java for making Dates and Times immutable and supporting basic DateTime arithmetic with time zones, etc., but in an ISO/Gregorian calendar (no calendars or other i18n things). It only stored 2 things: 1) the number of milliseconds from the Unix epoch, and 2) the time zone. Everything else can be derived. However, the ICU notion of a DateTime is more inclusive, so it needs more than those 2 fields in order to hold onto all of the essential data needed to cover all of the supported functionality.

So similarly, for some of the existing use cases in which FixedDecimal gets used, the meaning of a number is more expansive than just the mathematical value. The UTS 35 spec for plural rule operands shows that there are different values for the plural operands for different expressions, like 1200000 and 1.2c6, or 1 and 1.0. So we know 1.5 or 1.500 or 1.5e0 are all not the same in the eyes of PluralRules, for example.
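As a rough illustration of why these expressions differ for plural selection, the sketch below derives a subset of the plural operands (i, v, w, f, t, c) from an unsigned sample string such as "1.0" or "1.2c6". The `Operands` struct and `operands` function are hypothetical, the parsing is deliberately loose, and the handling of the `c` exponent (shifting the decimal point to the right) is my reading of the operand definitions rather than code taken from icu4x.

```rust
/// Subset of the plural operands, for unsigned samples only (sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Operands {
    pub i: u64,   // integer digits
    pub v: usize, // count of visible fraction digits, with trailing zeros
    pub w: usize, // count of visible fraction digits, without trailing zeros
    pub f: u64,   // visible fraction digits, with trailing zeros
    pub t: u64,   // visible fraction digits, without trailing zeros
    pub c: usize, // compact decimal exponent
}

/// Treats an empty digit string as zero.
fn to_u64(s: &str) -> Option<u64> {
    if s.is_empty() { Some(0) } else { s.parse::<u64>().ok() }
}

pub fn operands(sample: &str) -> Option<Operands> {
    // Split off an optional compact exponent, e.g. "1.2c6" -> ("1.2", 6).
    let (mantissa, c) = match sample.split_once('c') {
        Some((m, e)) => (m, e.parse::<usize>().ok()?),
        None => (sample, 0),
    };
    let (int_part, frac_part) = mantissa.split_once('.').unwrap_or((mantissa, ""));
    let mut int_digits = String::from(int_part);
    let mut frac_digits = String::from(frac_part);
    // Shift the decimal point right by c places, padding with zeros.
    for _ in 0..c {
        if frac_digits.is_empty() {
            int_digits.push('0');
        } else {
            int_digits.push(frac_digits.remove(0));
        }
    }
    let trimmed = frac_digits.trim_end_matches('0');
    Some(Operands {
        i: to_u64(&int_digits)?,
        v: frac_digits.len(),
        w: trimmed.len(),
        f: to_u64(&frac_digits)?,
        t: to_u64(trimmed)?,
        c,
    })
}
```

Under this sketch, "1" and "1.0" differ in v, and "1200000" and "1.2c6" differ only in c, so a representation that keeps just the mathematical value cannot reproduce either distinction.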

On the possible question of whether we want to store that extra information (leading/trailing zeroes, exponent) in FixedDecimal, or else separately in a formatting options class, the latter seems like an artificial division that serves no benefit to the i18n algorithms which need them together anyways. For example, when parsing plural rule samples, we need to have a locale-agnostic way to represent such numbers after we parse them, and make sure that the result is still capable of conveying all of the distinctions in plural operand values that the input represented. If we instead had to parse a plural rule sample into FixedDecimal + formatting options, only to have to put them together in order to pass to PluralRules.select() to test the sample number against the rule, it would be extra effort here, but no discernible benefit elsewhere. I think info like leading/trailing zeroes and exponent go together for our use cases enough that it makes sense to keep it all together in FixedDecimal.

Mar 23 '22 21:03 echeran

Discussion with @eggrobin and others:

We should have a method to support a strict SampleValue syntax parsing for FixedDecimal
The other syntax, #.#E#, is common in the wild but easy to misuse
Each of these two syntaxes should have its own function for parsing, but neither syntax covers the full set of FixedDecimal functionality
We likely want to have a string representation that is a full superset, but it's unclear what that should be. We could modify UTS 35 to add anything else we need for it to be a superset (sign handling and maybe NaN/Inf), or we could invent our own syntax that uses E for hidden exponent and e or c for visible exponent.

Apr 26 '22 11:04 sffc

We could modify UTS 35 to add anything else we need for it to be a superset (sign handling and maybe NaN/Inf)

Assuming we punt on FixedDecimal NaN and infinity, on which see https://github.com/unicode-org/icu4x/issues/862#issuecomment-1109797198, https://unicode-org.atlassian.net/browse/CLDR-15609 would align the UTS #35 source number with FixedDecimal (as proposed here), and thus make a FixedDecimal uniquely representable by a sampleValue string.

May 03 '22 00:05 eggrobin

Having discussed this with @sffc last week, we found a couple of issues with putting exponents (of both the scientific and the compact kind) in FixedDecimal:

There is no CLDR support for pluralization on scientific notation (150×10⁶), only for compact numbers (150M), so plural case selection would have to fail (or produce broken grammar) for a scientific FixedDecimal.
While formatting a number in scientific notation does not require very complex code, formatting a compact number involves internal pluralization (e.g., in French, 1 million, 2 millions), and we do not want to require that code for formatting standalone non-compact numbers.

Having distinct intermediate representations ScientificDecimal and CompactDecimal built on top of FixedDecimal for scientific and compact decimal formatting thus seems like a better approach for both usability and modularity.
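One way to picture the layering is sketched below. The struct names are prefixed with `Toy` because the field layouts, the use of a `String` as a stand-in for the significand, and the `c`/`e` string syntax are all assumptions made for illustration, not the types described in the linked proposal.

```rust
/// Toy compact decimal: a plain decimal significand plus a compact exponent.
/// The significand is a String here; a real design would wrap FixedDecimal.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToyCompactDecimal {
    pub significand: String,
    pub exponent: u8,
}

/// Toy scientific decimal: same idea, with a signed exponent.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToyScientificDecimal {
    pub significand: String,
    pub exponent: i16,
}

impl ToyCompactDecimal {
    /// Writes "1.2c6"; an exponent of zero is written as the bare significand.
    pub fn to_sample_string(&self) -> String {
        if self.exponent == 0 {
            self.significand.clone()
        } else {
            format!("{}c{}", self.significand, self.exponent)
        }
    }
}

impl ToyScientificDecimal {
    /// Writes "1.5e8", always including the exponent.
    pub fn to_sample_string(&self) -> String {
        format!("{}e{}", self.significand, self.exponent)
    }
}
```

Because the exponent lives in the wrapper, code that formats or pluralizes a plain decimal never has to know about compact or scientific notation.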

See the proposal (and rationale) in https://docs.google.com/document/d/1yjLPwM08Y_gf6-3FhDI9uaB8t_OCjliRtUSoMJ9_r98.

I would like approval on that proposal from:

[x] @Manishearth
[x] @sffc
[x] @zbraniecki
[x] @echeran

May 10 '22 09:05 eggrobin

For the purposes of 1.0, I think we should focus on landing the updated FixedDecimal (with the change to sign-display). CompactDecimal and ScientificDecimal can come soon after, in 1.1.

May 16 '22 15:05 sffc

I added a question in the doc about how to model the relationship between Fixed and the proposed Compact and Scientific. I'm thinking about traits, and thinking about how they might apply to these number types. Of course, we would also want to think about how the formatter in icu_decimal and the APIs in icu_plural would in turn accept inputs that implement the appropriate trait(s). It's worth thinking about, but not something to block getting started. Regardless, I'm on board with this, looks good.

May 17 '22 01:05 echeran

@eggrobin Is there anything left on this ticket?

Jul 28 '22 17:07 sffc

@sffc

Is there anything left on this ticket?

Adding CompactDecimal and ScientificDecimal probably fits into the original scope of this issue (which saw that as an extension to FixedDecimal instead).

It no longer blocks 1.0 though, as discussed.

Aug 19 '22 11:08 eggrobin

I think this is done, because CompactDecimal and ScientificDecimal are landed.

Dec 22 '22 18:12 sffc
