
Number, date, time, duration formatters

Open rsheptolut opened this issue 5 years ago • 6 comments

Hi!

How about other formatters supported by messageformat.net and defined by the spec? Especially number, because otherwise there's no way for plural to work as it is supposed to, since now it just spits out the number "as is" in the Invariant Culture, without respect to the regional decimal separator and without any thousand separators.

Although I've looked at https://github.com/andyearnshaw/Intl.js/blob/master/src/11.numberformat.js and it looks pretty hopeless to faithfully translate that into C#.

rsheptolut avatar Sep 24 '18 16:09 rsheptolut

Would it help to be able to pass in a culture?

jeffijoe avatar Sep 24 '18 17:09 jeffijoe

@jeffijoe we're already passing the culture when constructing MessageFormatter, this should be enough, shouldn't it? I'm thinking more like, implement the number formatter as a simple proxy to C# facilities to format numbers. This thing would just use the culture already in MessageFormatter instance to create a CultureInfo and format a number according to the rules, whatever they are. Wouldn't be very up to spec of ICU MessageFormat, probably, but the easiest way to get the job done.
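A minimal sketch of the "simple proxy" idea, assuming the formatter already holds a locale name as a string. The `NumberProxy` class and its `Format` method are hypothetical names used for illustration and are not part of messageformat.net; only `CultureInfo.GetCultureInfo` and `IFormattable.ToString` are real .NET calls.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: not part of messageformat.net.
public static class NumberProxy
{
    // Resolves the locale name to a CultureInfo and lets .NET's own
    // "N" (number) specifier pick the separators and grouping.
    public static string Format(IFormattable value, string locale)
    {
        CultureInfo culture = CultureInfo.GetCultureInfo(locale);
        return value.ToString("N", culture);
    }
}
```

Called as `NumberProxy.Format(1234567.891, "de-DE")`, this would use whatever decimal and thousands separators the runtime's culture data defines for that locale, which is exactly the "not quite ICU, but gets the job done" trade-off described above.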

rsheptolut avatar Sep 24 '18 17:09 rsheptolut

And same thing with date, time and duration formatters, that are defined in messageformat.js but not implemented here. Just do a simple translator to call .NET Framework methods, basically what messageformat.js does, calling Intl for the most part.

rsheptolut avatar Sep 24 '18 17:09 rsheptolut

Oh, I forgot about that! That was added in #14.

Would you be willing to submit a PR?

jeffijoe avatar Sep 24 '18 17:09 jeffijoe

@jeffijoe I will if I decide to commit to messageformat.net for my project. The other thing that scares me is that I'll also have to port PluralRules from messageformat.js for other locales. That's a lot of locales!

rsheptolut avatar Sep 24 '18 17:09 rsheptolut

Locales are busywork so that's why I didn't bother. The most important part for me was the pluralisation constructs

jeffijoe avatar Sep 24 '18 17:09 jeffijoe

Are there really no formatters for numbers, dates, etc.? 😞

Basically, it should be able to do anything that MessageFormat.js can do.

Or not? 🙃

Argument formatting is a core part of the standard, and it's supported by JavaScript libraries, PHP, Java, etc.

glen-84 avatar Dec 20 '22 13:12 glen-84

Are there really no formatters for numbers, dates, etc.? 😞

Basically, it should be able to do anything that MessageFormat.js can do.

That was written at a time when plural and select were everything the JS version supported. But most of this is already something .NET supports natively.

No promises, but I can look into it if I get some time.

jeffijoe avatar Dec 20 '22 14:12 jeffijoe

@glen-84 from the page you linked, they even recommend pre-formatting arguments.

jeffijoe avatar Dec 20 '22 14:12 jeffijoe

@jeffijoe

It's the last of three recommended methods for the argument style, not the type.

{0, number, integer}
    ^ type  ^ style

The predefined styles are the most important, IMO.
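To make the type/style distinction concrete, here is a rough sketch that splits a simple placeholder into its three parts. `FormatArgument` is a hypothetical type for illustration only; it does not handle nested `plural` or `select` bodies and is not how messageformat.net parses messages.

```csharp
using System;

// Hypothetical value type for illustration: name, type and style of one placeholder.
public sealed class FormatArgument
{
    public string Name { get; }
    public string Type { get; }   // empty when the placeholder has no type
    public string Style { get; }  // empty when the placeholder has no style

    private FormatArgument(string name, string type, string style)
    {
        Name = name;
        Type = type;
        Style = style;
    }

    // Parses "{name}", "{name, type}" or "{name, type, style}".
    public static FormatArgument Parse(string placeholder)
    {
        string text = placeholder.Trim();
        if (text.Length < 2 || text[0] != '{' || text[text.Length - 1] != '}')
            throw new FormatException("Expected a placeholder wrapped in braces.");

        // Limit the split to 3 parts so a style may itself contain commas.
        string[] parts = text.Substring(1, text.Length - 2).Split(new[] { ',' }, 3);
        string name = parts[0].Trim();
        string type = parts.Length > 1 ? parts[1].Trim() : string.Empty;
        string style = parts.Length > 2 ? parts[2].Trim() : string.Empty;
        return new FormatArgument(name, type, style);
    }
}
```

For `{0, number, integer}` this yields name `0`, type `number`, style `integer`.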

glen-84 avatar Dec 20 '22 15:12 glen-84

@glen-84 looking at the skeletons, too. It looks like ICU4C and ICU4J use different formatting codes, right? Like j in C for hour vs h in Java? If we support skeletons, we would just forward it to the Dotnet datetime formatter.

What would short, medium, long, full be equivalent to for numbers and date/time? Are date and time different formatters? Is percent just adding a % at the end of the number, or is there more to it?

jeffijoe avatar Dec 23 '22 15:12 jeffijoe

@glen-84 looking at the skeletons, too. It looks like ICU4C and ICU4J use different formatting codes, right? Like j in C for hour vs h in Java? If we support skeletons, we would just forward it to the Dotnet datetime formatter.

I think the j is part of ICU: http://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table

From that page:

Input skeleton symbol It must not occur in pattern or skeleton data. Instead, it is reserved for use in skeletons passed to APIs doing flexible date pattern generation. In such a context, it requests the preferred hour format for the locale (h, H, K, or k), as determined by the preferred attribute of the hours element in supplemental data. In the implementation of such an API, 'j' must be replaced by h, H, K, or k before beginning a match against availableFormats data. Note that use of 'j' in a skeleton passed to an API is the only way to have a skeleton request a locale's preferred time cycle type (12-hour or 24-hour).

I would assume that the symbols should be consistent across languages, for portability of the messages.

If that's true, there may need to be some form of translation between symbols, unless it's somehow possible to make use of the ICU data built into .NET.
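As a rough illustration of what such a translation step might look like for `j` alone: the sketch below guesses the preferred hour symbol from the culture's short time pattern. That heuristic is an assumption made for this example (it is not CLDR's `preferred` attribute, and it never produces `K` or `k`), and `SkeletonHours` is a hypothetical name.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only.
public static class SkeletonHours
{
    // Assumption: a culture whose short time pattern contains 'H' prefers
    // the 24-hour cycle; anything else is treated as 12-hour.
    public static char PreferredHourSymbol(DateTimeFormatInfo format)
    {
        return format.ShortTimePattern.IndexOf('H') >= 0 ? 'H' : 'h';
    }

    // Replaces the input-only symbol 'j' before the skeleton is matched
    // against any pattern data.
    public static string ResolveInputSymbols(string skeleton, DateTimeFormatInfo format)
    {
        return skeleton.Replace('j', PreferredHourSymbol(format));
    }
}
```

With a 24-hour culture, `MMMMdjmm` would become `MMMMdHmm`; with a 12-hour culture, `MMMMdhmm`. Turning the resolved skeleton into an actual pattern still needs locale data.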

It looks like skeletons can get quite complicated, so it may make sense to support predefined styles first, and then look at how skeletons could be supported.

It would likely also be okay to only support a subset of symbols, at least initially. FormatJS does this, due to limitations in ECMA402's Intl API.

What would short, medium, long, full be equivalent to for numbers and date/time?

Examples when using FormatJS, and the en locale.

Dates: (Date.Now)

{value, date} -> 12/23/2022
{value, date, short} -> 12/23/22
{value, date, medium} -> Dec 23, 2022
{value, date, long} -> December 23, 2022
{value, date, full} -> Friday, December 23, 2022

Numbers: (1234567)

{value, number} -> 1,234,567
{value, number, integer} -> 1,234,567
{value, number, ::currency/USD} -> $1,234,567.00
{value, number, percent} -> 123,456,700%

For numbers, string formatting may get close (N0, C, P0), but for dates it more likely requires ICU data.
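A sketch of how the predefined number styles could map onto those standard .NET specifiers. `NumberStyleMap` is a hypothetical name, and using `"N"` when no style is given is an assumption of this example (it prints a fixed number of decimals, so it will not match ICU output for every value).

```csharp
using System;
using System.Globalization;

// Hypothetical mapping for illustration; not messageformat.net's API.
public static class NumberStyleMap
{
    // Maps an ICU predefined number style to a .NET standard format specifier.
    public static string ToSpecifier(string style)
    {
        switch (style)
        {
            case "": return "N";          // assumption: no style means plain number
            case "integer": return "N0";  // grouped, no decimals
            case "currency": return "C";  // culture's currency symbol and digits
            case "percent": return "P0";  // multiplies by 100, no decimals
            default:
                throw new NotSupportedException("Unknown number style: " + style);
        }
    }

    public static string Format(IFormattable value, string style, CultureInfo culture)
    {
        return value.ToString(ToSpecifier(style), culture);
    }
}
```

The exact output for a given locale depends on the culture data the runtime ships with, so results may still differ from FormatJS in the details.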

Are date and time different formatters?

They are different types, yes.

Is percent just adding a % at the end of the number, or is there more to it?

That depends on the locale.


I think the next step would be to see what ICU-related APIs are available in .NET.

glen-84 avatar Dec 23 '22 20:12 glen-84

Just FYI, I am currently in the process of porting RuleBasedNumberFormat line-by-line from Java in ICU4N. It is still very much a work in progress and there are no plans at present to build it up to the point where MessageFormat is fully supported, but it is being factored in to how the pieces fit together. Given the complexity of requirements for this, rather than taking it all back to requirements I am hoping to port it line-by-line and gain enough understanding of the technical details and how much of the spec .NET actually supports in order to refactor the DecimalFormat and RuleBasedNumberFormat into something more .NET-like.

Last year, I also worked on building Java parsing/formatting functionality into .NET in J2N.

This goes beyond business requirements. In .NET, the parsers and formatters are done in a way where the state is marshalled over to the current thread/async task so there doesn't need to be any thread synchronization. They also use a lot of low-level optimizations like pointers, Span<char>, and a dedicated buffer on the current thread to make a subset of the spec really fast. In Java, there are tons of allocations and there is some thread synchronization added because each object manages its own data rather than keeping the data aligned with the current thread. I haven't run any benchmarks yet, but I suspect the .NET approach is at least 3x faster and will scale far better than how it was done in ICU4J due to the extra overhead of each object managing its own data and settings instead of sharing these across threads.

.NET only supports ASCII digits, where ICU4J supports digits for all cultures on certain parts of the string, such as exponent.

.NET only supports a subset of the features in the spec, but it is generally enough for most applications and it is done in a way that will work at scale.

I think there is plenty of room in the .NET ecosystem for both messageformat.net and ICU4N to exist. IMO, it would be best if ICU4N takes care of the more advanced features and messageformat.net is kept as a lightweight alternative for those who only want to extend formatting in .NET. Although I don't object if you want to pull data out of the CLDR to provide additional features, to me it would make more sense from both a performance and maintainability perspective if you simply stick with the features that .NET already supports, where possible. That being said, it does make sense to pool our knowledge in order to see how much of the spec we have and how much we are missing as well as how to map feature per feature from ICU to .NET.

I have just completed porting the Currency class from ICU4J, which is a dependency of ICU4J's DecimalFormat class. This is what supplies the JPY, USD, etc. currency codes to the formatter. .NET doesn't expose this data, but there is a wrinkle with it that I didn't expect - the currency codes are dependent on date, and furthermore there can be more than one currency code in use in a given culture. They are supplied in order of precedence. And this is just one small part of currency parsing/formatting, which is itself just a small part of formatting numbers into strings.

The settings are also complicated by the fact that they are non-orthogonal so it is sometimes difficult to understand which settings apply in which circumstances. For example, specifying the NumberFormatInfo.NumberDecimalDigits option only works in combination with the F or N formats. ICU's solution was to build a fluent API to channel users through the settings, taking away settings that are no longer valid. While I agree with this in principle, there really ought to be an API to use like the one in .NET for those who don't want to deal with extra allocations associated with a fluent API or at least make it so the settings derived from the fluent API can be cached and passed onto the formatter at runtime.
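The non-orthogonality is easy to see in a small experiment. The sketch below clones the invariant culture's `NumberFormatInfo`, sets `NumberDecimalDigits`, and formats the same value with three standard specifiers; `DecimalDigitsDemo` is just an illustrative name.

```csharp
using System;
using System.Globalization;

public static class DecimalDigitsDemo
{
    // Returns the value 1.5 formatted with "N", "F" and "G" after
    // NumberDecimalDigits has been set to 4.
    public static string[] Run()
    {
        var format = (NumberFormatInfo)CultureInfo.InvariantCulture.NumberFormat.Clone();
        format.NumberDecimalDigits = 4;

        double value = 1.5;
        return new[]
        {
            value.ToString("N", format), // uses NumberDecimalDigits
            value.ToString("F", format), // uses NumberDecimalDigits
            value.ToString("G", format), // ignores NumberDecimalDigits
        };
    }
}
```

Here `"N"` and `"F"` pad to four decimals while `"G"` prints the shortest representation, even though all three received the same settings object.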

Duration format - do note this exists in .NET on the TimeSpan class.

  1. https://learn.microsoft.com/en-us/dotnet/standard/base-types/standard-timespan-format-strings
  2. https://learn.microsoft.com/en-us/dotnet/standard/base-types/custom-date-and-time-format-strings

I haven't yet looked at it to see whether there are gaps that .NET doesn't support that ICU does, as this isn't the primary focus at present.
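For reference, a minimal sketch of what a duration formatter built on `TimeSpan` could look like. `DurationFormatter` is a hypothetical name; the `"g"` and `"c"` specifiers are the standard TimeSpan format strings from the first link above. How closely either matches ICU's duration output has not been checked here.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only.
public static class DurationFormatter
{
    // "g" is the culture-sensitive general short format: [-][d:]h:mm:ss[.FFFFFFF]
    public static string Format(TimeSpan duration, CultureInfo culture)
    {
        return duration.ToString("g", culture);
    }

    // "c" is the constant, culture-insensitive format: [-][d.]hh:mm:ss[.fffffff]
    public static string FormatConstant(TimeSpan duration)
    {
        return duration.ToString("c", CultureInfo.InvariantCulture);
    }
}
```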

Date format mapping - we have done a simple mapping in NumberDateFormat. However, it is probably wrong to assume that all dates should go before times separated by a space.

You may also find this document helpful: https://unicode.org/reports/tr35/tr35-numbers.html.

NightOwl888 avatar Dec 24 '22 01:12 NightOwl888

Thanks for your input @NightOwl888.

It's really unfortunate that ICU APIs are not part of the framework.

It doesn't seem practical to implement this "from scratch", so maybe this library should just add basic support using .NET format strings, and perhaps clarify in the README that the behaviour is not 1-to-1 with other ICU libraries.

If ICU4N ever reaches a point where number, date, and other formatting is fully supported, messageformat.net could consider making use of that functionality to better align with ICU standards.

glen-84 avatar Dec 28 '22 13:12 glen-84

It's really unfortunate that ICU APIs are not part of the framework.

Or fortunate, depending on how you look at it. The ICU code isn't always the most efficient and the fact that Microsoft made a high-performance formatter instead of trying to pipe stuff back to the C++ version (and deal with the threading issues) is something we all benefit from.

It doesn't seem practical to implement this "from scratch", so maybe this library should just add basic support using .NET format strings, and perhaps clarify in the README that the behaviour is not 1-to-1 with other ICU libraries.

Well, messageformat.net is a bit more "from scratch" than the direction that ICU4N is heading, being that messageformat.net uses the pluralization data from the CLDR. I wasn't trying to discourage anyone from adding the date, time, and duration features, but was just trying to be helpful on how much work is involved in doing such a thing.

If ICU4N ever reaches a point where number, date, and other formatting is fully supported, messageformat.net could consider making use of that functionality to better align with ICU standards.

Seems a bit strange to do it that way, since messageformat.net is tiny and ICU4N has ~20MB of resource files (the next release will put them into satellite assemblies). IMO, adding the extra features to messageformat.net would be good since they can be combined more easily to build the formatted output.

As for ICU4N, I am still debating how to deal with formatting. .NET provides zero support for extending parsers, and custom formatters don't let you format any date or number types (they are hard coded to ask for a NumberFormatInfo or DateTimeFormatInfo, which are both sealed). And I have learned by porting the documentation comments that the design of MessageFormat is over 20 years old. It really could have been a lot better if it had been designed after generics were a thing.

The way forward in ICU is apparently using fluent APIs all the way (there is a preview of MessageFormatter in the current version which does just that).

Of course, this design still breaks a core design principle of .NET - never store CultureInfo in a field! Otherwise, if you specify CultureInfo.CurrentCulture you will be in for a surprise when your message is formatted in the culture that was current when the formatter was created, not the one that is current now.
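The pitfall can be sketched with two tiny formatters. Both class names are hypothetical and exist only to contrast the two behaviours.

```csharp
using System;
using System.Globalization;

// Captures CurrentCulture once, at construction time.
// Later changes to CultureInfo.CurrentCulture are never seen.
public sealed class CapturingFormatter
{
    private readonly CultureInfo culture = CultureInfo.CurrentCulture;

    public string Format(double value)
    {
        return value.ToString("N1", culture);
    }
}

// Looks up CurrentCulture on every call, so it follows the
// culture of whichever thread or async flow is formatting.
public sealed class LateBoundFormatter
{
    public string Format(double value)
    {
        return value.ToString("N1", CultureInfo.CurrentCulture);
    }
}
```

If the current culture changes between construction and formatting (for example per request in a web application), `CapturingFormatter` keeps using the old separators while `LateBoundFormatter` picks up the new ones.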

For .NET, I am envisioning a static API like the ones Microsoft made that can be used directly by advanced users (accepting a lump of settings and culture data as parameters) with an ICU-like fluent API for novice users or those who want the formatting settings to be "in plain English" in the code.

The ideal solution would extend .NET so message format works with string interpolation and other parts of the framework, but I think there needs to be a discussion with Microsoft to be able to pull that off. It wouldn't be very practical for Microsoft to marry the ICU functionality with the .NET formatters using the underlying ICU library, especially being that there is still an option to "opt out" of ICU in .NET Core.

NightOwl888 avatar Dec 28 '22 14:12 NightOwl888

Or fortunate, depending on how you look at it.

Well, I didn't necessarily suggest that they'd just call into the C version. They have more resources to write a custom implementation if they really wanted to.

I wasn't trying to discourage anyone from adding the date, time, and duration features, but was just trying to be helpful on how much work is involved in doing such a thing.

I appreciate that. It's very clear that nothing localization-related is trivial.

Seems a bit strange to do it that way, since messageformat.net is tiny and ICU4N has ~20MB of resource files (the next release will put them into satellite assemblies). IMO, adding the extra features to messageformat.net would be good since they can be combined more easily to build the formatted output.

I'm not sure what you mean – are you suggesting just using .NET format strings? If so, my point is that this will no longer match ICU standards (there will likely be a lot of little differences, plus differences in formatting symbols, etc.), and implementing all the ICU stuff in this library would be a large undertaking.

I think there needs to be a discussion with Microsoft to be able to pull that off

Let me know if you open any issues in this regard, I'd be happy to 👍 and follow along.

glen-84 avatar Dec 28 '22 18:12 glen-84

I can implement very basic support as mentioned by @glen-84 to at least get the ball rolling. I would just like to know which format codes to use for the various styles.

For example, if I use N for decimal, formatting a decimal input = 69, I get 69.000 for the en culture, is that correct?

jeffijoe avatar Jan 01 '23 14:01 jeffijoe

I don't think there is a decimal style?

I've put some data together (count = 1234567.1234567):

| Message | Implementation | en-US | de-DE | ar |
|---|---|---|---|---|
| {count, number} | FormatJS | 1,234,567.123 | 1.234.567,123 | ١٬٢٣٤٬٥٦٧٫١٢٣ |
| {count, number} | PHP | 1,234,567.123 | 1.234.567,123 | ١٬٢٣٤٬٥٦٧٫١٢٣ |
| {count, number} | C# (Format specifier: "N") | 1,234,567.123 | 1.234.567,123 | 1٬234٬567٫123 |
| {count, number, currency} | FormatJS (::currency/?) | $1,234,567.12 | 1.234.567,12 € | ١٬٢٣٤٬٥٦٧٫١٢ ر.س. |
| {count, number, currency} | PHP (::currency/?) | $1,234,567.12 | 1.234.567,12 € | ١٬٢٣٤٬٥٦٧٫١٢ ر.س. |
| {count, number, currency} | C# (Format specifier: "C") | $1,234,567.12 | 1.234.567,12 € | 1٬234٬567٫12 ر.س. |
| {count, number, integer} | FormatJS | 1,234,567 | 1.234.567 | ١٬٢٣٤٬٥٦٧ |
| {count, number, integer} | PHP | 1,234,567 | 1.234.567 | ١٬٢٣٤٬٥٦٧ |
| {count, number, integer} | C# (Format specifier: "N0") | 1,234,567 | 1.234.567 | 1٬234٬567 |
| {count, number, percent} | FormatJS | 123,456,712% | 123.456.712 % | ١٢٣٬٤٥٦٬٧١٢٪ |
| {count, number, percent} | PHP | 123,456,712% | 123.456.712 % | ١٢٣٬٤٥٦٬٧١٢٪ |
| {count, number, percent} | C# (Format specifier: "P0") | 123,456,712% | 123.456.712 % | 123٬456٬712٪ |

Notes:

  • The .NET localization doesn't seem to localize Arabic numbers.

glen-84 avatar Jan 01 '23 18:01 glen-84

The .NET localization doesn't seem to localize Arabic numbers.

As I had previously mentioned, .NET parsers and formatters only support ASCII digits (that is 0-9).

However, .NET does provide the digits for each culture in the NumberFormatInfo.NativeDigits property.

The .NET formatter doesn't have many moving parts and is open source. I have copied it into J2N in order to add additional functionality to it (although, in our case I just wanted to add a "J" (Java) format to it). Simply copying and pasting and then modifying it to display the native digits isn't very complicated except for the fact that the .NET code has optimizations that may not be supported on older versions of .NET that you may need to conditionally compile for. We opted not to add a dependency on System.Memory, but in hindsight that was a mistake.

Note also that round trip formatting is completely broken before .NET Core 3. Copying the code from a recent release of .NET Core is also a way to make the formatting (rounding) consistent between different .NET flavors.

Here are some of the differences I noticed between the .NET formatter and ICU4J.

  1. ICU4J supports a minimum and maximum number of decimal places, but .NET only supports an exact number of decimal places (and only in the "N" or "F" formats). There is no way to make it simply float the way Java does unless using a custom number pattern.
  2. ICU4J has a way to add the currency code in addition to the currency symbol which .NET is lacking. .NET doesn't even have an API where you can get the currency codes. As previously mentioned, there can be more than one currency code per culture, also.
  3. ICU4J has format strings to specify where to put the currency symbol, how to display the negative format, etc. .NET uses an integer for each of these, so they cannot be customized by the end user.
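The first difference in the list above can be sketched as follows. The standard `"N2"` specifier always prints exactly two decimals, while a custom pattern such as `"#,##0.##"` behaves like a minimum of zero and a maximum of two fraction digits. `DecimalPlaces` is an illustrative name only.

```csharp
using System;
using System.Globalization;

public static class DecimalPlaces
{
    // Standard specifier: exactly two fraction digits, padded with zeros.
    public static string Exact(double value, CultureInfo culture)
    {
        return value.ToString("N2", culture);
    }

    // Custom pattern: '#' after the decimal point is optional, so the
    // output has between zero and two fraction digits.
    public static string UpToTwo(double value, CultureInfo culture)
    {
        return value.ToString("#,##0.##", culture);
    }
}
```

A pattern like `"#,##0.0#"` would correspond to a minimum of one and a maximum of two fraction digits.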

Do note that in .NET we have the decimal format which works for most use cases for currency. However, in Java the formatter uses a BigDecimal type (arbitrarily large number). This implementation seems to be pretty accurate. Its parser seems to accept non ASCII digits, but it looks like they primarily used .NET's built in formatter for displaying the numbers, which are ASCII digits only.

NightOwl888 avatar Jan 01 '23 20:01 NightOwl888

As I had previously mentioned, .NET parsers and formatters only support ASCII digits (that is 0-9).

Apologies, I missed that.

I see that someone wanted to work on implementing this in the runtime (https://github.com/dotnet/runtime/issues/47749), but it was declined for questionable reasons. I guess we can just do the substitution ourselves.

@jeffijoe Let me know if you have any other questions.

glen-84 avatar Jan 14 '23 14:01 glen-84

Thanks for the link. The decision makes sense given the huge effort it must have taken to optimize the parsers and formatters. Still, they have a DigitSubstitution property that is "reserved for future use" and could act as a switch to a slower path, so I don't really understand why they wouldn't add a slow path that could be enabled using that property.

Non-ASCII digits may include surrogate pairs so using them may require up to 2 chars per digit, which complicates the logic a bit and will definitely be slower than simply using ASCII digits. This is fine as long as the ASCII path is optimized so it doesn't have to deal with double-character substitutions.

FYI - The ICU way of doing substitutions is to allow a "numbers=" parameter on the culture string so the numbering system can be defined when the Locale object is created. ICU4N allows this syntax with the UCultureInfo class (although currently numbering system support is a work in progress).

In .NET, the same functionality is allowed only by subclassing CultureInfo and making the subclass set custom values (in this case, setting the NativeDigits and DigitSubstitution) so it can be re-used. You can also create a NumberFormatInfo object (or clone one) and set the properties manually before passing them to a formatter or parser as a one-off. However, as pointed out it is currently pointless because the formatters and parsers don't support these properties. But messageformat.net could.
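A sketch of what doing the substitution in messageformat.net could look like: format with the built-in (ASCII-only) formatter, then swap each ASCII digit for the matching entry in `NumberFormatInfo.NativeDigits`. `NativeDigitFormatter` is a hypothetical name. Because each native digit is a string, surrogate-pair digits are handled without special cases, though this simple version would also rewrite any ASCII digit that happens to appear in a symbol.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper for illustration only.
public static class NativeDigitFormatter
{
    public static string Format(IFormattable value, string specifier, NumberFormatInfo format)
    {
        // Step 1: let .NET produce its usual ASCII-digit output.
        string ascii = value.ToString(specifier, format);

        // Step 2: replace '0'..'9' with the culture's native digits.
        string[] digits = format.NativeDigits; // ten strings, index = digit value
        var result = new StringBuilder(ascii.Length);
        foreach (char c in ascii)
        {
            if (c >= '0' && c <= '9')
                result.Append(digits[c - '0']);
            else
                result.Append(c);
        }
        return result.ToString();
    }
}
```

The `NumberFormatInfo` can come from a culture that already defines native digits, or from a clone with `NativeDigits` set manually as described above.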

NightOwl888 avatar Jan 14 '23 22:01 NightOwl888

@glen-84 if you can get me a map of the various pre-defined styles for dates, times and timestamps and what they map to for the dotnet formatting codes, that would help a lot.

jeffijoe avatar Jan 17 '23 00:01 jeffijoe

Opened a draft PR @glen-84 @NightOwl888 https://github.com/jeffijoe/messageformat.net/pull/33/files

jeffijoe avatar Jan 17 '23 00:01 jeffijoe

@jeffijoe Sorry about the delay, I'll try to get back to you within the next ~2 weeks.

glen-84 avatar Jan 22 '23 14:01 glen-84

@jeffijoe

I don't know if this is going to be feasible. 😢

There's no medium or full date format in .NET, only short and long.

| Message | Implementation | en-US | de-DE | ar |
|---|---|---|---|---|
| {date, date} | FormatJS | 1/1/2000 | 1.1.2000 | ١/١/٢٠٠٠ |
| {date, date} | PHP | Jan 1, 2000 | 01.01.2000 | ٠١/٠١/٢٠٠٠ |
| {date, date} | C# (Format specifier: "d") | 1/1/2000 | 01.01.2000 | 1‏‏/1‏‏/2000 |
| {date, date, full} | FormatJS | Saturday, January 1, 2000 | Samstag, 1. Januar 2000 | السبت، ١ يناير ٢٠٠٠ |
| {date, date, full} | PHP | Saturday, January 1, 2000 | Samstag, 1. Januar 2000 | السبت، ١ يناير ٢٠٠٠ |
| {date, date, full} | C# (Format specifier: "D") | Saturday, January 1, 2000 | Samstag, 1. Januar 2000 | السبت، 1 يناير 2000 |
| {date, date, long} | FormatJS | January 1, 2000 | 1. Januar 2000 | ١ يناير ٢٠٠٠ |
| {date, date, long} | PHP | January 1, 2000 | 1. Januar 2000 | ١ يناير ٢٠٠٠ |
| {date, date, long} | C# (Format specifier: "?") | | | |
| {date, date, medium} | FormatJS | Jan 1, 2000 | 1. Jan. 2000 | ١ يناير ٢٠٠٠ |
| {date, date, medium} | PHP | Jan 1, 2000 | 01.01.2000 | ٠١/٠١/٢٠٠٠ |
| {date, date, medium} | C# (Format specifier: "?") | | | |
| {date, date, short} | FormatJS | 1/1/00 | 1.1.00 | ١/١/٠٠ |
| {date, date, short} | PHP | 1/1/00 | 01.01.00 | ١/١/٢٠٠٠ |
| {date, date, short} | C# (Format specifier: "d") | 1/1/2000 | 01.01.2000 | 1‏‏/1‏‏/2000 |

If you do decide to proceed with something, I can add another table for time formats.

glen-84 avatar Jan 28 '23 15:01 glen-84

@glen-84

So I see 2 viable options (short of doing a full-blown implementation which I won't have time for now):

  1. Don't support the style, just format with g by default and support the :: skeleton syntax
  2. Support just the styles that Dotnet supports (of which I need a mapping), and support the :: skeleton syntax

jeffijoe avatar Jan 30 '23 10:01 jeffijoe

I think a middle-ground would be to:

  • By default, map no style, medium, and short to "d", and full and long to "D".
  • Allow the user to set a custom .NET format string for each style and locale combination.
    • e.g. set configuration or call a method like SetDateStylePattern(DateStyle.MEDIUM, "de-DE", "m.d.y").
    • When formatting a de-DE date with the medium style, it would then use the format m.d.y instead of the default d.
    • This would allow users to set custom formats per locale, and/or add appropriate patterns for medium and full.
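The middle-ground proposal above could be sketched roughly like this. Everything here is hypothetical (the `DateStyle` enum, the `DateStylePatterns` class and its method names); it only illustrates "default mapping plus per-locale override" and is not an existing messageformat.net API.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical names for illustration only.
public enum DateStyle { None, Short, Medium, Long, Full }

public sealed class DateStylePatterns
{
    // Overrides keyed by (locale name, style); locale names are matched exactly.
    private readonly Dictionary<(string Locale, DateStyle Style), string> overrides =
        new Dictionary<(string Locale, DateStyle Style), string>();

    public void SetDateStylePattern(DateStyle style, string locale, string pattern)
    {
        overrides[(locale, style)] = pattern;
    }

    // Default mapping: no style, short and medium use "d"; long and full use "D".
    public string Resolve(DateStyle style, string locale)
    {
        if (overrides.TryGetValue((locale, style), out string pattern))
            return pattern;
        return style == DateStyle.Long || style == DateStyle.Full ? "D" : "d";
    }

    public string Format(DateTime value, DateStyle style, CultureInfo culture)
    {
        return value.ToString(Resolve(style, culture.Name), culture);
    }
}
```

A caller could then register, say, `"dd.MM.yyyy"` as the medium pattern for `de-DE`. Note that override patterns are .NET custom date format strings, where lowercase `m` means minutes and uppercase `M` means month.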

Regarding skeletons, they're not the same as format patterns, so they would not be simple to implement without full locale data.

See an example here. A skeleton like MMMMdjmm actually expands to a pattern, depending on the locale. For en-US, it may expand to MMMM d 'at' h:mm a, for es_ES to d 'de' MMMM, H:mm, etc.

It's also important to note that the skeleton itself should use the ICU characters for skeletons, which may not match those used in .NET.

For this reason, it may be best not to support skeleton syntax at this time.

glen-84 avatar Jan 30 '23 18:01 glen-84

For time:

| Message | Implementation | en-US | de-DE | ar |
|---|---|---|---|---|
| {time, time} | FormatJS | 1:01:01 AM | 01:01:01 | ١:٠١:٠١ |
| {time, time} | PHP | 1:01:01 AM | 01:01:01 | ١:٠١:٠١ |
| {time, time} | C# (Format specifier: "T") | 1:01:01 AM | 01:01:01 | 1:01:01 ص |
| {time, time, full} | FormatJS | 1:01:01 AM UTC | 1:01:01 UTC | ١:٠١:٠١ ص UTC |
| {time, time, full} | PHP | 1:01:01 AM Coordinated Universal Time | 01:01:01 Koordinierte Weltzeit | ١:٠١:٠١ ص التوقيت العالمي المنس |
| {time, time, full} | C# (Format specifier: "?") | | | |
| {time, time, long} | FormatJS | 1:01:01 AM UTC | 1:01:01 UTC | ١:٠١:٠١ ص UTC |
| {time, time, long} | PHP | 1:01:01 AM UTC | 01:01:01 UTC | ١:٠١:٠١ ص UTC |
| {time, time, long} | C# (Format specifier: "?") | | | |
| {time, time, medium} | FormatJS | 1:01:01 AM | 01:01:01 | ١:٠١:٠١ |
| {time, time, medium} | PHP | 1:01:01 AM | 01:01:01 | ١:٠١:٠١ |
| {time, time, medium} | C# (Format specifier: "T") | 1:01:01 AM | 01:01:01 | 1:01:01 ص |
| {time, time, short} | FormatJS | 1:01 AM | 01:01 | ١:٠١ |
| {time, time, short} | PHP | 1:01 AM | 01:01 | ١:٠١ |
| {time, time, short} | C# (Format specifier: "t") | 1:01 AM | 01:01 | 1:01 ص |

glen-84 avatar Jan 30 '23 18:01 glen-84

Set configuration or call a method like SetDateStylePattern(DateStyle.MEDIUM, "de-DE", "m.d.y")

I'd rather not do this as it would require someone to either replace or reach into the formatter library to be able to configure the relevant formatter, or we would need some sort of options bag to pass around. I think mapping as you mentioned would be sufficient. Seeing how inconsistent the behavior is across all these implementations/runtimes makes me feel less bad about it. 😅

jeffijoe avatar Jan 30 '23 18:01 jeffijoe

You already have Pluralizers, so the API could be similar:

var mf = new MessageFormatter();

mf.DateStylePatterns.Add(
    "de-DE",
    style => style switch
    {
        DateStyle.MEDIUM => "m.d.y",
        _ => null // use default format
    });

Just an idea. There may be better designs.

glen-84 avatar Feb 05 '23 13:02 glen-84