icu4x

Complete the set of DateTimeFormat options

Open zbraniecki opened this issue 5 years ago • 9 comments

  • [x] length::Bag
  • [ ] components::Bag #645
  • [x] #1317

zbraniecki avatar Sep 26 '20 04:09 zbraniecki

I'm interested in working on this after #107, although I see that @sffc is currently assigned to it. I would like to start with filling out this issue with links to prior art and discussions on the API design.

gregtatum avatar Jan 06 '21 15:01 gregtatum

I changed you to assignee, @gregtatum :)

Here are some of my thoughts on skeletons:

Predefined Common Skeletons

I would like to see fastpaths for the most common skeletons. Here is a good starting point from the Google Closure i18n library:

https://github.com/google/closure-library/blob/master/closure/goog/i18n/datetimepatterns.js

  YEAR_FULL
  YEAR_FULL_WITH_ERA
  YEAR_MONTH_ABBR
  YEAR_MONTH_FULL
  YEAR_MONTH_SHORT
  MONTH_DAY_ABBR
  MONTH_DAY_FULL
  MONTH_DAY_SHORT
  MONTH_DAY_MEDIUM
  MONTH_DAY_YEAR_MEDIUM
  WEEKDAY_MONTH_DAY_MEDIUM
  WEEKDAY_MONTH_DAY_YEAR_MEDIUM
  DAY_ABBR
  MONTH_DAY_TIME_ZONE_SHORT

I would also add month-day (date with no year/era), as I've seen that as a common feature request.

What do I mean by fastpath? I think we can precompile these patterns as separate data provider keys. For example, "dates/fulldatepattern@1" or maybe "datepatterns/fulldate@1" will return the specified pattern directly, without having to resolve skeletons at runtime.

I think this is important because:

  1. We don't have to ship DTPG* code for these use cases
  2. Faster performance
  3. Easier to add new ICU4X clients

* DTPG refers to DateTimePatternGenerator, the monolith of code in ICU (and soon ICU4X) that resolves datetime skeletons to datetime patterns
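To make the fastpath idea concrete, here is a minimal sketch, assuming a closed enum of predefined skeletons and an in-memory map standing in for the data provider. `PredefinedSkeleton`, `PrecompiledPatterns`, and the sample patterns are hypothetical names and illustrative values, not ICU4X APIs or generated CLDR data.

```rust
use std::collections::HashMap;

/// Hypothetical closed set of predefined skeletons. The names are borrowed
/// from the Closure list above; this is not an ICU4X type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum PredefinedSkeleton {
    YearMonthAbbr,
    MonthDayAbbr,
    WeekdayMonthDayMedium,
}

/// Stand-in for a data provider: maps (locale, skeleton) to a pattern that
/// was already resolved when the data was built.
pub struct PrecompiledPatterns {
    patterns: HashMap<(&'static str, PredefinedSkeleton), &'static str>,
}

impl PrecompiledPatterns {
    /// Illustrative values only; real data would be generated from CLDR.
    pub fn sample() -> Self {
        let mut patterns = HashMap::new();
        patterns.insert(("en", PredefinedSkeleton::YearMonthAbbr), "MMM y");
        patterns.insert(("en", PredefinedSkeleton::MonthDayAbbr), "MMM d");
        patterns.insert(("en", PredefinedSkeleton::WeekdayMonthDayMedium), "E, MMM d");
        Self { patterns }
    }

    /// A plain lookup: no skeleton matching (DTPG) runs at this point.
    pub fn pattern(
        &self,
        locale: &'static str,
        skeleton: PredefinedSkeleton,
    ) -> Option<&'static str> {
        self.patterns.get(&(locale, skeleton)).copied()
    }
}
```

With this shape, a client that only ever asks for `PredefinedSkeleton::MonthDayAbbr` needs the lookup table but none of the skeleton-resolution code.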

ECMA-402 Style Skeletons

ICU4C/J use strings to represent skeletons. ECMA-402 uses option bags instead.

I think the ECMA-402 approach is superior, especially for Rust, in large part because we can do more logic to figure out what data we might need. We can't resolve skeletons to patterns at compile time since they are locale-dependent, but we can at least tell what symbols* the skeleton might require. If the skeleton only requests date fields, then we don't need to include time symbols, for example.

We might want to supplement the ECMA-402 bag-based skeleton with a proc-macro that compiles the string-based skeleton to a bag.

* See #257 for an explanation of symbols data.
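As a rough illustration of both points (deriving a bag from a string skeleton, and using the bag to decide which symbols are needed), here is a sketch with a heavily reduced, hypothetical `Bag`. It is a plain function rather than a proc-macro, and it models only a handful of field letters; none of these names come from ICU4X.

```rust
/// Hypothetical, heavily reduced components bag. It only records which
/// fields are present, not their lengths.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Bag {
    pub year: bool,
    pub month: bool,
    pub weekday: bool,
    pub day: bool,
    pub hour: bool,
    pub minute: bool,
    pub second: bool,
}

impl Bag {
    /// Builds a bag from a string skeleton such as "yMMMd".
    /// Returns `None` for any field letter this sketch does not model.
    pub fn from_skeleton(skeleton: &str) -> Option<Self> {
        let mut bag = Bag::default();
        for ch in skeleton.chars() {
            match ch {
                'y' => bag.year = true,
                'M' => bag.month = true,
                'E' => bag.weekday = true,
                'd' => bag.day = true,
                'h' | 'H' => bag.hour = true,
                'm' => bag.minute = true,
                's' => bag.second = true,
                _ => return None,
            }
        }
        Some(bag)
    }

    /// True if any date field is requested, so date symbols may be needed.
    pub fn has_date_fields(&self) -> bool {
        self.year || self.month || self.weekday || self.day
    }

    /// True if any time field is requested, so time symbols may be needed.
    pub fn has_time_fields(&self) -> bool {
        self.hour || self.minute || self.second
    }
}
```

For example, `Bag::from_skeleton("yMMMd")` reports date fields but no time fields, so time symbols could be left out of the data request.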

Spec Compliance

The UTS 35 spec for skeleton resolution is currently in a little bit of flux. The biggest outstanding issue is CLDR-13627, but there are others you can find on Jira.

Do as much at compile time as possible

The ICU code for skeleton resolution (DTPG) is a bit of a mess, and it would be nice if we can clean it up. Explore ways to do as much logic at build time (in CldrJsonDataProvider) as possible. Make the data provider give you the most useful form of data possible, such that the code we ship in actual DateTimeFormat should be as simple and clean as possible.

sffc avatar Jan 07 '21 01:01 sffc

@sffc - can you help me understand what you mean by "fastpaths for common skeletons"?

Are you suggesting to have DataProvider fastpath for Option Bags that match certain skeletons?

Do you also suggest that we don't provide options::Skeleton side by side with options::Bag but instead provide a proc macro that takes skeleton!() and produces options::Bag?

zbraniecki avatar Jan 07 '21 01:01 zbraniecki

> @sffc - can you help me understand what you mean by "fastpaths for common skeletons"?
>
> Are you suggesting to have DataProvider fastpath for Option Bags that match certain skeletons?

That's one way of doing it. What I had more in mind, though, is a third way of instantiating a DateTimeFormat, giving three in total: datetime style (current), arbitrary components (bag of fields), and predefined components (one selection out of an enum with 10-15 choices). However, we could still fastpath arbitrary skeletons into predefined skeletons if they match.

> Do you also suggest that we don't provide options::Skeleton side by side with options::Bag but instead provide a proc macro that takes skeleton!() and produces options::Bag?

That is what I am putting on the table for further discussion.
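A rough sketch of how the three entry points could look, assuming a single options enum. All type and variant names here are hypothetical, and the three predefined choices are only a sample of the 10-15 mentioned above.

```rust
/// Datetime style, as already supported.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Length {
    Full,
    Long,
    Medium,
    Short,
}

/// One selection out of a small, closed set of predefined components.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Predefined {
    YearMonth,
    MonthDay,
    WeekdayMonthDay,
}

/// Arbitrary components, reduced to presence flags for this sketch.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct Components {
    pub year: bool,
    pub month: bool,
    pub weekday: bool,
    pub day: bool,
}

impl Components {
    /// Returns the predefined choice this bag matches exactly, if any.
    pub fn as_predefined(&self) -> Option<Predefined> {
        match (self.year, self.month, self.weekday, self.day) {
            (true, true, false, false) => Some(Predefined::YearMonth),
            (false, true, false, true) => Some(Predefined::MonthDay),
            (false, true, true, true) => Some(Predefined::WeekdayMonthDay),
            _ => None,
        }
    }
}

/// The three ways of instantiating a formatter.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Options {
    Length(Length),
    Components(Components),
    Predefined(Predefined),
}

impl Options {
    /// Fastpath: rewrites an arbitrary bag into a predefined choice when
    /// it matches one; everything else is returned unchanged.
    pub fn normalize(self) -> Options {
        match self {
            Options::Components(c) => match c.as_predefined() {
                Some(p) => Options::Predefined(p),
                None => Options::Components(c),
            },
            other => other,
        }
    }
}
```

A bag with only month and day set normalizes to `Options::Predefined(Predefined::MonthDay)`, while a bag with only a year stays on the arbitrary-components path.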

sffc avatar Jan 07 '21 09:01 sffc

This comment is for my notes as I look into the prior art. I plan on editing it with my research.

DateTimePatternGenerator

Pattern vs Skeleton

Per DateTimePatternGenerator::staticGetSkeleton

  • "MMM-dd" and "dd/MMM" are both considered patterns.
  • "MMMdd" is considered a skeleton representation of both.
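The distinction can be sketched as a function that drops literals from a pattern and emits the remaining fields in a fixed order. This is a simplified illustration, not a port of `staticGetSkeleton`: the function name `skeleton_of` and the `FIELD_ORDER` constant are assumptions, and the ordering covers only a few field letters.

```rust
/// Field letters in the order this sketch emits them. This is a simplified
/// ordering chosen for illustration, not ICU's full field table.
const FIELD_ORDER: &str = "GyQMEdahHmsz";

/// Reduces a pattern to a skeleton: quoted literals and punctuation are
/// dropped, repeated letters are grouped, and fields are reordered.
pub fn skeleton_of(pattern: &str) -> String {
    let mut fields: Vec<(char, usize)> = Vec::new();
    let mut in_quote = false;
    for ch in pattern.chars() {
        if ch == '\'' {
            // Toggle quoted-literal mode; the quote itself is not a field.
            in_quote = !in_quote;
            continue;
        }
        if in_quote || !ch.is_ascii_alphabetic() {
            continue;
        }
        // Count occurrences of each field letter.
        match fields.iter().position(|&(c, _)| c == ch) {
            Some(i) => fields[i].1 += 1,
            None => fields.push((ch, 1)),
        }
    }
    // Letters missing from FIELD_ORDER sort last, keeping their input order.
    fields.sort_by_key(|(c, _)| FIELD_ORDER.find(*c).unwrap_or(usize::MAX));
    fields
        .iter()
        .map(|(c, n)| c.to_string().repeat(*n))
        .collect()
}
```

Under these assumptions, both `skeleton_of("MMM-dd")` and `skeleton_of("dd/MMM")` produce `"MMMdd"`.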

ECMA 402

MDN DateTimeFormat options

var date = new Date(Date.UTC(2012, 11, 20, 3, 0, 0, 200));
var options = { weekday: 'long', year: 'numeric', month: 'long', day: 'numeric' };
console.log(new Intl.DateTimeFormat('de-DE', options).format(date));

Components table: Table 6: Components of date and time formats

Internal Slot      Property        Values
[[Weekday]]        "weekday"       "narrow", "short", "long"
[[Era]]            "era"           "narrow", "short", "long"
[[Year]]           "year"          "2-digit", "numeric"
[[Month]]          "month"         "2-digit", "numeric", "narrow", "short", "long"
[[Day]]            "day"           "2-digit", "numeric"
[[Hour]]           "hour"          "2-digit", "numeric"
[[Minute]]         "minute"        "2-digit", "numeric"
[[Second]]         "second"        "2-digit", "numeric"
[[TimeZoneName]]   "timeZoneName"  "short", "long"
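To relate the table to string skeletons, here is a minimal sketch that maps four of the rows (weekday, year, month, day) to field letters. The letter counts follow the usual UTS 35 field lengths (3 letters abbreviated, 4 wide, 5 narrow); the types and the emit order are assumptions for illustration, not the ICU4X `components::Bag`.

```rust
/// Text widths from the table ("narrow", "short", "long").
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Text {
    Narrow,
    Short,
    Long,
}

/// Numeric widths from the table ("2-digit", "numeric").
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Numeric {
    TwoDigit,
    Numeric,
}

/// Month accepts both numeric and text widths.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Month {
    TwoDigit,
    Numeric,
    Narrow,
    Short,
    Long,
}

/// Hypothetical bag covering four rows of the table.
#[derive(Debug, Default, Clone, Copy)]
pub struct Bag {
    pub weekday: Option<Text>,
    pub year: Option<Numeric>,
    pub month: Option<Month>,
    pub day: Option<Numeric>,
}

impl Bag {
    /// Emits a skeleton string in year, month, weekday, day order.
    pub fn to_skeleton(&self) -> String {
        let mut s = String::new();
        if let Some(y) = self.year {
            s.push_str(match y {
                Numeric::TwoDigit => "yy",
                Numeric::Numeric => "y",
            });
        }
        if let Some(m) = self.month {
            s.push_str(match m {
                Month::TwoDigit => "MM",
                Month::Numeric => "M",
                Month::Short => "MMM",
                Month::Long => "MMMM",
                Month::Narrow => "MMMMM",
            });
        }
        if let Some(w) = self.weekday {
            s.push_str(match w {
                Text::Short => "EEE",
                Text::Long => "EEEE",
                Text::Narrow => "EEEEE",
            });
        }
        if let Some(d) = self.day {
            s.push_str(match d {
                Numeric::TwoDigit => "dd",
                Numeric::Numeric => "d",
            });
        }
        s
    }
}
```

The options from the MDN example above (long weekday, numeric year, long month, numeric day) would map to the skeleton `"yMMMMEEEEd"` in this sketch.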

Gecko implementation

CLDR

Here is the dateTimeFormat information in the CLDR. It is broken into multiple sections. The "Style" format is already implemented in ICU4X.

e.g. for Date:

            "dateFormats": {
              "full": "EEEE, MMMM d, y",
              "long": "MMMM d, y",
              "medium": "MMM d, y",
              "short": "M/d/yy"
            },

Then for DateTime, it references the Time and Date formats, where they are {0} and {1} respectively in the pattern.

            "dateTimeFormats": {
              "full": "{1} 'at' {0}",
              "long": "{1} 'at' {0}",
              "medium": "{1}, {0}",
              "short": "{1}, {0}",
              "availableFormats": { ... }
            }
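The substitution can be sketched in a few lines. The function name `combine` is hypothetical, and the time pattern used in the example ("h:mm a") is an assumed value, since only the date patterns are quoted above.

```rust
/// Substitutes the date pattern for `{1}` and the time pattern for `{0}`
/// in a dateTimeFormats glue pattern. The result is still a pattern: quoted
/// literals such as 'at' are left for the formatter to interpret.
///
/// This sketch assumes the date pattern does not itself contain "{0}".
pub fn combine(glue: &str, date_pattern: &str, time_pattern: &str) -> String {
    glue.replace("{1}", date_pattern)
        .replace("{0}", time_pattern)
}
```

For example, `combine("{1} 'at' {0}", "MMMM d, y", "h:mm a")` yields `"MMMM d, y 'at' h:mm a"`.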

The "availableFormats" key then matches skeletons to patterns.

              {
                "Bh": "h B",
                "Bhm": "h:mm B",
                "Bhms": "h:mm:ss B",
                "d": "d",
                "E": "ccc",
                "EBhm": "E h:mm B",
                "EBhms": "E h:mm:ss B",
                ...
              }

The work here, as I understand it, is to generate a Rust representation of the requested skeleton, e.g. "Bh" or "Bhm", find the best matching available skeleton, and return its pattern, e.g. "h B" or "h:mm B".
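A deliberately simplified sketch of that lookup follows. It is not the UTS 35 matching algorithm: it only accepts candidates with exactly the same set of field letters and ranks them by the total difference in field lengths. It also does not adjust the returned pattern to the requested lengths. The names `field_counts`, `distance`, and `best_pattern` are hypothetical.

```rust
/// Counts how many times each field letter occurs in a skeleton,
/// sorted by letter so two skeletons can be compared position by position.
fn field_counts(skeleton: &str) -> Vec<(char, usize)> {
    let mut counts: Vec<(char, usize)> = Vec::new();
    for ch in skeleton.chars() {
        match counts.iter().position(|&(c, _)| c == ch) {
            Some(i) => counts[i].1 += 1,
            None => counts.push((ch, 1)),
        }
    }
    counts.sort();
    counts
}

/// Returns `None` if the two skeletons use different field letters,
/// otherwise the summed difference in field lengths.
fn distance(requested: &str, candidate: &str) -> Option<usize> {
    let (a, b) = (field_counts(requested), field_counts(candidate));
    if a.len() != b.len() {
        return None;
    }
    let mut total = 0;
    for (&(ca, na), &(cb, nb)) in a.iter().zip(b.iter()) {
        if ca != cb {
            return None;
        }
        total += na.abs_diff(nb);
    }
    Some(total)
}

/// Picks the pattern whose skeleton is closest to the requested one.
/// `available` holds (skeleton, pattern) pairs, as in "availableFormats".
pub fn best_pattern<'a>(
    requested: &str,
    available: &[(&'a str, &'a str)],
) -> Option<&'a str> {
    available
        .iter()
        .filter_map(|&(skeleton, pattern)| {
            distance(requested, skeleton).map(|d| (d, pattern))
        })
        .min_by_key(|&(d, _)| d)
        .map(|(_, pattern)| pattern)
}
```

Given the availableFormats entries listed above, a request for "Bhm" returns "h:mm B", a request for "dd" falls back to the "d" entry, and a request for "yMd" finds nothing because no listed skeleton has that field set.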

TODO - Figure out how this differs from the "Component" model.

Field Symbol Table

  • http://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table

UTS 35 - Matching skeletons

https://unicode.org/reports/tr35/tr35-dates.html#Matching_Skeletons

gregtatum avatar Jan 13 '21 17:01 gregtatum

So here is an update on where I am with this:

I've done a lot of the research on the prior art, and understanding the terminology. I'm currently working on some local prototypes of different pieces of the architecture.

I've done a bit of prototyping around the serialization of the skeletons for use with the data providers. I'm not quite happy with what I have locally, so I'm going to explore the relationship between the components::Bag fields, skeleton representations, and pattern representations.

Today I'm going to start prototyping some of the skeleton matching algorithm, as I think the serialization could be driven by the needs of this algorithm. I don't have concrete work worth showing yet, but I hope to comment back on some of the API design discussion.

I'd also like to make sure that #451 lands before writing any real code, but that's not blocking some early prototypes.

gregtatum avatar Jan 27 '21 15:01 gregtatum

One thing I am discovering is that the components::Bag does not allow for every configuration of available date field symbols.

I also wrote a script that collects every skeleton available in the CLDR, and how many patterns are available in the locale for it.

https://gist.github.com/gregtatum/1d76bbdb87132f71a969a10f0c1d2d9c#file-2-output-js

gregtatum avatar Jan 27 '21 15:01 gregtatum

I think this issue needs better scoping and should be broken out into separate issues. I added the discuss label to put it on the meeting agenda. If we don't have time to discuss it this week, I'll add my thoughts here.

gregtatum avatar Jan 28 '21 18:01 gregtatum

The C API for ICU4C provides a list of common skeletons. I think this is interesting enough to document for future work: https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/udat_8h.html

e.g.

  #define UDAT_YEAR "y"                   // Constant for date skeleton with year.
  #define UDAT_QUARTER "QQQQ"             // Constant for date skeleton with quarter.
  #define UDAT_ABBR_QUARTER "QQQ"         // Constant for date skeleton with abbreviated quarter.
  #define UDAT_YEAR_QUARTER "yQQQQ"       // Constant for date skeleton with year and quarter.
  #define UDAT_YEAR_ABBR_QUARTER "yQQQ"   // Constant for date skeleton with year and abbreviated quarter.
  #define UDAT_MONTH "MMMM"               // Constant for date skeleton with month.

etc.

gregtatum avatar Apr 16 '21 14:04 gregtatum