icu4x
Complete the set of DateTimeFormat options
- [x] length::Bag
- [ ] components::Bag #645
- [x] #1317
I'm interested in working on this after #107, although I see that @sffc is currently assigned to it. I would like to start with filling out this issue with links to prior art and discussions on the API design.
I changed you to assignee, @gregtatum :)
Here are some of my thoughts on skeletons:
Predefined Common Skeletons
I would like to see fastpaths for the most common skeletons. Here is a good starting point from the Google Closure i18n library:
https://github.com/google/closure-library/blob/master/closure/goog/i18n/datetimepatterns.js
YEAR_FULL
YEAR_FULL_WITH_ERA
YEAR_MONTH_ABBR
YEAR_MONTH_FULL
YEAR_MONTH_SHORT
MONTH_DAY_ABBR
MONTH_DAY_FULL
MONTH_DAY_SHORT
MONTH_DAY_MEDIUM
MONTH_DAY_YEAR_MEDIUM
WEEKDAY_MONTH_DAY_MEDIUM
WEEKDAY_MONTH_DAY_YEAR_MEDIUM
DAY_ABBR
MONTH_DAY_TIME_ZONE_SHORT
I would also add month-day (date with no year/era), as I've seen that as a common feature request.
What do I mean by fastpath? I think we can precompile these patterns as separate data provider keys. For example, "dates/fulldatepattern@1" or maybe "datepatterns/fulldate@1" will return the specified pattern directly, without having to resolve skeletons at runtime.
I think this is important because:
- We don't have to ship DTPG* code for these use cases
- Faster performance
- Easier to add new ICU4X clients
* DTPG refers to DateTimePatternGenerator, the monolith of code in ICU (and soon ICU4X) that resolves datetime skeletons to datetime patterns
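To make the fastpath idea concrete, here is a minimal sketch, assuming a hypothetical `PredefinedSkeleton` enum and hypothetical key names modeled on the "datepatterns/fulldate@1" example above. None of these names exist in ICU4X; the point is only that each predefined choice maps to one precompiled data key, so no skeleton resolution is needed at runtime.

```rust
/// Hypothetical enum of predefined skeletons (a small subset of the
/// Closure list above). Not an ICU4X type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PredefinedSkeleton {
    YearFull,
    YearMonthAbbr,
    MonthDayFull,
    WeekdayMonthDayMedium,
}

impl PredefinedSkeleton {
    /// Returns the (assumed) data provider key that would hold the
    /// precompiled pattern for this skeleton. The key strings are
    /// illustrative only.
    pub fn data_key(self) -> &'static str {
        match self {
            Self::YearFull => "datepatterns/year_full@1",
            Self::YearMonthAbbr => "datepatterns/year_month_abbr@1",
            Self::MonthDayFull => "datepatterns/month_day_full@1",
            Self::WeekdayMonthDayMedium => "datepatterns/weekday_month_day_medium@1",
        }
    }
}
```

A client that only ever formats with `PredefinedSkeleton::MonthDayFull` would request a single key and could omit the DTPG code entirely.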
ECMA-402 Style Skeletons
ICU4C/J use strings to represent skeletons. ECMA-402 uses option bags instead.
I think the ECMA-402 approach is superior, especially for Rust, in large part because a structured bag lets us work out ahead of time what data we might need. We can't resolve skeletons to patterns at compile time since they are locale-dependent, but we can at least tell what symbols* the skeleton might require. If the skeleton only requests date fields, then we don't need to include time symbols, for example.
We might want to supplement the ECMA-402 bag-based skeleton with a proc-macro that compiles the string-based skeleton to a bag.
* See #257 for an explanation of symbols data.
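As a rough illustration of "telling what symbols the skeleton might require", the sketch below uses a hypothetical, simplified bag and a hypothetical `RequiredSymbols` struct. The field names follow ECMA-402 options; the types and the function are assumptions for discussion, not the ICU4X API.

```rust
/// Text widths, as in ECMA-402 ("narrow", "short", "long").
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Text { Narrow, Short, Long }

/// Numeric widths, as in ECMA-402 ("2-digit", "numeric").
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Numeric { TwoDigit, Numeric }

/// Month accepts both numeric and text widths.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Month { TwoDigit, Numeric, Narrow, Short, Long }

/// Hypothetical, simplified components bag.
#[derive(Debug, Default, Clone, Copy)]
pub struct Bag {
    pub era: Option<Text>,
    pub weekday: Option<Text>,
    pub month: Option<Month>,
    pub day: Option<Numeric>,
    pub hour: Option<Numeric>,
}

/// Which symbol tables a formatter built from a bag *might* need.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RequiredSymbols {
    pub era_names: bool,
    pub weekday_names: bool,
    pub month_names: bool,
    pub day_periods: bool,
}

/// Conservative, locale-independent answer: a symbol table is flagged
/// whenever the resolved pattern could reference it.
pub fn required_symbols(bag: &Bag) -> RequiredSymbols {
    RequiredSymbols {
        era_names: bag.era.is_some(),
        weekday_names: bag.weekday.is_some(),
        // Numeric months never need month names.
        month_names: matches!(
            bag.month,
            Some(Month::Narrow) | Some(Month::Short) | Some(Month::Long)
        ),
        // Whether AM/PM is used is locale-dependent, so any hour field
        // might require day period symbols.
        day_periods: bag.hour.is_some(),
    }
}
```

For a bag with only a long month and a numeric day, this reports that month names are needed and that no time symbols are.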
Spec Compliance
The UTS 35 spec for skeleton resolution is currently in a little bit of flux. The biggest outstanding issue is CLDR-13627, but there are others you can find on Jira.
Do as much at compile time as possible
The ICU code for skeleton resolution (DTPG) is a bit of a mess, and it would be nice if we can clean it up. Explore ways to do as much logic at build time (in CldrJsonDataProvider) as possible. Make the data provider give you the most useful form of data possible, such that the code we ship in actual DateTimeFormat should be as simple and clean as possible.
@sffc - can you help me understand what you mean by "fastpaths for common skeletons"?
Are you suggesting to have DataProvider fastpath for Option Bags that match certain skeletons?
Do you also suggest that we don't provide options::Skeleton side by side with options::Bag but instead provide a proc macro that takes skeleton!() and produces options::Bag?
> @sffc - can you help me understand what you mean by "fastpaths for common skeletons"?
> Are you suggesting to have DataProvider fastpath for Option Bags that match certain skeletons?
That's one way of doing it. What I more had in mind though would be a third method of instantiating a DateTimeFormat: datetime style (current), arbitrary components (bag of fields), and predefined components (one selection out of an enum with 10-15 choices). However, we could still fastpath arbitrary skeletons into predefined skeletons if they match.
> Do you also suggest that we don't provide options::Skeleton side by side with options::Bag but instead provide a proc macro that takes skeleton!() and produces options::Bag?
That is what I am putting on the table for further discussion.
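The three ways of instantiating a DateTimeFormat described in this reply, plus the optional fastpath from arbitrary components to a predefined choice, could look roughly like the sketch below. All type and function names here are hypothetical and heavily simplified; they are meant to illustrate the shape of the idea, not a proposed API.

```rust
/// Datetime style (the option that exists today).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Length { Full, Long, Medium, Short }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Month { Numeric, Short, Long }

/// Arbitrary components (a very reduced bag of fields).
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct Components {
    pub year: bool,
    pub month: Option<Month>,
    pub day: bool,
}

/// Predefined components (one selection out of a small enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Predefined { YearMonthAbbr, YearMonthFull, MonthDayAbbr, MonthDayFull }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Options {
    Length(Length),
    Components(Components),
    Predefined(Predefined),
}

/// Returns the predefined choice that an arbitrary bag happens to match.
pub fn fastpath(c: &Components) -> Option<Predefined> {
    match (c.year, c.month, c.day) {
        (true, Some(Month::Short), false) => Some(Predefined::YearMonthAbbr),
        (true, Some(Month::Long), false) => Some(Predefined::YearMonthFull),
        (false, Some(Month::Short), true) => Some(Predefined::MonthDayAbbr),
        (false, Some(Month::Long), true) => Some(Predefined::MonthDayFull),
        _ => None,
    }
}

/// Rewrites `Components` into `Predefined` when a fastpath exists;
/// every other option is returned unchanged.
pub fn normalize(options: Options) -> Options {
    match options {
        Options::Components(c) => fastpath(&c).map_or(options, Options::Predefined),
        other => other,
    }
}
```

Bags that do not match any predefined choice would still go through full skeleton resolution.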
This comment is for my notes as I look into the prior art. I plan on editing it with my research.
DateTimePatternGenerator
Pattern vs Skeleton
Per DateTimePatternGenerator::staticGetSkeleton
- "MMM-dd" and "dd/MMM" are both considered patterns.
- "MMMdd" is considered a skeleton representation of both.
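The pattern-to-skeleton relationship above can be sketched as follows. This is a simplified stand-in for what `staticGetSkeleton` does, not a port of it: the function name and the `CANONICAL_ORDER` constant are assumptions made for this example, and ICU's real field ordering and handling are more complete.

```rust
/// Assumed canonical field order for this sketch only.
const CANONICAL_ORDER: &str = "GyQMwEdaBhHmsz";

/// Derives a skeleton from a pattern by dropping literals and
/// punctuation, then emitting the fields in a canonical order.
pub fn skeleton_from_pattern(pattern: &str) -> String {
    let mut fields: Vec<(char, usize)> = Vec::new();
    let mut in_quote = false;
    for ch in pattern.chars() {
        if ch == '\'' {
            // Quoted text is literal; an escaped quote ('') toggles twice.
            in_quote = !in_quote;
            continue;
        }
        if in_quote || !ch.is_ascii_alphabetic() {
            continue;
        }
        // Count how many times each field letter occurs.
        let pos = fields.iter().position(|&(c, _)| c == ch);
        match pos {
            Some(i) => fields[i].1 += 1,
            None => fields.push((ch, 1)),
        }
    }
    // Stable sort: letters missing from CANONICAL_ORDER keep their
    // first-seen order at the end.
    fields.sort_by_key(|&(c, _)| CANONICAL_ORDER.find(c).unwrap_or(usize::MAX));
    fields.iter().map(|&(c, n)| c.to_string().repeat(n)).collect()
}
```

With this sketch, both "MMM-dd" and "dd/MMM" reduce to "MMMdd".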
ECMA 402
var date = new Date(Date.UTC(2012, 11, 20, 3, 0, 0, 200));
var options = { weekday: 'long', year: 'numeric', month: 'long', day: 'numeric' };
console.log(new Intl.DateTimeFormat('de-DE', options).format(date));
Components table: Table 6: Components of date and time formats
| Internal Slot | Property | Values |
|---|---|---|
| [[Weekday]] | "weekday" | "narrow", "short", "long" |
| [[Era]] | "era" | "narrow", "short", "long" |
| [[Year]] | "year" | "2-digit", "numeric" |
| [[Month]] | "month" | "2-digit", "numeric", "narrow", "short", "long" |
| [[Day]] | "day" | "2-digit", "numeric" |
| [[Hour]] | "hour" | "2-digit", "numeric" |
| [[Minute]] | "minute" | "2-digit", "numeric" |
| [[Second]] | "second" | "2-digit", "numeric" |
| [[TimeZoneName]] | "timeZoneName" | "short", "long" |
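One way to relate this table to UTS 35 field symbols is to map each option value to a run of pattern letters. The sketch below covers only the date fields and uses hypothetical types; the letter widths follow the UTS 35 Date Field Symbol Table (for example, "EEEE" for a wide weekday and "MMM" for an abbreviated month). The emission order is an assumption of this sketch.

```rust
#[derive(Debug, Clone, Copy)]
pub enum Text { Narrow, Short, Long }

#[derive(Debug, Clone, Copy)]
pub enum Numeric { TwoDigit, Numeric }

#[derive(Debug, Clone, Copy)]
pub enum Month { TwoDigit, Numeric, Narrow, Short, Long }

/// Hypothetical bag covering the date rows of the table.
#[derive(Debug, Default, Clone, Copy)]
pub struct Bag {
    pub weekday: Option<Text>,
    pub year: Option<Numeric>,
    pub month: Option<Month>,
    pub day: Option<Numeric>,
}

/// Builds a skeleton string from the bag, emitting fields in the
/// order year, month, weekday, day.
pub fn to_skeleton(bag: &Bag) -> String {
    let mut s = String::new();
    if let Some(y) = bag.year {
        s.push_str(match y {
            Numeric::TwoDigit => "yy",
            Numeric::Numeric => "y",
        });
    }
    if let Some(m) = bag.month {
        s.push_str(match m {
            Month::TwoDigit => "MM",
            Month::Numeric => "M",
            Month::Narrow => "MMMMM",
            Month::Short => "MMM",
            Month::Long => "MMMM",
        });
    }
    if let Some(w) = bag.weekday {
        s.push_str(match w {
            Text::Narrow => "EEEEE",
            Text::Short => "EEE",
            Text::Long => "EEEE",
        });
    }
    if let Some(d) = bag.day {
        s.push_str(match d {
            Numeric::TwoDigit => "dd",
            Numeric::Numeric => "d",
        });
    }
    s
}
```

The options from the JavaScript example above (long weekday, numeric year, long month, numeric day) correspond to the skeleton "yMMMMEEEEd" under this mapping.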
CLDR
Here is the dateTimeFormat information in the CLDR. It is broken into multiple sections. The "Style" format is already implemented in ICU4X.
e.g. for Date:
"dateFormats": {
"full": "EEEE, MMMM d, y",
"long": "MMMM d, y",
"medium": "MMM d, y",
"short": "M/d/yy"
},
Then for DateTime, it references the Time and Date formats, where they are {0} and {1} respectively in the pattern.
"dateTimeFormats": {
"full": "{1} 'at' {0}",
"long": "{1} 'at' {0}",
"medium": "{1}, {0}",
"short": "{1}, {0}",
"availableFormats": { ... }
}
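As a small illustration of how the {0} and {1} placeholders combine, here is a naive sketch. The function name is hypothetical, and the time pattern "h:mm a" used in the usage note is an assumed sample value rather than something taken from the data above.

```rust
/// Substitutes a date pattern for {1} and a time pattern for {0} in a
/// CLDR dateTimeFormats glue pattern.
///
/// This is a naive sequential replacement: it would misbehave if the
/// date pattern itself contained the text "{0}", which a real
/// implementation would need to guard against.
pub fn combine_date_time(glue: &str, date: &str, time: &str) -> String {
    glue.replace("{1}", date).replace("{0}", time)
}
```

For example, combining the "long" glue pattern "{1} 'at' {0}" with the date pattern "MMMM d, y" and a time pattern "h:mm a" yields "MMMM d, y 'at' h:mm a", with the quoted literal left intact for the pattern formatter.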
The "availableFormats" key then matches skeletons to patterns.
{
"Bh": "h B",
"Bhm": "h:mm B",
"Bhms": "h:mm:ss B",
"d": "d",
"E": "ccc",
"EBhm": "E h:mm B",
"EBhms": "E h:mm:ss B",
...
}
The work here, as I understand it, is to generate a Rust representation of the requested skeleton, e.g. "Bh" or "Bhm", find the best matching skeleton in "availableFormats", and return its pattern, e.g. "h B" or "h:mm B".
TODO - Figure out how this differs from the "Component" model.
Field Symbol Table
- http://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table
UTS 35 - Matching skeletons
https://unicode.org/reports/tr35/tr35-dates.html#Matching_Skeletons
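For prototyping purposes, the matching step can be approximated with a simple distance function. The sketch below is deliberately not the full UTS 35 algorithm: the penalty value, the function names, and the flat width-difference metric are assumptions made for this example, and it skips the later step where the matched pattern's field widths are adjusted to the requested ones.

```rust
/// Assumed penalty for a field present on one side but not the other.
const MISSING_FIELD_PENALTY: usize = 1000;

/// Run-length encodes a skeleton: "Bhmm" -> [('B', 1), ('h', 1), ('m', 2)].
fn fields(skeleton: &str) -> Vec<(char, usize)> {
    let mut out: Vec<(char, usize)> = Vec::new();
    for ch in skeleton.chars() {
        if let Some(last) = out.last_mut() {
            if last.0 == ch {
                last.1 += 1;
                continue;
            }
        }
        out.push((ch, 1));
    }
    out
}

/// Small distance for width differences, large distance for missing
/// or extra fields.
fn distance(requested: &str, candidate: &str) -> usize {
    let req = fields(requested);
    let cand = fields(candidate);
    let mut total = 0;
    for &(ch, width) in &req {
        match cand.iter().find(|&&(c, _)| c == ch) {
            Some(&(_, w)) => total += width.abs_diff(w),
            None => total += MISSING_FIELD_PENALTY,
        }
    }
    for &(ch, _) in &cand {
        if !req.iter().any(|&(c, _)| c == ch) {
            total += MISSING_FIELD_PENALTY;
        }
    }
    total
}

/// Returns the (skeleton, pattern) entry closest to the request.
/// On ties, the first entry in `available` wins.
pub fn best_match<'a>(
    requested: &str,
    available: &'a [(&'a str, &'a str)],
) -> Option<(&'a str, &'a str)> {
    available
        .iter()
        .copied()
        .min_by_key(|&(skeleton, _)| distance(requested, skeleton))
}
```

Given the "availableFormats" entries listed earlier, a request for "Bhm" matches exactly, while a request for "dd" falls back to the "d" entry because only the field width differs.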
So here is an update on where I am with this:
I've done a lot of the research on the prior art, and understanding the terminology. I'm currently working on some local prototypes of different pieces of the architecture.
I've done a bit of prototyping around the serialization of the skeletons for use with the data providers. I'm not quite happy with what I have locally, so I'm going to explore the relationship between the components::Bag fields, skeleton representations, and pattern representations.
Today I'm going to start prototyping some of the skeleton matching algorithm, as I think the serialization could be driven by the needs of this algorithm. I don't really have concrete work that's worth showing yet, but I'm hopeful to comment back on some of the API design discussion.
I'd also like to make sure that #451 lands before writing any real code, but that's not blocking some early prototypes.
One thing I am discovering is that the components::Bag does not allow for every configuration of available date field symbols.
I also wrote a script that collects every skeleton available in the CLDR, and how many patterns are available in the locale for it.
https://gist.github.com/gregtatum/1d76bbdb87132f71a969a10f0c1d2d9c#file-2-output-js
I think this issue needs better scoping, and should be broken out into separate issues. I added the discuss label to put it on the meeting agenda. If we don't have time to discuss this week, I'll add my thoughts here.
The C-API for ICU4C provides a list of common skeletons. I think this was interesting enough to document for future work: https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/udat_8h.html
e.g.
| Macro | Skeleton | Description |
|---|---|---|
| UDAT_YEAR | "y" | Constant for date skeleton with year. |
| UDAT_QUARTER | "QQQQ" | Constant for date skeleton with quarter. |
| UDAT_ABBR_QUARTER | "QQQ" | Constant for date skeleton with abbreviated quarter. |
| UDAT_YEAR_QUARTER | "yQQQQ" | Constant for date skeleton with year and quarter. |
| UDAT_YEAR_ABBR_QUARTER | "yQQQ" | Constant for date skeleton with year and abbreviated quarter. |
| UDAT_MONTH | "MMMM" | Constant for date skeleton with month. |
etc.