Split date time data into smaller data keys?

Open zbraniecki opened this issue 5 years ago • 23 comments

As I'm implementing Dates in DataProvider and testing them using DateTimeFormat, I have some questions about how we should structure that.

Generally, the data in question looks like this: https://github.com/unicode-cldr/cldr-dates-modern/tree/master/main/en

It has (per locale):

  • display names for Months, Weekdays, DayPeriods, Quarters, Eras
  • patterns for time, date, and date_time
  • list of best patterns for skeletons
  • interval patterns
  • time zone names
  • relative display names

For now, we need:

  • display names for months, weekdays, day periods
  • patterns for time, date and date_time
  • list of best patterns for skeletons

Display names come in different:

  • contexts ("format" and "stand_alone")
  • widths ("abbreviated", "narrow", "short", "wide")

but they also can be for different calendar systems (I see at least "generic" and "gregorian").

As far as I understand DataProvider, we have some flexibility in what we request and what we get in response.

We could, for example, put "months/format/narrow" as a variant in DataEntry and gregory in DataKey and get just a list of month names in "format" and "narrow" for "gregory" calendar.

Or, we can just ask for "gregory" and set no variant in DataEntry and get all display names for all contexts and all widths.
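
The two request shapes described above can be sketched roughly as follows. This is only an illustration: the `DataRequest` struct, its fields, and the `dates/gregory` key string are hypothetical stand-ins, not the actual DataProvider API.

```rust
// Hypothetical request type for illustration; not the real DataProvider API.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DataRequest {
    /// Assumed key naming; the calendar system lives in the key.
    pub key: &'static str,
    /// Optional narrowing of the response (context and width).
    pub variant: Option<&'static str>,
    pub locale: String,
}

/// Option A: ask only for narrow month names in the "format" context.
pub fn narrow_months_request(locale: &str) -> DataRequest {
    DataRequest {
        key: "dates/gregory",
        variant: Some("months/format/narrow"),
        locale: locale.to_string(),
    }
}

/// Option B: ask for every display name (all contexts, all widths).
pub fn all_names_request(locale: &str) -> DataRequest {
    DataRequest {
        key: "dates/gregory",
        variant: None,
        locale: locale.to_string(),
    }
}
```

Both requests share the same key; only the variant decides how much data comes back.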

@sffc - what are your thoughts on that? What should a request/response look like?

zbraniecki avatar Sep 20 '20 07:09 zbraniecki

Based on @sffc's comments in that PR, it also seems like he'd suggest we split the dates data into multiple keys:

  • datesym (month display names, weekday display names)
  • timesym (day period display names)
  • patterns (for style, and for skeleton)

I assume that for things like time zone names, interval patterns and relative display names, he's suggesting we also create separate keys.

zbraniecki avatar Sep 20 '20 08:09 zbraniecki

First, please leave skeletons (including available_formats) aside. Those need their own discussion.

Responses to the several sub-questions:

DataProvider Flexibility

As far as I understand DataProvider, we have some flexibility in what we request and what we get in response.

Right. The intended goal of the data provider keys is not to mimic CLDR. It's to return data that is easy and efficient to consume at runtime. Mapping from CLDR format to ICU4X format happens in CldrJsonDataProvider. I want to make sure we are aligned on that goal.

Number of Data Keys

Regarding the breakdown of the data keys: I feel strongly that there should be a minimum of three distinct data keys for DateTimeFormat, which Zibi listed in https://github.com/unicode-org/icu4x/issues/257#issuecomment-695760621. Here are my reasons:

  1. A component can request the data it needs and nothing more.
  2. The same key can be shared by multiple code paths; for example, datesym can be used by both the datetimestyle path and the skeleton path.
  3. Smaller data hunks mean that more locales will be able to point to shared hunks and make data smaller on disk.
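
A toy sketch of the third point, assuming a store that deduplicates identical hunks under the same key. The function name, the tuple layout, and the sample strings are all invented for illustration; real payloads are structured data, not strings.

```rust
use std::collections::HashSet;

/// Total bytes stored once identical (key, data) hunks are shared.
/// Each entry is (locale, key, data); all three are illustrative strings.
pub fn deduplicated_size(payloads: &[(&str, &str, &str)]) -> usize {
    // The locale is dropped on purpose: two locales with the same data
    // under the same key point at a single stored hunk.
    let unique: HashSet<(&str, &str)> = payloads
        .iter()
        .map(|&(_locale, key, data)| (key, data))
        .collect();
    unique.iter().map(|(_key, data)| data.len()).sum()
}

/// Split layout: weekday names and patterns live under separate keys.
pub fn split_layout() -> Vec<(&'static str, &'static str, &'static str)> {
    vec![
        ("en-US", "weekday", "Mon,Tue"),
        ("en-GB", "weekday", "Mon,Tue"), // identical, so it is shared
        ("en-US", "patterns", "M/d/y"),
        ("en-GB", "patterns", "dd/MM/y"),
    ]
}

/// Combined layout: one big hunk per locale, so nothing can be shared.
pub fn combined_layout() -> Vec<(&'static str, &'static str, &'static str)> {
    vec![
        ("en-US", "all", "Mon,Tue|M/d/y"),
        ("en-GB", "all", "Mon,Tue|dd/MM/y"),
    ]
}
```

With these made-up inputs the split layout stores the weekday names once, so its deduplicated size is smaller than that of the combined layout.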

Format Widths for Display Names

I've been thinking about the format widths (long/short/narrow) for display names and whether they belong in the data requests (key/entry) or not. I'm thinking that no, they don't belong in the data request; long/short/narrow should be together in the same data key and entry, and they should be a leaf of the struct. My reasoning:

  1. The width trio (long/short/narrow) tends to be strongly correlated; if one changes, so do the others. Therefore, considering the trio as a single key does not decrease the ability to create shared data hunks.
  2. It is not uncommon to fall back between widths; for example, if narrow is not present, then we fall back to short.
  3. In order for a split data key to work as a method to reduce code size, we would need to track the requested width through much of the call stack in a way suitable for code slicing. I think this is too fine-grained to track.
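
A minimal sketch of what "widths as a leaf of the struct" with fallback could look like. The `Width` and `WidthTrio` names are hypothetical, and the short-to-long fallback is an extra assumption; only narrow-to-short is mentioned above.

```rust
/// Hypothetical width selector; not an actual ICU4X type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Width {
    Long,
    Short,
    Narrow,
}

/// All three widths travel together in one data entry.
pub struct WidthTrio<T> {
    pub long: T,
    pub short: Option<T>,
    pub narrow: Option<T>,
}

impl<T> WidthTrio<T> {
    /// Resolves a width, falling back narrow -> short -> long.
    /// (Falling back from short to long is an assumption of this sketch.)
    pub fn get(&self, width: Width) -> &T {
        match width {
            Width::Long => &self.long,
            Width::Short => self.short.as_ref().unwrap_or(&self.long),
            Width::Narrow => self
                .narrow
                .as_ref()
                .or(self.short.as_ref())
                .unwrap_or(&self.long),
        }
    }
}
```

Because the fallback happens inside the payload, the caller never needs a second data request when a width is missing.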

Format Widths for Patterns

The situation is a bit different for pattern widths (dateStyle/timeStyle). None of the three aforementioned conditions apply here: the width patterns are not strongly correlated; they do not fall back; and we should be able to slice the data very early in the call stack.

Therefore, I think it makes sense to put format widths into either the data key or the data entry.

I would like to make this judgement after we have the code written and we can sit down and look at the concrete implications to the data bundles.

Calendar Systems

I'm arriving at the conclusion that calendar systems should be in the data key. Reasons:

  1. The data for each calendar system is unique. Not all systems are represented by 12 month names. When we need a new struct (data schema), we need a new data key.
  2. The code needed to run calendar systems needs to be written for each individual calendar, so the data keys will correspond to those modules of code.

We may at some point want to add an all-in-one calendar data key, but this is not relevant right now. We should cross that bridge later when we add calendar math to ICU4X.

sffc avatar Sep 20 '20 20:09 sffc

Thank you! This is so helpful!

zbraniecki avatar Sep 21 '20 01:09 zbraniecki

Because of reasons I discussed in the new doc datetime-input.md, I now believe that we should not have different data keys for different calendar systems. We should pool the essential symbols for calendar systems into the same data key. We could consider filtering the data in the data entry or the offline build tool.

However, I do still feel that we should have different data keys for patterns, date symbols, and time symbols. We should probably go further and split the date symbols down into eras, months, day periods, and time zone names.

sffc avatar Oct 15 '20 06:10 sffc

Concretely, I envision DateTimeFormat using the following separate, orthogonal keys, which cover all formatting except for time zone (which I want to leave to a separate discussion):

Display Names

  1. datetime/era@1 collects era display names covering all desired calendar systems
  2. datetime/cycyear@ covers cyclic year names
  3. datetime/quarter@1 covers quarter names
  4. datetime/month@1 covers month names
  5. datetime/weekday@1 covers weekday names
  6. datetime/dayperiod@1 covers the a and b day periods (am, noon, pm, and midnight)
  7. datetime/flexperiod@1 covers the B day periods (in the morning, in the afternoon, …)

Format Patterns

  1. datetime/patterns@1 covers long/medium/short date patterns, time patterns, and glue patterns
  2. datetime/skeletons@1 covers availableFormats: the mapping from skeletons to patterns

Why more keys instead of fewer keys? I listed reasons in https://github.com/unicode-org/icu4x/issues/257#issuecomment-695830848, but to reiterate:

  1. A component can request the data it needs and nothing more.
  2. The same key can be shared by multiple code paths; for example, month names can be used by both the datetimestyle path and the skeleton path.
  3. Smaller data hunks mean that more locales will be able to point to shared hunks and make data smaller on disk.
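
The key names above appear to follow a `category/name@version` shape. Assuming that shape, a small parser could look like the sketch below; `ParsedKey` and `parse_key` are hypothetical names, not part of the data provider.

```rust
/// Hypothetical parsed form of a key such as "datetime/month@1".
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ParsedKey<'a> {
    pub category: &'a str,
    pub name: &'a str,
    pub version: u32,
}

/// Parses "category/name@version"; returns None if any part is missing
/// or the version is not a number.
pub fn parse_key(s: &str) -> Option<ParsedKey<'_>> {
    let (path, version) = s.split_once('@')?;
    let (category, name) = path.split_once('/')?;
    if category.is_empty() || name.is_empty() {
        return None;
    }
    let version = version.parse().ok()?;
    Some(ParsedKey { category, name, version })
}
```

Since each key carries its own version, a schema change to one key (say, month names) would not force a version bump on the others.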

sffc avatar Oct 31 '20 07:10 sffc

I'm convinced. This looks like a great design. One additional benefit of it is that version changes will be less common and more isolated in a more chunked model.

zbraniecki avatar Oct 31 '20 14:10 zbraniecki

There are still some open questions in my mind about how exactly to provision data across calendar systems, but that question is being tracked in #355.

sffc avatar Nov 04 '20 13:11 sffc

Shane to implement this along with #355.

sffc avatar Nov 06 '20 19:11 sffc

Blocked on #409 like #355

sffc avatar Dec 09 '20 00:12 sffc

Also migrate the date data provider structs to have real lifetime parameters.

sffc avatar Feb 09 '21 17:02 sffc

I'm going to punt this to Q2 because I want to wait for the work on availableFormats to stabilize. I don't see a need to introduce merge conflicts.

sffc avatar Feb 26 '21 07:02 sffc

stealing from Shane with his blessing.

zbraniecki avatar Jun 07 '21 21:06 zbraniecki

Re-opening to address the following remaining issues:

  1. Skeletons should be separated from patterns
  2. Time symbols should be separated from date symbols

sffc avatar Jun 11 '21 19:06 sffc

In #791 I'm introducing conditional symbols loading.

I imagine the next step would be to:

  • separate skeletons into their own key, and load it only if the options use components
  • separate time symbols from date symbols and adjust analyze_pattern to inform on which of the two (or both) needs to be loaded.
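
A rough sketch of how pattern analysis could report which symbol payloads are needed. This is not the actual `analyze_pattern` from the codebase; it is a simplified stand-in that assumes month (three or more letters), weekday, and era fields need date symbols, and day-period fields need time symbols.

```rust
/// Which symbol payloads a pattern needs (hypothetical type).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct RequiredSymbols {
    pub date: bool,
    pub time: bool,
}

/// Simplified stand-in for pattern analysis; the field classification
/// below is an assumption of this sketch.
pub fn analyze_pattern(pattern: &str) -> RequiredSymbols {
    let mut required = RequiredSymbols::default();
    let mut in_literal = false;
    let chars: Vec<char> = pattern.chars().collect();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c == '\'' {
            // Single quotes delimit literal text; '' toggles twice.
            in_literal = !in_literal;
            i += 1;
            continue;
        }
        // Length of the run of identical characters starting at i.
        let run = chars[i..].iter().take_while(|&&x| x == c).count();
        if !in_literal {
            match c {
                // Numeric months (M, MM) need no names; MMM and up do.
                'M' | 'L' if run >= 3 => required.date = true,
                'E' | 'G' => required.date = true,
                'a' | 'b' | 'B' => required.time = true,
                _ => {}
            }
        }
        i += run;
    }
    required
}
```

A formatter could then skip loading a symbols payload entirely when the corresponding flag is false, for example for a purely numeric pattern like `M/d/y`.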

zbraniecki avatar Jun 12 '21 14:06 zbraniecki

I think further splits should wait for the no-alloc provider, because only then will we be able to reason about actual wins.

zbraniecki avatar Jun 12 '21 18:06 zbraniecki

@sffc In your Oct 31 2020 comment you said:

Format Patterns

  1. datetime/patterns@1 covers long/medium/short date patterns, time patterns, and glue patterns
  2. datetime/skeletons@1 covers availableFormats: the mapping from skeletons to patterns

Can you help me reason about holding patterns@1 vs time_patterns@1, date_patterns@1 and date_time_patterns@1?

This split would make it easy to destructure each payload into a single pattern, but maybe at some point requiring DateTimePattern to ask for 4 to 6 traits and 3 to 4 payloads becomes costly? I'm not sure how to strike the best balance here.

zbraniecki avatar Oct 01 '21 14:10 zbraniecki

I'll do some of that for #519.

zbraniecki avatar Oct 01 '21 16:10 zbraniecki

I'm hoping that @gregtatum's design work today puts us on a path to eventually resolve these questions about what should be in the data keys. We should have a deeper discussion on this. In the meantime, the expedient thing is probably to avoid disrupting the data provider resource key layout in major ways while we are adopting ZeroVec.

sffc avatar Oct 01 '21 18:10 sffc

This should be one of the last things to do in 1.0 after DTF has stabilized.

We should do this before 1.0 because it impacts data file stability.

sffc avatar Jan 27 '22 18:01 sffc

I think this has two parts: one is the data keys for the ECMA-402 compatible components bag, and the other is the data keys for the ideal components bag. The 1.0 blocker is ensuring we have the best split for the ECMA-402 compatible components bag.

gregtatum avatar Mar 29 '22 21:03 gregtatum

Make sure to look at the data representation of the glue pattern and make changes if necessary for future-proofing. See #1131

sffc avatar May 27 '22 18:05 sffc

Action: @sffc to split off remaining work into a new 2.0 issue and close this one.

sffc avatar Jul 28 '22 17:07 sffc

We have split symbols from patterns and date from time; this is sufficient for the first release. I would still like to explore even more-granular splitting, but there's no time in 1.0 and we should coordinate this with the Ideal Components Bag work.

sffc avatar Jul 31 '22 05:07 sffc

The neo date time format stuff does this

Manishearth avatar Feb 23 '24 23:02 Manishearth