Create an Ideal Components Bag / Skeleton for DateTimeFormat

Open gregtatum opened this issue 4 years ago • 22 comments

This is a meta issue to track implementing the "ideal components bag" as laid out in the DateTimeFormat Deep Dive 2021-10-01 design document. There was originally some discussion of having this replace the current components bag, but it is to be implemented alongside the existing components bag instead. A better name can be bikeshedded if needed.

The following need to be completed.

  • [ ] #1318
  • [ ] #1319
  • [ ] #1321
  • [ ] #1325
  • [ ] #1322
  • [ ] #1323
  • [ ] #1324

gregtatum avatar Nov 18 '21 17:11 gregtatum

@gregtatum will provide mentorship.

sffc avatar Jan 27 '22 19:01 sffc

I'm spreading the word about this issue looking for candidates.

More details:

Description: Currently, DateTimeFormat has two ways to select the right format, and both are imperfect. We believe we have a balanced, novel solution that, once implemented, will become the foundational way to use DateTimeFormat.

Scope: We believe the initial implementation should take one person 2-3 months, hopefully in time for ICU4X 1.0.

Mentorship: This project is well staffed on the mentorship side: @gregtatum from Mozilla, @sffc from Google, and @zbraniecki from Amazon are willing to invest time to mentor the engineer who picks it up.

How to start: If you are interested in the project, comment on this issue or join #icu4x on unicode-org.slack.com and we'll get you on-ramped.

zbraniecki avatar Feb 07 '22 22:02 zbraniecki

I'm interested to work on this issue.

ozghimire avatar Feb 10 '22 16:02 ozghimire

@gregtatum are you still open to mentor?

zbraniecki avatar Feb 10 '22 16:02 zbraniecki

If this issue is still open, I'm definitely interested to work on this.

pdogr avatar Feb 11 '22 14:02 pdogr

@ozghimire Great! How would you prefer to get started? There is a document linked above outlining the strategy which should discuss how to get things going. I would suggest starting with #1318. I will fill in more details on that issue.

@pdogr I think ozghimire is taking the first step on this to move it forward, and it's hard to parallelize this initial step, but there will probably be work to help out on around the issues. You could take another DateTimeFormat issue to get onboarded. I'm sure there will be opportunities to help in the short term. #1581 would be a good bug to onboard with if you wanted to take it.

gregtatum avatar Feb 15 '22 20:02 gregtatum

Hello @gregtatum, are you still looking for contributors?

randomicon00 avatar Mar 29 '22 23:03 randomicon00

Not sure if you're looking for feedback, but if there's a way you could improve the user-friendliness of the config, that would be really helpful. I'm a Rust newbie, and was pretty confused about how to use this feature. If I do something like this, I get errors about a long list of missing fields:

    let const DAYMONTH = components::Bag {
        year: Numeric,
        month: Short,
    }

Here's the equivalent in JS:

const date = new Date(Date.UTC(2012, 11, 20, 3, 0, 0));

const options1 = {
  month: "long",
  day: "numeric",
};

const options2 = {
  month: "short",
  day: "numeric",
};

const options3 = {
  month: "numeric",
  day: "numeric",
};

console.log("Option 1:")
console.log(date.toLocaleString("de-DE", options1));
console.log(date.toLocaleString("en-US", options1));
console.log(date.toLocaleString("es-PE", options1));
console.log("")
console.log("Option 2:")
console.log(date.toLocaleString("de-DE", options2));
console.log(date.toLocaleString("en-US", options2));
console.log(date.toLocaleString("es-PE", options2));
console.log("")
console.log("Option 3:")
console.log(date.toLocaleString("de-DE", options3));
console.log(date.toLocaleString("en-US", options3));
console.log(date.toLocaleString("es-PE", options3));

... which generates:

> "Option 1:"
> "19. Dezember"
> "December 19"
> "19 de diciembre"
> ""
> "Option 2:"
> "19. Dez."
> "Dec 19"
> "19 dic."
> ""
> "Option 3:"
> "19.12."
> "12/19"
> "19/12"

bdarcus avatar Jun 12 '23 00:06 bdarcus

@bdarcus Until https://github.com/rust-lang/rust/issues/70564 lands, you need to create a mutable components bag and then set your fields on it.

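use icu::datetime::options::components;

let mut components_bag = components::Bag::default();
// Each field is an `Option` of its own enum, so wrap the value in `Some`.
components_bag.year = Some(components::Year::TwoDigit);
components_bag.month = Some(components::Month::Short);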

sffc avatar Jun 12 '23 17:06 sffc

@sffc, @eggrobin, and I discussed @eggrobin's WIP skeleton work at https://github.com/unicode-org/icu4x/compare/main...eggrobin:icu4x:%CF%83%CE%BA%CE%B5%CE%BB%CE%B5%CF%84%CE%AC (https://github.com/unicode-org/icu4x/commit/716672be5bb7b6d3adf136d14fdeaf40c196b982 + previous commits)

The general plan moving forward is that skeleta will be represented using @eggrobin's design which essentially boils down to an enum for (day, time, date, datetime) + a length, plus additional timezone stuff. This makes for 12 possible time components, 9 day components, 8 non-day date components (so 17 total date components), and 12 × 17 combined DateTime components, with three lengths for each.

There is a fallback algorithm that CLDR uses, which is implemented in ICU4X as get_best_available_format_pattern. We move this to datagen and perform the simpler subset of the fallback algorithm that falls back between lengths. In other words, we always generate data for each of the 12/17 skeleta and use the fallback algorithm to find suitable replacement data when not present, but we do not necessarily generate data for each of the lengths.
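As a quick check on the counts above, here is a small sketch of the arithmetic. The constant and function names are made up for illustration. The last figure is only an upper bound under the assumption that every date and time skeleton stores all three lengths, ignoring plural variants and any combined datetime entries.

```rust
// Counts quoted in the discussion above.
const TIME_SKELETA: usize = 12;
const DAY_SKELETA: usize = 9;
const NON_DAY_DATE_SKELETA: usize = 8;
const LENGTHS: usize = 3;

/// Day plus non-day date skeleta.
const fn date_skeleta() -> usize {
    DAY_SKELETA + NON_DAY_DATE_SKELETA
}

/// Every time skeleton combined with every date skeleton.
const fn datetime_skeleta() -> usize {
    TIME_SKELETA * date_skeleta()
}

/// Upper bound on stored date and time patterns if no length is dropped
/// (assumption: one pattern per skeleton per length, no plural variants).
const fn max_stored_patterns() -> usize {
    (TIME_SKELETA + date_skeleta()) * LENGTHS
}
```

Because datagen may omit lengths and rely on fallback, the real number of stored patterns can be lower than this bound.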

For the data model, we can store Date/Time/DateTime as separate keys, with the first two having a data model of:

/// For Date/Time only, not datetime
struct PackedSkeletonData<'data> {
   pub indices: ZeroVec<'data, SkeletonDataIndex>, // len = 12 for time, 17 for date
   pub patterns: VarZeroVec<'data, PatternPluralsULE>,
}

// conceptually:
// {
//   has_long: bool,
//   has_medium: bool,
//   has_short: bool,
//   index: u16, // index of first pattern (long if present, else medium, else short)
// }
#[derive(ULE)]
struct SkeletonDataIndex(u16);


struct DateTimeSkeletons<'data> {
   // will typically be small, there are only a couple special cases like E B h m
   map: ZeroMap<'data, Skeleton, PatternPluralsULE>, 
}

For date or time lookup, based on the skeleton we index into the indices array and perform fallback on the available lengths in the metadata. The data is stored contiguously as [long?, medium?, short?] so we can calculate its index by offsetting from the base index, and then fetching.

For datetime lookup, we first index into the DateTimeSkeletons map, and if not present, we then go fetch the individual date and time data and glue them together using the glue from the datetime lengths data.
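The two lookups described above could be sketched roughly as follows. This is an illustration only, not the ICU4X implementation: it uses plain slices and a `HashMap` in place of `ZeroVec`/`ZeroMap`, and the bit layout (three flag bits on top, 13 index bits below), the fallback order across lengths, and the names `lookup_pattern` and `lookup_datetime` are all assumptions.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Length {
    Long,
    Medium,
    Short,
}

/// Packed metadata for one skeleton (assumed layout: 3 flags + 13-bit index).
#[derive(Clone, Copy, Debug)]
struct SkeletonDataIndex(u16);

impl SkeletonDataIndex {
    const HAS_LONG: u16 = 1 << 15;
    const HAS_MEDIUM: u16 = 1 << 14;
    const HAS_SHORT: u16 = 1 << 13;
    const INDEX_MASK: u16 = (1 << 13) - 1;

    fn new(has_long: bool, has_medium: bool, has_short: bool, index: u16) -> Self {
        assert!(index <= Self::INDEX_MASK, "index must fit in 13 bits");
        let mut bits = index;
        if has_long { bits |= Self::HAS_LONG; }
        if has_medium { bits |= Self::HAS_MEDIUM; }
        if has_short { bits |= Self::HAS_SHORT; }
        Self(bits)
    }

    fn has(self, length: Length) -> bool {
        let flag = match length {
            Length::Long => Self::HAS_LONG,
            Length::Medium => Self::HAS_MEDIUM,
            Length::Short => Self::HAS_SHORT,
        };
        (self.0 & flag) != 0
    }

    /// Position of the pattern for `length`, if that length is stored.
    fn position(self, length: Length) -> Option<usize> {
        if !self.has(length) {
            return None;
        }
        let base = usize::from(self.0 & Self::INDEX_MASK);
        // Patterns are laid out as [long?, medium?, short?], so the offset is
        // the number of stored lengths that come before the requested one.
        let offset = match length {
            Length::Long => 0,
            Length::Medium => usize::from(self.has(Length::Long)),
            Length::Short => {
                usize::from(self.has(Length::Long)) + usize::from(self.has(Length::Medium))
            }
        };
        Some(base + offset)
    }
}

/// Date-only or time-only lookup with fallback across lengths.
/// The fallback order here is an assumption, not taken from CLDR.
fn lookup_pattern<'a>(
    indices: &[SkeletonDataIndex],
    patterns: &[&'a str],
    skeleton: usize,
    requested: Length,
) -> Option<&'a str> {
    let entry = *indices.get(skeleton)?;
    let order = match requested {
        Length::Long => [Length::Long, Length::Medium, Length::Short],
        Length::Medium => [Length::Medium, Length::Short, Length::Long],
        Length::Short => [Length::Short, Length::Medium, Length::Long],
    };
    order
        .iter()
        .find_map(|&len| entry.position(len))
        .and_then(|i| patterns.get(i).copied())
}

/// Datetime lookup: use a special-case entry if one exists, otherwise glue
/// the date and time patterns together ({1} = date, {0} = time, as in CLDR).
fn lookup_datetime(
    overrides: &HashMap<&str, &str>,
    skeleton: &str,
    date_pattern: &str,
    time_pattern: &str,
    glue: &str,
) -> String {
    match overrides.get(skeleton) {
        Some(pattern) => (*pattern).to_string(),
        None => glue.replace("{1}", date_pattern).replace("{0}", time_pattern),
    }
}
```

With this layout, a skeleton that only stores a medium pattern still resolves a request for the long length, because the lookup walks the fallback order until it finds a stored length.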

Manishearth avatar Jul 06 '23 16:07 Manishearth

When we fix this we should also fix https://github.com/unicode-org/icu4x/issues/3762

Manishearth avatar Oct 19 '23 17:10 Manishearth

  • @sffc - How can we change the encoding to flatten PatternPlurals into this index lookup?
  • @zbraniecki
[12][K][V][K2][V2]
* K - Plural/Declension/Etc
* V - 0, 1, 2, 3 - Plural Form
  • @sffc - Right now we have 16 bits, of which 3 are for length. What if we used an additional 5 to encode the plural variants?
Key:
[all have full]
- has_long
- has_medium
- has_short
[all have other]
- has_zero
- has_one
- has_two
- has_few
- has_many

Or another model:

[all have full]
- has_long
- has_medium
- has_short
- has_six_plurals

Or make a model that stores different sets of plurals in only 2 bits.
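To make the first model concrete, here is a rough sketch of packing the flags into one `u16`. Everything about the layout is an assumption for illustration: 3 length flags, then 5 plural flags, leaving 8 bits for the base index. It also assumes that the "full" length and the "other" plural form are always stored, and that every stored length carries the same set of plural forms. The type name `PackedIndex` is hypothetical.

```rust
/// Hypothetical 16-bit index word: [3 length flags][5 plural flags][8-bit index].
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct PackedIndex(u16);

impl PackedIndex {
    // Plural flags; `other` is implied and needs no bit.
    const ZERO: u8 = 1 << 0;
    const ONE: u8 = 1 << 1;
    const TWO: u8 = 1 << 2;
    const FEW: u8 = 1 << 3;
    const MANY: u8 = 1 << 4;

    fn new(length_flags: u8, plural_flags: u8, index: u8) -> Self {
        assert!(length_flags <= 0b111, "only 3 length flags");
        assert!(plural_flags <= 0b1_1111, "only 5 plural flags");
        Self((u16::from(length_flags) << 13) | (u16::from(plural_flags) << 8) | u16::from(index))
    }

    fn length_flags(self) -> u8 {
        (self.0 >> 13) as u8
    }

    fn plural_flags(self) -> u8 {
        ((self.0 >> 8) & 0b1_1111) as u8
    }

    fn index(self) -> u8 {
        (self.0 & 0xFF) as u8
    }

    /// Stored lengths: the implied `full` plus one per set length flag.
    fn stored_lengths(self) -> u32 {
        1 + self.length_flags().count_ones()
    }

    /// Plural patterns per length: the implied `other` plus one per set flag.
    fn patterns_per_length(self) -> u32 {
        1 + self.plural_flags().count_ones()
    }

    /// Total patterns this entry covers, under the assumptions above.
    fn total_patterns(self) -> u32 {
        self.stored_lengths() * self.patterns_per_length()
    }
}
```

One cost of this layout is that only 8 bits remain for the base index, which caps it at 255; the alternative models listed above trade flag bits for a larger index range.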

@sffc and @Manishearth to work on this after finishing neo symbols.

sffc avatar Nov 09 '23 23:11 sffc

https://github.com/unicode-org/icu4x/issues/1317#issuecomment-1623963015 is a design for how to store skeletons in the data file, but it doesn't directly address the question of knowing ahead of time which names and name lengths to include.

With semantic skeleta, are any of the following invariants true (across all locales and calendars)?

  1. If the skeleton does not have Weekday, then the pattern does not have Weekday.
  2. If the skeleton has a short Weekday, then the pattern has a short Weekday.
  3. If the skeleton has a long Weekday, then the pattern has a long Weekday.
  4. If the skeleton does not have Month, then the pattern does not have Month.
  5. If the skeleton has a numeric Month, then the pattern has a numeric Month. (trick question! I know this one happens to be false in the Hebrew calendar)
  6. If the skeleton has a short spellout Month, then the pattern has a short spellout Month.
  7. If the skeleton has a long spellout Month, then the pattern has a long spellout Month.

And similar for Era, Day Period, and Time Zone.

Depending on which of these invariants work out, we should be able to have static analysis of a skeleton to produce an auto-sliced data bundle.

sffc avatar Mar 05 '24 09:03 sffc

My understanding from @eggrobin on the above questions is:

  1. Including Day-of-Month could imply including Weekday, because they are both ways of representing specific dates
  2. Including Month does not imply including Weekday, because a weekday represents a specific date, not a month
  3. Can't make any guarantees about the width of the fields

sffc avatar Mar 11 '24 19:03 sffc

Notes from this topic in the ICU4X-TC meeting on 2024-07-11:

https://docs.google.com/presentation/d/1qXxBv4DVnqfBSpGt9ikVQLk9M0LX65O9lWvDo0pH9SU/edit#slide=id.p

  • @zbraniecki - Is this a way to get weekday display names?
  • @sffc - LDML defines LLLL, and NeoFormatter will get that. It also defines standalone weekday (maybe c?).
  • @zbraniecki - User story: I want to collect the names of all 7 weekdays in the Gregorian calendar. How do I do this? In DateTimeFormatter I need to find a date that is a Monday and then add 1 day and query each time, right?
  • @mihnita - There are many sigils for Timezone in CLDR (7 different ones?). How do you cover them?
  • @sffc - Time zones are a can of worms we can discuss on a different day. I have a solution, I think.
  • @macchiati - It's valid to request a specifically numeric date
  • @sffc - That's what Short says: A short date; typically numeric, as in “1/1/2000”
  • @macchiati - That seems reasonable if we strengthen the docs a bit
  • @robert - what can go wrong in the fallible formatter?
  • @sffc - Not sure, it may fail
  • @robert - We should know what may fail. And we may want to add a writeable infallible formatter that does debug assertions.
  • @sffc - I'm comfortable with that.
  • @zbraniecki - You could generate the markers out of CLDR at build time if you wanted, right?
  • @sffc - Yes
  • @zbraniecki - What performance metrics have you been optimizing for?
  • @sffc - Memory, I looked at performance and it has been steady, but my focus is on memory
  • @sffc - For a single skeleton I got the memory from 6KB stack size to 560 bytes stack size, a 90% win
  • @zbraniecki - How does it compare to ICU4C?
  • @sffc - Even the old ICU4X 1.0 beats ICU4C hands down. This is just another 10x improvement.
  • @sffc - Do we have consensus that this is the direction we want to go with ICU4X 2.0 datetime formatting?
  • @zbraniecki - I'm okay with it even given the limitations; we should document them.
  • @mihnita - Some of these decisions are UX decisions.
  • @younies - I like the design, but if we extend it for currency and units... it's not clear how it extends. I like how it would work for a subset of units. Maybe for common things like duration.
  • @sffc - To address @mihnita's concern, we have a path to add new entries to the enum, if we feel they are legitimate use cases. Custom patterns are an escape hatch until CLDR approves new semantic skeleta.
  • @younies - When we are shipping customized data for the user, is that a security issue?
  • @sffc - Data is keyed on CLDR version and code, not on user.
  • @younies - Sounds good. I approve this.
  • @zbraniecki - I am also comfortable with this. Thank you for your work on it. I'm quite happy with the memory savings.
  • @mihnita - I agree that classical skeleta let you do a lot of bad things. I'm just not convinced semantic skeleta are sufficiently general. And custom pattern is bad i18n.
  • @sffc - Do we have agreement on this: "ICU4X implements semantic skeleta in Rust as presented today, pending CLDR approval of the semantic skeleta proposal."
  • @mihnita - The idea of having predefined skeleta seems good. It seems better if, in addition, ICU4X allowed developers an escape hatch for classical skeleta.
  • @sffc - To start, I would rather push people strongly to semantic skeleta, so that we can hear clearly where they don't fit their use case. So I would like to remove classical skeleta from the API in 2.0. We can re-evaluate when we have clear user needs.
  • @mihnita - How do you think this integrates with MessageFormat 2.0? It only needs date/time full/long/medium/short.
  • @sffc - Those are supported in my ICU4X proposal.

Statement seeking consensus:

  1. The ICU4X-TC approves the overall design of the ICU4X Rust code implementing semantic skeleta.
  2. If CLDR-TC approves semantic skeleta for LDML 46, the ICU4X-TC approves replacing the existing date time formatting classes with the new semantic skeleta formatting classes in the ICU4X 2.0 release.

sffc avatar Jul 11 '24 19:07 sffc