icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

Loading cache data for lunar calendars

Open sffc opened this issue 2 years ago • 9 comments

@Manishearth - This is 1.3 blocking because the API of a calendar that has no data versus one that does is different. We could theoretically remove constructors from these types for 1.3 and only do things through AnyCalendar. But then these APIs are functionally useless outside of AnyCalendar.

Parts:

  • Common:
    • [x] https://github.com/unicode-org/icu4x/pull/4051
    • [x] Prepare calendars for caching on ArithmeticDate https://github.com/unicode-org/icu4x/pull/4407
  • [x] Chinese / Dangi
    • [x] Add caching on ArithmeticDate https://github.com/unicode-org/icu4x/pull/4411
    • [x] Optimize data https://github.com/unicode-org/icu4x/pull/4465
    • [x] Add precomputed data loading https://github.com/unicode-org/icu4x/pull/4468
  • [x] Hebrew
    • [x] Add optimized Four Gates-based code https://github.com/unicode-org/icu4x/pull/4500
    • [x] Cache Keviyah on ArithmeticDate https://github.com/unicode-org/icu4x/pull/4504
    • [x] Consider adding a new() and deprecating new_calculating() (and the try_new...with_calendar(), if we are definitely not precomputing (otherwise add precomputing) https://github.com/unicode-org/icu4x/pull/4532
  • [ ] Islamic
    • [ ] Write optimized YearInfo code per-calendar (at least UAQ and observational), with a shared YearInfo type
    • [ ] Use in cache on ArithmeticDate
    • [ ] Add precomputed data

sffc avatar Aug 24 '23 17:08 sffc

There's a PR open for Chinese (https://github.com/unicode-org/icu4x/pulls), but islamic/hebrew also need similar work done.

Manishearth avatar Aug 24 '23 17:08 Manishearth

Name of the constructor that does not load data:

  • Chinese::new_without_data()
  • Chinese::new_calculating()
  • Chinese::new_without_calculated_data()
  • Chinese::new_runtime_calculating()
  • Chinese::new_always_calculating()

Discussion:

  • @robertbastian - I prefer short
  • @Manishearth - I don't prefer short

Conclusion: new_always_calculating() for the no-data constructor. new() is the constructor with compiled data.

Next steps:

  • The following calendars need work: Chinese, Dangi, all Islamic, maybe Hebrew
  • Make the following changes:
    • Change these calendars to have new_always_calculating() constructor
    • Make sure they are not Copy

It would be nice to have Chinese ready with the pre-calculated caches in 1.3 but does not need to block.

LGTM: @Manishearth @sffc @robertbastian

sffc avatar Sep 05 '23 18:09 sffc

#4051 solves this for the milestone, kicking the can to 1.4

Manishearth avatar Sep 19 '23 23:09 Manishearth

The plan is:

  • CalendarArithmetic takes in a &PrecomputedData type parameter on all of its methods. Most calendars use () https://github.com/unicode-org/icu4x/pull/4378
  • ArithmeticDate contains a YearInfo field which also defaults to (). Date mutations must update it.
  • The PrecomputedData is stored on the Calendar object, whereas the YearInfo, as a field of the ArithmeticDate, is stored on the Date. The calendar passes in PrecomputedData where necessary
    • Note: this is different from the current code; which stores both caches on the Date! This design is incorrect and duplicates the data.
  • Methods which can rely on YearInfo will do so. Methods which cannot must hit the precomputed data store to obtain year info.

One thing I'm not sure about is if YearInfo should ever differ from what is stored in precomputed data. Currently the YearInfo equivalent in tree for chinese is:

pub(crate) struct ChineseBasedCache {
    pub(crate) new_year: RataDie,
    pub(crate) next_new_year: RataDie,
    pub(crate) leap_month: Option<NonZeroU8>,
}

whereas the precomputed cache is:

pub(crate) struct ChineseBasedCompiledData {
    pub(crate) new_year: RataDie,
    /// last_day_of_month[12] = last_day_of_month[11] in non-leap years
    /// These days are 1-indexed: so the last day of month for a 30-day 一月 is 30
    /// The array itself is zero-indexed, be careful passing it self.0.month!
    last_day_of_month: [u16; 13],
    ///
    pub(crate) leap_month: Option<NonZeroU8>,
}

(This packs down to three bytes in PackedChineseBasedCompiledData)

If we are going to store such data live, we might as well store PackedChineseBasedCompiledData everywhere, which simplifies the architecture down to "load PackedChineseBasedCompiledData into YearInfo, if you can't load from precomputed data then compute on your own". In the long run this may mean we don't need to pass in PrecomputedData everywhere, though in the short run I'd like to grant us that freedom.

However, if we make that call now, we can avoid having to pass in the compiled data for everything other than the until() and offset_days() which is nice.

Manishearth avatar Nov 28 '23 23:11 Manishearth

Discussed a bit with @sffc:

In general we think that YearInfo and PrecomputedData will not need to differ on the data stored per-year (they may differ on the data storage format).

This means that ArithmeticDate only needs to worry about PrecomputedData for the offset/until functions, which is a far lower footprint.

The design becomes:

  • ArithmeticDate gets YearInfo, PrecomputedData, load_yearinfo(data: &PrecomputedData)
  • The offset/until functions take in data: &PrecomputedData. All other ArithmeticDate methods do not
  • All CalendarArithmetic methods are expected to hit YearInfo for their stuff. Callers are expected to preserve YearInfo when mutating years, ideally loaded via load_yearinfo().

The main downside here is that when working without loaded data, it may be slower to calculate the YearInfo for an entire year since we will have to calculate things for multiple months at a time. This is not a cost we ever have to pay right now. On the other hand, it will probably amortize nicely.

Another downside is that this will increase the size of Date whenever our largest calendar decides to cache more things. This is a price we think is ok to pay since Date is mostly ephemeral.

Manishearth avatar Nov 28 '23 23:11 Manishearth

ugh, found some gnarly stuff that will not be easy to optimize

https://github.com/unicode-org/icu4x/blob/c755786f6482daf0fb277ffdad001a53950082b3/components/calendar/src/chinese.rs#L311

This needs the number of days in the previous year. We might just also cache that but it's suboptimal, it's only needed for the rather rare case of week-of-year formatting, but it needs to be computed mandatorily when formatting

Manishearth avatar Dec 06 '23 02:12 Manishearth

@sffc We need to make a decision based on the benchmarking results:

Consider adding a new() and deprecating new_calculating() (and the try_new...with_calendar(), if we are definitely not precomputing (otherwise add precomputing)

For background, currently Hebrew must be constructed with new_always_calculating()

Proposal: We commit to never precomputing, add ::new(), make Hebrew constructable as a singleton struct, and deprecate the old APIs. For Hebrew only.

Approval:

  • [x] @sffc
  • [x] @Manishearth
  • [ ] (optional) @zbraniecki
  • [ ] (optional) @hsivonen

Manishearth avatar Jan 16 '24 01:01 Manishearth

Does keviyah work over all dates or are there cases where we would need to fall back to the book calculations? If keviyah is a true drop-in replacement, then yeah, we can just use it all the time and make Hebrew independent of data.

sffc avatar Jan 16 '24 18:01 sffc

All dates (in range, which is a range of ~i32 years I think: check the code)

Prior versions of the Hebrew calendar used slightly different four gates tables, Adjler documents them all, but the book does not handle that either, it implements the modern set of calculations.

The only question here is if it is fast enough for us to never want to add data loading. The benchmarks convince me but they are definitely at a level where I would not be surprised if others were not convinced.

Manishearth avatar Jan 16 '24 19:01 Manishearth

I've started working on the islamic ones. My current data model is to store month length booleans and then use the remaining 4 bits to store a new years offset from the mean synodic new year (year * 12 * MEAN_SYNODIC_MONTH). It's unlikely that that number will be super large.

Manishearth avatar Apr 02 '24 23:04 Manishearth