icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

Decide on potentially per-calendar behavior near extreme date values

Open Manishearth opened this issue 3 months ago • 39 comments

Our calendars currently use i32 year values, and there are a couple ways that that can cause arithmetic overflow. The simplest way is when converting between different eras: i32::MAX reiwa is going to be an out-of-range extended_year. More subtle cases happen due to use of various calendars for internal calculation, or things like "get the R.D. as milliseconds". Some of these cases can be solved by using wider integer types, but by and large i32::MAX is not really a useful year to try and shove into this API.

We should ideally never be debug asserting in a way reachable to users.

https://github.com/unicode-org/icu4x/pull/7062 attempts to improve the situation by adding coarse guards to pan-Calendar construction APIs: all year/extended_year inputs to from_fields/from_codes are now checked to be within ±2²⁷. It does not attempt to do anything about Date::try_new_from_specificcalendar, calendar conversions, or calendar arithmetic.

https://github.com/unicode-org/icu4x/pull/7065 goes further to remove most sources of overflow errors except the small ones really close to i32::MAX. There's an open question there about Chinese: whether we should do millisecond math in i128, or error on those large dates.

We can and should try and expand the coarse year range. @robertbastian thinks we can get it to be ±2³⁰ (i32::MAX / 2). That would in and of itself be quite good, I think it would require us to run the simple chinese approximation in i128.

Regardless of the expansion of the range, we have some decisions to make. Note that all of these basically impact really large dates, which don't necessarily actually matter too much. It mostly matters that applications can use this code without experiencing strange behavior when users input overlarge dates.

The main one is that whether these ranges should be per calendar or pan-calendar. Per calendar means that we can be more or less liberal for individual calendars, pan calendar means that all calendars. This would mean that calendar conversions cannot be used to produce dates that cannot be directly produced, and thus all dates continue to roundtrip without error.

#7062 avoids the question by making it an input constraint on specific APIs: it's not really trying to figure out whether it should be per-calendar or pan-calendar, it is just saying that any year input on a pan-calendar API has a rough constraint.

If we decide it should be per-calendar, what happens on calendar conversion?

In either case, what happens on out-of-range duration arithmetic?

Note that these answers are entangled with implementation detail: it's not sufficient to say that we have per-calendar R.D. ranges, because from an implementation perspective you may need to guard against something before you have an R.D. to work with.

Manishearth avatar Oct 13 '25 15:10 Manishearth

ugh, typed up a whole issue body and then clicked cancel by accident

Manishearth avatar Oct 13 '25 15:10 Manishearth

Note that these answers are entangled with implementation detail: it's not sufficient to say that we have per-calendar R.D. ranges, because from an implementation perspective you may need to guard against something before you have an R.D. to work with.

Generally this reason is why I think the only strategy we should have is the current one of restricting inputs: if someone creates an "out of range" date by calendar conversion, that's fine: it's not going to be too much larger than the range min/max. It's also better for documentation, when a calendar (or the crate) can declare the valid range for year or extended_year.

We can decide on whether the input restriction needs to be global or per calendar, or per-API even.

Manishearth avatar Oct 13 '25 17:10 Manishearth

cc @sffc

Manishearth avatar Oct 13 '25 17:10 Manishearth

I remember at some point I had a RataDie limit that was i64 / 256 or something, in order to give "plenty" of room for transient calculations to use more bits in the RataDie.

We complain about Temporal's limits a lot, and it "only" requires 200k years. We don't need to go too much higher than that.

I had suggested 1e6 extended years, because I prefer round numbers in the Decimal system for human-imposed limits.

sffc avatar Oct 13 '25 17:10 sffc

We complain about Temporal's limits a lot, and it "only" requires 200k years. We don't need to go too much higher than that.

FWIW I complain about Temporal's limits there because it is a spec and it mandates particular behavior there. I think specifications should be very careful about what they mandate, and it feels silly to mandate 200k years when most of the calendars do not have meaning in that range.

As a library I am more willing to try and exercise some freedom here. It's nice for our ISO calendar at least to have as much validity as possible. I'm not strongly on the side of "let's have a huge validity range", but it seems like if it's easy to attain, why not?

Manishearth avatar Oct 13 '25 18:10 Manishearth

I'd be fine making the limit RD-based instead of year-based.

+/- 1e8 RD? (the smallest power of 10 that is greater than the Temporal limit)

Each calendar is required to return an error when attempting to construct a value exceeding that limit. This can be implemented cleanly for most ICU4X calendars by adding a trait method to CalendarArithmetic and then checking the ranges in the ArithmeticDate constructor.

Date::try_from_fields continues to enforce its safe extended year range to a number that exceeds the limits of the RD range in all calendars. I think 1e6 still works if the RD limit is 1e8.

To summarize:

  1. User calls Date::try_from_fields.
  2. Date::try_from_fields checks in a calendar-agnostic way whether the extended or era year exceeds +/- 1e6. If it does, it immediately returns a RangeError.
  3. Date::try_from_fields calls calendar-specific code to convert the given fields to a calendar YMD.
  4. Date::try_from_fields calls calendar-specific code to check whether the calendar YMD, when converted to a RD, is within RD +/- 1e8. Calendars can optimize this check to avoid actually converting to an RD. If the check fails, return a RangeError.
  5. Done, return an Ok result.

sffc avatar Oct 13 '25 18:10 sffc

I think that's a decent plan, and I appreciate the implementation details (since they're a part of the challenge here) The two-part range check on years and on RDs would make it work. We'd need to be careful to insert the RD checks in all the right places, but it's doable.

Manishearth avatar Oct 13 '25 21:10 Manishearth

My post above focuses on try_from_fields (and try_new_from_codes which is a wrapper over try_from_fields). To address the other constructors (CC @robertbastian):

  • from_rata_die is infallible. This suggests that RataDie's constructors should enforce the same range. Unfortunately RataDie::new is also infallible. (This is an example of why I really didn't want to graduate RataDie as part of icu_calendar's API.) So, we should mark either Date::from_rata_die or RataDie::new as deprecated and introduce a fallible variant.
  • new_from_iso is infallible. This is fine, because Date<Iso> complies by these same bounds.
  • from_raw should probably be named from_raw_unchecked to emphasize that it is skipping invariant checks and can produce panicky code. I suggest a deprecate-and-rename.
  • try_new_buddhist, try_new_coptic, ... are already fallible, and they share some code with try_from_fields, so the bounds check should be implementable in those constructors.

sffc avatar Oct 14 '25 02:10 sffc

Slight preference for deprecating from_rata_die, but I recognize that's much more churn than deprecating RataDie::new(). I'd like calendrical_calculations to not be beholden to any RD limits if possible.

from_raw should probably be named from_raw_unchecked to emphasize that it is skipping invariant checks and can produce panicky code. I suggest a deprecate-and-rename.

Is it even possible to make a raw SomeCalendar::Inner from outside of icu_calendar? Most of these types are internally private and unconstructible, IIRC.

Manishearth avatar Oct 14 '25 06:10 Manishearth

from_rata_die already saturates at the range of ArithmeticDate, we can just have it saturate in a tighter range. I do not want to deprecate it.

Is it even possible to make a raw SomeCalendar::Inner from outside of icu_calendar?

No

robertbastian avatar Oct 14 '25 07:10 robertbastian

If we limited years to 2^23 (or 2^20 if we want to be more round), we could pack an ArithmeticDate into 4 bytes. Months need 4 bits, days need 5. This would require a little work when YearInfo is not i32, but I think it's feasible.

robertbastian avatar Oct 14 '25 13:10 robertbastian

add will also have to either saturate or return errors when it leaves the safe range.

robertbastian avatar Oct 14 '25 13:10 robertbastian

Seeking consensus on the following:

  • Date is defined to be any date between two specific RataDies, MIN and MAX.
  • All fallible functions in icu_calendar that return a Date must enforce this range and return an error if the input parameters would produce an RD exceeding this limit.
  • The only known infallible function that could return an out-of-bounds Date is Date::from_rata_die. It should saturate the RD to be between MIN and MAX.

Corollaries:

  • Date::add can and should exit early if the resulting Date would be out of range (#7077, #7207)
  • Calendar-specific code can assume that it is never given an RD that is too much bigger (~1 order of magnitude) than MIN and MAX. For example, the EastAsianTraditional approximation can continue to use debug-panicky math.

Open for discussion:

  • MIN is -1e8 RD, MAX is +1e8 RD.
  • Add a new function Date::try_from_rata_die that returns an error if the RD is out of range.
    • The error variant could have a saturation convenience function: Date::try_from_rata_die(rd).unwrap_or_else(DateFromRataDieError::saturate)
    • Or it could return a custom Result that derefs to a normal Result but adds a saturation function: Date::try_from_rata_die(rd).unwrap_or_saturate()
  • Also deprecate Date::from_rata_die.

sffc avatar Nov 05 '25 01:11 sffc

  • MIN is -1e8 RD, MAX is +1e8 RD.

I'd rather anchor this in round Gregorian dates than round Rata Die. I'd also prefer using a power of 2.

  • Add a new function Date::try_from_rata_die that returns an error if the RD is out of range.

    • The error variant could have a saturation convenience function: Date::try_from_rata_die(rd).unwrap_or_else(DateFromRataDieError::saturate)
    • Or it could return a custom Result that derefs to a normal Result but adds a saturation function: Date::try_from_rata_die(rd).unwrap_or_saturate()
  • Also deprecate Date::from_rata_die.

If we expose the bounds somewhere (Date::MAX_RATA_DIE), clients can do that check themselves if they really want to. In general I'd rather not make functions fallible just for the purpose of date range, because then callers that do normal things have to handle failures every single time.

With bit packing, the lowest representable date would be RD -2**22*365.25+1 (365.25 is the longest average year length of any calendar), which is larger in magnitude than both 1e10 and 2**30.

robertbastian avatar Nov 05 '25 11:11 robertbastian

Broadly in favor of this plan.

Don't have strong opinion on min/max RD.

Manishearth avatar Nov 05 '25 16:11 Manishearth

I want the bounds to be big enough to handle known users and not much bigger than that. Temporal is the known user with the largest range.

It appears #7219 makes the limit RD +/- 366e6, which is on the same order of magnitude that I had previously suggested (1e8), so it seems fine to me. I think I prefer using a "round" number in RD space rather than ISO Year space given that the bound is pan-calendar RD-based.

sffc avatar Nov 05 '25 20:11 sffc

FWIW the bound is also pan-calendar extended-year based in from_fields, but that is not the fundamental invariant, it constrains further than the fundamental invariant.

Manishearth avatar Nov 05 '25 21:11 Manishearth

Actually no, sorry, the year range is the fundamental invariant. The RD range is not fundamental, it just provides a way of doing an early prefilter.

So it makes sense for the year range to be "round"

Manishearth avatar Nov 05 '25 21:11 Manishearth

A more detailed review of #7219 suggests that the range is two-part: an extended year range of 1e6, coupled with an RD range of 3.66e8, where different constructors use different ranges. I left a comment in the PR asking whether we want to use a consistent range. There's also an assumption I haven't verified that the extended year range is narrower than the RD range in every calendar.

sffc avatar Nov 05 '25 21:11 sffc

@sffc No, the RD range is just an early check: from_rata_die is the only constructor applying the RD range, and that calls down to calendar code, which then will apply the year range.

Like I said, the year range is fundamental, the RD range is a prefilter. Its job is to avoid "mathematically problematic" values from reaching calendar-specific code. Calendar-specific code may then apply the real year range.

Manishearth avatar Nov 05 '25 21:11 Manishearth

There's also an assumption I haven't verified that the extended year range is narrower than the RD range in every calendar.

For example, is the Ethiopic Amete Alem date (-1e6, 1, 1) within the RD range? I think maybe it isn't. Haven't verified 100%.

sffc avatar Nov 05 '25 21:11 sffc

@sffc No, the RD range is just an early check: from_rata_die is the only constructor applying the RD range, and that calls down to calendar code, which then will apply the year range.

No, it calls a calendrical_calculations function and then uses ArithmeticDate::new_unchecked. The ranges are constructor-dependent.

sffc avatar Nov 05 '25 21:11 sffc

No, it calls a calendrical_calculations function and then uses ArithmeticDate::new_unchecked. The ranges are constructor-dependent.

Oh, you're talking about the individual from_rata_die impls.

Yeah that's exactly what I was talking about when I said I'm not fully convinced yet in https://github.com/unicode-org/icu4x/pull/7219#issuecomment-3493164700 :) But it was something I planned to go through and verify properly later.

Perhaps ArithmeticDate::new_unchecked should be clamping year?

The only problem is that cross-calendar conversions will be wobbly then. I'm kind of okay with that?

As mentioned in your plan above we should be deprecating from_rata_die anyway.

I would prefer for there to be a single fundamental range being applied. Perhaps in that case it must be R.D, but that forces us to check R.D. on all construction.

Manishearth avatar Nov 05 '25 21:11 Manishearth

The RD range is the fundamental range, the invariant in ArithmeticDate is that the date it represents is inside the RD range.

The year range is tighter than the RD range, for the purpose of communicating clear cross-calendar valid year ranges in constructors. If we applied RD ranges to YMD constructors, every calendar's constructor would have different ranges, and the ranges would not be on year boundaries. It's possible to create the max Gregorian date and convert that to Hebrew, whereas directly constructing this Hebrew date is impossible. We could allow it, but the added complexity in the constructor documentation is not worth it imho.

robertbastian avatar Nov 05 '25 21:11 robertbastian

The year range is tighter than the RD range

Doesn't this mean that there are dates which can be constructed but will not roundtrip with themselves? That seems unfortunate.

Manishearth avatar Nov 05 '25 21:11 Manishearth

if by round-trip you mean round-trip through YMD then yes. round trip though conversion to other calendars, no, those will always work, round trip through RD also.

robertbastian avatar Nov 05 '25 21:11 robertbastian

I would prefer for there to be a single fundamental range being applied. Perhaps in that case it must be R.D, but that forces us to check R.D. on all construction.

It's possible to create the max Gregorian date and convert that to Hebrew

I was thinking that we would have a const ArithmeticDate<Hebrew> that we pre-calculate to be the RD min/max (with a test) and then we can < / > against it. No conversion to RD is necessary

sffc avatar Nov 05 '25 21:11 sffc

if by round-trip you mean round-trip through YMD then yes

Yeah. It's a property I am not strongly opposed to losing, but I would like to make sure we acknowledge that we're losing that property.

Manishearth avatar Nov 05 '25 22:11 Manishearth

I was thinking that we would have a const ArithmeticDate<Hebrew> that we pre-calculate to be the RD min/max (with a test) and then we can < / > against it. No conversion to RD is necessary

I do not want to have a different range for each calendar, and I do not want the valid range to start in the middle of a year.

robertbastian avatar Nov 05 '25 22:11 robertbastian

I think I value "round-trips through YMD" higher than "range boundary is at a year boundary" but I'm willing to be convinced otherwise if we can articulate the pros and cons.

sffc avatar Nov 05 '25 22:11 sffc