Should TimeZoneIdMapper support efficient storage and retrieval of non-canonical IANA IDs?

Open sffc opened this issue 1 year ago • 9 comments

See https://github.com/unicode-org/icu4x/issues/5533#issuecomment-2380240259

Currently, TimeZoneIdMapper has two data payloads:

  1. Trie: IANA string to BCP-47 ID (does NOT support random access: only key-to-value lookup)
  2. Map: BCP-47 ID to canonical IANA string (random access supported)

These data structures allow efficient implementation of the following operations:

  • IANA to BCP-47
  • BCP-47 to IANA
  • Normalize IANA
  • Canonicalize IANA

However, it does not support efficient storage or retrieval of a non-canonical IANA name. For example, it supports mapping from "europe/kiev" to any of the following:

  • "Europe/Kiev" returned as an owned String
  • "uaiev" returned as a TinyStr
  • "Europe/Kyiv" returned as &str

But we don't have a representation that allows the user to return "Europe/Kiev" later: only "Europe/Kyiv". The client would need to store the string on their own.
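
To make the limitation concrete, here is a minimal sketch of the two payloads using standard-library maps. `ToyMapper` and its methods are hypothetical stand-ins, not ICU4X's real types, and the table holds only the sample row from this issue.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for the two payloads (hypothetical, not ICU4X's data structures).
struct ToyMapper {
    /// Payload 1: lowercased IANA string -> BCP-47 ID (key-to-value lookup only).
    trie: BTreeMap<&'static str, &'static str>,
    /// Payload 2: BCP-47 ID -> canonical IANA string.
    map: BTreeMap<&'static str, &'static str>,
}

impl ToyMapper {
    fn new() -> Self {
        Self {
            trie: BTreeMap::from([("europe/kiev", "uaiev"), ("europe/kyiv", "uaiev")]),
            map: BTreeMap::from([("uaiev", "Europe/Kyiv")]),
        }
    }

    /// IANA -> BCP-47, ignoring ASCII case.
    fn iana_to_bcp47(&self, iana: &str) -> Option<&'static str> {
        self.trie.get(iana.to_ascii_lowercase().as_str()).copied()
    }

    /// BCP-47 -> canonical IANA. The alias the caller started with is gone.
    fn bcp47_to_iana(&self, bcp47: &str) -> Option<&'static str> {
        self.map.get(bcp47).copied()
    }
}
```

Parsing "Europe/Kiev" yields "uaiev", and converting "uaiev" back yields "Europe/Kyiv": nothing in either payload remembers which alias was supplied.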

A few ways to enable this behavior with data:

  1. Change the map from BCP-47 ID to IANA string to also include non-canonical aliases, and return an opaque integer of some sort to select the correct alias
  2. Switch from using a Trie to using a Map so that random access is supported, and then let the client save the index of their non-canonical time zone in the payload
  3. Add a third payload where this is its only job

One concern I have is that any of these approaches involves returning an index of some sort into a structure. However, that structure is ephemeral and could change across ICU4X versions, so I don't want people storing these indices. We currently always return BCP-47 IDs because those have a stability policy.

An alternative solution might be to intern the strings, which could be done externally to ICU4X. If you need to save the non-canonical ID, just stuff it into your own static string interning structure that deduplicates. The total size of the interned string collection is bounded. An example utility for this is elsa::FrozenIndexSet.
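
A rough sketch of that userland interning idea, using only the standard library rather than `elsa::FrozenIndexSet` so that it stays self-contained. `IdInterner` is a hypothetical helper owned by the caller; a real implementation might prefer a structure that hands out `&str` without needing `&mut self`.

```rust
use std::collections::HashMap;

/// Minimal deduplicating string interner (hypothetical, owned by the caller).
#[derive(Default)]
struct IdInterner {
    strings: Vec<String>,
    lookup: HashMap<String, u16>,
}

impl IdInterner {
    /// Stores `id` at most once and returns a small key for it.
    fn intern(&mut self, id: &str) -> u16 {
        if let Some(&key) = self.lookup.get(id) {
            return key;
        }
        // The set of IANA IDs is small and bounded, so a u16 key is assumed to suffice.
        assert!(self.strings.len() < usize::from(u16::MAX));
        let key = self.strings.len() as u16;
        self.strings.push(id.to_owned());
        self.lookup.insert(id.to_owned(), key);
        key
    }

    /// Resolves a key previously returned by `intern` on this same instance.
    fn resolve(&self, key: u16) -> Option<&str> {
        self.strings.get(usize::from(key)).map(String::as_str)
    }
}
```

The keys are only meaningful for the interner instance that produced them, so they should not be persisted.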

Thoughts @justingrant?

sffc avatar Sep 27 '24 22:09 sffc

Thanks @sffc for starting this conversation. I don't yet have an opinion about the best solution here, but I've got some initial questions/comments below.

The main guidance I'd have (based on my experience with time zone canonicalization in ECMAScript) is that a time zone API should enable an efficient way to answer the following questions/usecases:

  1. Is this string a valid IANA ID?
  2. Do these two IANA IDs represent the same time zone, meaning they are aliases of each other in IANA? Note that this comparison should happen via a library method to dissuade users from using a naive string comparison, and because the data underlying this comparison will change across ICU4X versions.
  3. What is the normalized IANA ID for this IANA ID? (IANA strings should always be normalized before use, because they're case-insensitive)
  4. What is the list of time zones that I should use to build a UI picker? The list must include only one entry for each canonical IANA ID.
  5. How should I efficiently store a large number of time zones in memory, assuming that canonicalization *is* OK? Canonicalization is OK when the zone never needs to be converted back into an IANA ID.
  6. How should I efficiently store a large number of time zones in memory when canonicalization is *not* OK, like when a user's choice of IANA ID needs to be persisted and we want to respect the user's initial choice, despite future changes to IANA like Kiev=>Kyiv.

Does this list sound correct?

Does that list of use cases influence the proposed solutions in your OP?

Other than the last one in the list, are there others that current ICU4X doesn't meet?

that structure is ephemeral and could change across ICU4X versions

Well, we could choose to make the indexes stable over time if we wanted to (with newly-added zones being given new, larger indexes than existing zones), at the cost of being unable to use the indexes to provide a lexicographic sort order.

In other words from the caller's perspective they'd be IDs not indexes and would need to be documented as such. I'm not suggesting that we should make them stable, but we could.

If we did do this, then it'd be smart to randomize the order to avoid clients fooling themselves into thinking that the indexes were sorted.

If we choose to make an unstable ID visible to callers, then maybe naming it something like unstableId would be smart.

  • "uaiev" returned as a TinyStr

I assume you mean TinyAsciiStr?

An alternative solution might be to intern the strings, which could be done externally to ICU4X. If you need to save the non-canonical ID, just stuff it into your own static string interning structure that deduplicates. The total size of the interned string collection is bounded. An example utility for this is elsa::FrozenIndexSet.

One variant on this could be for ICU4X to provide some method that dynamically creates a data structure that callers will want, instead of requiring every caller to figure this out on their own.

justingrant avatar Oct 03 '24 00:10 justingrant

1-5 are all supported with current data, although we don't have an API yet for 4. But we already have a list of all canonical BCP-47 IDs in the data.

I'm definitely not in favor of returning any stable integer, because that just invites people to store them, and then we've created a new taxonomy. I think if we decided to support this, returning a string reference is about as far as I'd want to go.

I'm open to the idea of us using a string interning data structure internally.

sffc avatar Oct 03 '24 03:10 sffc

I'm definitely not in favor of returning any stable integer

I don't know enough Rust to know if this idea is practical, but could the memory savings be achieved without letting callers access the actual index value by wrapping the index as a private field in a public struct?

For example, a NamedTimeZoneId struct with an index: u16 private field, and with some public methods in the impl like these?

  • from_IANA factory method that accepts a string. Could also have from_BCP47.
  • to_IANA instance method that returns the IANA name as an (interned?) string. Could also have to_BCP47.
  • get_all_IANA static method to return the full list of normalized IANA IDs. Could also have get_all_BCP47.
  • Maybe an instance method that accepts a NamedTimeZoneId and returns true if both the argument and Self resolve to the same canonical ID, false otherwise. Could also be a static method.
  • For serialization/deserialization, use the IANA string.

Apologies if my Rust inexperience is showing here, if this is not practical.
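
For illustration, the wrapped-index idea could look roughly like the sketch below. Everything here is an assumption: the `ZONES` table is a toy with sample rows from this thread, the method names follow Rust's snake_case convention rather than the spellings above, and a global table is used only to keep the example short.

```rust
/// Toy table of (normalized IANA ID, BCP-47 ID). Sample rows only.
const ZONES: &[(&str, &str)] = &[
    ("Europe/Berlin", "deber"),
    ("Europe/Kiev", "uaiev"),
    ("Europe/Kyiv", "uaiev"),
    ("Europe/Oslo", "noosl"),
];

/// Public type whose index is private, so callers cannot read or store the raw number.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NamedTimeZoneId {
    index: u16,
}

impl NamedTimeZoneId {
    /// Accepts any letter case; returns `None` for unknown IDs.
    pub fn from_iana(iana: &str) -> Option<Self> {
        ZONES
            .iter()
            .position(|(name, _)| name.eq_ignore_ascii_case(iana))
            .map(|i| Self { index: i as u16 })
    }

    /// The normalized (not canonicalized) IANA name.
    pub fn to_iana(self) -> &'static str {
        ZONES[usize::from(self.index)].0
    }

    pub fn to_bcp47(self) -> &'static str {
        ZONES[usize::from(self.index)].1
    }

    /// True if both IDs resolve to the same zone, even when the names differ.
    pub fn same_zone(self, other: Self) -> bool {
        self.to_bcp47() == other.to_bcp47()
    }
}
```

Note that `==` on the struct compares the preserved name, while `same_zone` compares zone identity; serialization would go through `to_iana`.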

justingrant avatar Oct 03 '24 05:10 justingrant

Yeah, we could return a wrapped thing, or a string reference, if we have the data.

sffc avatar Oct 03 '24 12:10 sffc

How many bytes does a string reference cost in Rust? Would a wrapped u16 be cheaper?

justingrant avatar Oct 04 '24 00:10 justingrant

Well, for the best user experience, a wrapped u16 would still want to retain a reference to the string interning structure or data payload so that it can be dereferenced to a string, so it ends up being the same size as a string reference. I feel like at that point we're splitting hairs; if you really want to just store a u16 and nothing else, then you should just do it yourself.
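
The size argument can be checked with `std::mem::size_of`. In this sketch, `Table`, `BareIndex`, and `Handle` are hypothetical types; the exact layout of `Handle` is not guaranteed by the language, but on typical targets it is padded to two machine words, the same as `&str`.

```rust
use std::mem::size_of;

/// Hypothetical table of interned IDs, owned by some object (not a global).
struct Table {
    names: Vec<String>,
}

/// A bare index: tiny, but it cannot turn itself back into a string.
struct BareIndex(u16);

/// An index that also borrows its table so that it can resolve itself.
struct Handle<'a> {
    index: u16,
    table: &'a Table,
}

impl<'a> Handle<'a> {
    fn as_str(&self) -> &'a str {
        &self.table.names[usize::from(self.index)]
    }
}

/// Returns the sizes in bytes of `&str`, `BareIndex`, and `Handle`.
fn sizes() -> (usize, usize, usize) {
    (
        size_of::<&str>(),
        size_of::<BareIndex>(),
        size_of::<Handle<'static>>(),
    )
}
```

On a 64-bit target this typically reports 16, 2, and 16 bytes, which is why the self-resolving wrapper saves nothing over a string reference.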

sffc avatar Oct 04 '24 00:10 sffc

I was assuming that the list of interned strings would be immutable and global so wouldn't need a reference to it. But this assumption might be wrong?

justingrant avatar Oct 04 '24 00:10 justingrant

I was assuming that the list of interned strings would be immutable and global so wouldn't need a reference to it. But this assumption might be wrong?

We have a policy of no global caches in ICU4X (unlike ICU4C) because there's no one size fits all solution to caching. So the cache would need to be owned by an instance of an object, and the u16 key would need to always be passed into the same instance...

Would you be happy enough with a cookbook example in our docs about how to do this in userland?

sffc avatar Oct 04 '24 01:10 sffc

Oh, OK. If global caches are out of scope, then yes a cookbook example makes sense for this case, at least as a starting point to see if it's ergonomic enough.

Ignoring the u16 vs. interned IANA string vs. BCP-47 TinyAsciiStr which is just an optimization, my main concern would be that if the most ergonomic solution is to canonicalize IDs, then it's easier for the ecosystem to end up in a place where renaming an IANA zone causes lots of code to break.

At a minimum, it seems like there should be stable BCP47 IDs for all non-canonical IANA zones, and that ICU4X should not canonicalize by default. So users would have to opt into canonicalizing. IIRC <25% of IANA IDs are Links (aka non-canonical), in case that matters in measuring the impact of not canonicalizing.

justingrant avatar Oct 04 '24 21:10 justingrant

At a minimum, it seems like there should be stable BCP47 IDs for all non-canonical IANA zones, and that ICU4X should not canonicalize by default. So users would have to opt into canonicalizing. IIRC <25% of IANA IDs are Links (aka non-canonical), in case that matters in measuring the impact of not canonicalizing.

@sffc where did this end up?

justingrant avatar Jan 25 '25 00:01 justingrant

@robertbastian has been thinking a lot about this

sffc avatar Jan 31 '25 16:01 sffc

I don't think the list of use cases sounds correct. I think it gets too hung up on IANA IDs, when all they are is a way to identify a particular "zone". If both Europe/Kiev and Europe/Kyiv point to the same tzif file (they do), and produce the same display names in UTS-35 (they do), then we should be allowed to use them interchangeably. Other users of IANA IDs should also treat them interchangeably, and if they don't, that's their problem (IANA after all defines them as links).

Is this string a valid IANA ID?

This is equivalent to asking "does TZDB provide data for this string?", which is equivalent to asking "does CLDR define a time zone that this string aliases to?". We maintain this to be equivalent, and ICU4X answers the latter question in order to answer the former.

Do these two IANA IDs represent the same time zone, meaning they are aliases of each other in IANA? Note that this comparison should happen via a library method to dissuade users from using a naive string comparison, and because the data underlying this comparison will change across ICU4X versions.

We actually don't answer this question at the moment, and it's not the right question to ask for i18n anyway. Europe/Oslo is an alias to Europe/Berlin in IANA, but noosl and deber are separate time zones in CLDR (which is good). The question you can answer with ICU4X is "do these two IDs alias to the same CLDR time zone, i.e. point to the same tzif file and produce the same display names in UTS-35?" (the original question is equivalent to "have these two zones agreed on local time since 1970-01-01?").
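
The difference between the two questions can be sketched with a toy table. The rows below encode only what this comment states (Oslo links to Berlin in IANA, while `noosl` and `deber` are distinct in CLDR); the function names are hypothetical.

```rust
/// Toy rows: (IANA ID, IANA link target as described above, CLDR short ID).
const ROWS: &[(&str, &str, &str)] = &[
    ("Europe/Berlin", "Europe/Berlin", "deber"),
    ("Europe/Oslo", "Europe/Berlin", "noosl"),
    ("Europe/Kyiv", "Europe/Kyiv", "uaiev"),
    ("Europe/Kiev", "Europe/Kyiv", "uaiev"),
];

/// Returns (IANA link target, CLDR short ID) for a known ID, ignoring ASCII case.
fn find(id: &str) -> Option<(&'static str, &'static str)> {
    ROWS.iter()
        .find(|r| r.0.eq_ignore_ascii_case(id))
        .map(|r| (r.1, r.2))
}

/// "Are these aliases of each other in IANA?"
fn same_iana_zone(a: &str, b: &str) -> Option<bool> {
    Some(find(a)?.0 == find(b)?.0)
}

/// "Do these alias to the same CLDR time zone?"
fn same_cldr_zone(a: &str, b: &str) -> Option<bool> {
    Some(find(a)?.1 == find(b)?.1)
}
```

With this data, Oslo and Berlin are the same zone by the IANA question but different zones by the CLDR question, while Kiev and Kyiv are the same under both.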

What is the normalized IANA ID for this IANA ID? (IANA strings should always be normalized before use, because they're case-insensitive)

I don't see the use case for this. If IANA IDs are case-insensitive, no consumer should care whether they receive europe/berlin, eUrO_pE/beRLIN, or Europe/Berlin. And bit-by-bit string comparison is wrong due to aliases anyway.

What is the list of time zones that I should use to build a UI picker? The list must include only one entry for each canonical IANA ID.

That is not the list you want to use for a UI picker, due to the Oslo/Berlin thing. You want to use the list of time zones defined in CLDR (which ICU4X can give you, represented by short IDs).

How should I efficiently store a large number of time zones in memory, assuming that canonicalization is OK? Canonicalization is OK when the zone never needs to be converted back into an IANA ID.

I think this use case needs more context. Do you want to deduplicate by CLDR identity (equal display names), or IANA identity (equal local time since 1970-01-01)? Storing CLDR identities should always be safe, as it only collapses zones that IANA also collapses, and you can always recover the canonical IANA ID from it (although a canonical IANA ID should never be required by any code consuming IANA IDs, as they are not stable).

How should I efficiently store a large number of time zones in memory when canonicalization is not OK, like when a user's choice of IANA ID needs to be persisted and we want to respect the user's initial choice, despite future changes to IANA like Kiev=>Kyiv.

IANA canonicalization is never OK for i18n purposes, as Norway Time should not become Germany Time.

Non-i18n purposes, i.e. pure time zone calculations, I consider outside of the current scope of ICU4X. TZDB rule data should be deduplicated by IANA-canonicality.

However, CLDR canonicalization should always be OK. I don't see a reason why we'd ever need to "respect the user's initial choice", time zone IDs are not user-facing strings (and even if they were, if Kyivans want their city to be called Kyiv, I think that's more important to respect). Both IDs behave the same, so we do not need to retain the original name (in the same way we don't have to retain the original casing).


I think the only functionality ICU4X needs to support is:

  • Parse an IANA ID into a CLDR zone identity (represented by a BCP-47 ID)
  • Retrieve any IANA ID that is associated with this CLDR zone identity, for interaction with IANA systems (such as IXDTF). We have two choices here:
    • The IANA canonical ID: this is not stable, and leads people to believe that they can use string equality, because string equality works if the application never sees a non-canonical ID. However, that invariant is very hard to uphold when communicating with other applications that use a different tzdb version (including communicating with a past or future version of yourself through storage). This can lead to very subtle bugs.
    • The CLDR canonical ID: this is stable, and things break for some zones if you expect to get canonical IANA IDs. This makes it easier for clients to realise during testing that there is no such thing as an eternal canonical IANA ID.
  • Retrieve the full list of CLDR zone identities in order to render a time zone picker

robertbastian avatar Jan 31 '25 23:01 robertbastian

I mostly agree with @robertbastian except that I believe normalization is still important, not only because it is required by ECMA-402. If you include normalization, @robertbastian's list is exactly what icu_timezone currently does.

sffc avatar Feb 01 '25 10:02 sffc

I also agree with @robertbastian on much of the above, although there may be a few areas where we haven't yet reached agreement. I'll try to enumerate the parts where I think we agree, and where more discussion may be needed.

I'll also try to provide context around ECMAScript's time zone ID handling and why it was spec-ed as it is. Hopefully this is useful, even if not everyone agrees with the tradeoffs we made. I suspect that the folks on this thread already know much of this context, so I apologize in advance for any redundant info.

IANA canonicalization is never OK for i18n purposes, as Norway Time should not become Germany Time.

I think we're all agreed on this point. By "IANA Canonical" I meant the result of icu::TimeZone::getIanaID(), which uses the new iana attribute to solve the long-complained-about issue of programmer-facing APIs returning obsolete IDs like Asia/Calcutta or Europe/Kiev. I apologize for being unclear about this above.

Feel free to suggest another term for this discussion. We could use "primary time zone identifier" which is what ECMA-262 calls it, but @robertbastian correctly noted in another issue (can't find it at the moment) that this term conflicts with "primary zone" in UTS-35, so I wasn't sure what to call it in context of ICU4X.

How should I efficiently store a large number of time zones in memory, assuming that canonicalization is OK? Canonicalization is OK when the zone never needs to be converted back into an IANA ID.

I think this use case needs more context. Do you want to deduplicate by CLDR identity (equal display names), or IANA identity (equal local time since 1970-01-01)?

CLDR identity. Although the TZDB maintainers have decided that coalescing time zones across country borders is OK, AFAICT most application developers disagree, because a change in one country's rules can break an app in another country. In TC39 we believe these changes were a mistake, we're happy that CLDR has decided not to follow them, and we've even reverse-engineered CLDR's approach into a deterministic algorithm in the ECMAScript spec so that non-CLDR-using implementations can still be compatible.

this is not stable, and leads people to believe that they can use string equality, because string equality works if the application never sees a non-canonical ID. However, that invariant is very hard to uphold when communicating with other applications that use a different tzdb version (including communicating with a past or future version of yourself through storage). This can lead to very subtle bugs.

I think we're all agreed about the problems from using non-stable identifiers, like what's returned by icu::TimeZone::getIanaID().

But there's no perfect solution to these problems. For ECMAScript, we considered three possible solutions:

  • Continue using obsolete IANA IDs for zones that have been renamed. This was judged to be undesirable based on many years of developer bug reports and complaints about obsolete names like Asia/Calcutta.
  • Start using CLDR short IDs instead of IANA IDs. This was judged to be impractical given that ECMAScript, like most other platforms and RFC 9557, long ago decided to use IANA IDs to represent time zones in public APIs and standards.
  • Mitigate the risk by removing canonicalization behavior from APIs, so that ECMAScript will never replace user-inputted IDs with a canonical ID. If a user provides Europe/Kiev or Europe/Kyiv to an API, then they'd get Europe/Kiev or Europe/Kyiv back, respectively.

TC39 decided the third option was the "least bad" choice for public APIs. Non-canonicalizing of inputs doesn't eliminate the problem because some APIs (like returning the system's time zone or enumerating all zones for a UI picker) still require exposing canonical IDs, but it does reduce the number of cases where changes in time zone rules trigger changes in programs' behavior. We're also hoping that a second-order effect of this change may make developers somewhat more likely to expect changes to time zone IDs which may push them to use Temporal.ZonedDateTime.prototype.equals to compare time zones for equality instead of relying on naive string comparison.

For the existing API Intl.DateTimeFormat.prototype.resolvedOptions().timeZone, it's possible that this spec-ed behavior may be rolled back if web compatibility issues end up outweighing programmer complaints like those linked above. But in new APIs like Temporal.ZonedDateTime where there's no legacy, user-inputted IDs will never be canonicalized.

Also, the current ECMAScript spec and Temporal both guarantee that if the programmer passes EuRoPe/KiEV then they'll get Europe/Kiev back, in order to support efficient storage of time zone IDs. For example, an implementer of Temporal.ZonedDateTime could store a 10-bit index into the list of ~600 IANA IDs instead of the exact string that the user provided. In theory, case normalization contradicts the "get the same ID back" requirement above. But in practice the perf benefits outweighed backwards compatibility for rare case typos.
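
As a quick check of the 10-bit figure, the number of bits needed to index `n` distinct entries can be computed as below. `index_bits` is a hypothetical helper written for this illustration.

```rust
/// Number of bits needed to index `n` distinct items (0 when n <= 1).
fn index_bits(n: u32) -> u32 {
    if n <= 1 {
        0
    } else {
        // The largest index is n - 1; count the bits it occupies.
        u32::BITS - (n - 1).leading_zeros()
    }
}
```

For roughly 600 IDs this gives 10 bits, since 2^9 = 512 is too few entries and 2^10 = 1024 is enough.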

My understanding is that ECMAScript (including Temporal) requires the following from a time zone library like ICU4X:

A library could provide dedicated APIs to do those things with perf that's good enough that no further caching or optimization is needed. Or it could let clients just have the data by providing an API that returns an array of {id, primaryId} pairs for every IANA ID, or (per the title of this issue) some other data structure that provides both the list of all valid IDs and a way to map each ID to its primary ID. Or it could provide both.
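
A sketch of what a client could do if it were simply handed such pairs. The `IdPair` type, the `PAIRS` rows, and the helper names are assumptions for illustration, not an existing ICU4X or ECMAScript API.

```rust
/// One row of the hypothetical `{id, primaryId}` table.
struct IdPair {
    id: &'static str,
    primary_id: &'static str,
}

/// Sample rows only; a real table would cover every IANA ID.
const PAIRS: &[IdPair] = &[
    IdPair { id: "Asia/Calcutta", primary_id: "Asia/Kolkata" },
    IdPair { id: "Asia/Kolkata", primary_id: "Asia/Kolkata" },
    IdPair { id: "Europe/Kiev", primary_id: "Europe/Kyiv" },
    IdPair { id: "Europe/Kyiv", primary_id: "Europe/Kyiv" },
];

/// Case-insensitive lookup; the returned `id` is in the table's letter case.
fn lookup(input: &str) -> Option<&'static IdPair> {
    PAIRS.iter().find(|p| p.id.eq_ignore_ascii_case(input))
}

/// Is this string a valid ID?
fn is_valid(input: &str) -> bool {
    lookup(input).is_some()
}

/// Do both inputs name the same zone? Unknown IDs never compare equal.
fn equals(a: &str, b: &str) -> bool {
    match (lookup(a), lookup(b)) {
        (Some(x), Some(y)) => x.primary_id == y.primary_id,
        _ => false,
    }
}
```

Here `lookup("EuRoPe/KiEV")` yields the row whose `id` is "Europe/Kiev", so the input round-trips modulo letter case, while `equals` answers the comparison question without exposing a canonicalization step.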

I don't see a reason why we'd ever need to "respect the user's initial choice", time zone IDs are not user-facing strings (and even if they were, if Kyivans want their city to be called Kyiv, I think that's more important to respect). Both IDs behave the same, so we do not need to retain the original name (in the same way we don't have to retain the original casing).

To clarify, "user" here means the programmer. I think that most developers involved in time zone programming would agree that using occasionally-renamed city names as time zone IDs wasn't an ideal choice, but it's the only system that's been adopted by the vast majority of programming platforms. So despite their faults, IANA IDs are the most appropriate choice for public APIs and standards.

Within that context, stopping canonicalizing user inputs is exactly "respect the user's initial choice" of millions of developers from India and Vietnam and Ukraine and elsewhere who see Asia/Calcutta, Asia/Saigon, and Europe/Kiev as JS engine bugs. (And in some cases as an offensive reminder of past colonial domination, but IMO "looks like an obvious bug" is justification enough.)

justingrant avatar Feb 02 '25 21:02 justingrant

Mitigate the risk by removing canonicalization behavior from APIs, so that ECMAScript will never replace user-inputted IDs with a canonical ID. If a user provides Europe/Kiev or Europe/Kyiv to an API, then they'd get Europe/Kiev or Europe/Kyiv back, respectively.

This is where we (me, ICU4X) seem to disagree. We don't consider transforming Europe/Kiev into Europe/Kyiv an issue. Can you provide some detail why ECMA thinks it is? I can see why programmers complain about obsolete IDs, but who's complaining about non-obsolete IDs?

robertbastian avatar Feb 03 '25 13:02 robertbastian

We had an extensive discussion about the meaning of "canonical" this morning, and this is the API and guarantees we are converging on:

impl IanaParser {
    /// Parses an IANA string into a CLDR time zone identity
    pub fn parse(&self, iana_id: &str) -> TimeZoneId { ... }
}

// Contains more data than `IanaParser`
impl IanaParserExtended {
    /// Parses an IANA string into a CLDR time zone identity
    pub fn parse(&self, iana_id: &str) -> TimeZoneId { ... }

    /// Serializes a CLDR time zone identity into an IANA string.
    ///
    /// This method aims to return canonical IANA IDs. However, as the TZDB may change the canonical ID
    /// for a zone across releases, this method's behavior cannot be stable across TZDB versions.
    /// In order to mitigate compatibility issues, we define the behaviour of this method in terms of _two_ TZDB versions:
    ///
    ///  * The _current_ TZDB version, defined as the TZDB version that is associated with the current CLDR version.
    ///  * The _general-availability_ (_GA_) TZDB version, defined as the most recent TZDB version that is at least two years
    ///    older than the current CLDR version.
    ///
    /// With these definitions, we can accurately state what this method returns:
    /// a normalized IANA ID that either, with descending priority:
    ///
    /// 1) is canonical in the current TZDB version, and exists in the GA TZDB version.
    /// 2) is canonical in the GA TZDB version (but not necessarily canonical in the current TZDB version)
    /// 3) is canonical in the current TZDB version (and doesn't exist in the GA TZDB version)
    ///
    /// For example, consider these TZDB zones across 2020d and 2022g:
    ///
    /// GA (2020d):
    ///
    /// ```txt
    /// Zone Europe/Kiev
    /// Zone Europe/London
    /// 
    /// Link Europe/London Europe/Belfast
    /// ```
    ///
    /// Current (2022g):
    ///
    /// ```txt
    /// Zone Europe/Kyiv
    /// Zone Europe/London
    /// Zone America/Ciudad_Juarez
    /// 
    /// Link Europe/London Europe/Belfast
    /// Link Europe/Kyiv Europe/Kiev
    /// ```
    ///
    /// For this data:
    ///
    ///  * `parser.parse("Europe/Belfast").serialize()` returns `Europe/London`.
    ///      * This is case (1): systems on the GA version already understand this ID
    ///        (it was also already canonical for them as well, but that doesn't matter much).
    ///  * `parser.parse("Europe/Kyiv").serialize()` returns `Europe/Kiev`, even though that is not currently
    ///     the canonical ID.
    ///      * This is case (2): systems on the GA version do not understand `Europe/Kyiv` at all,
    ///        but they do understand `Europe/Kiev`.
    ///  * `parser.parse("America/Ciudad_Juarez").serialize()` returns `America/Ciudad_Juarez`.
    ///      * This is case (3): systems on the GA version do not understand this ID,
    ///        but there is no ID that they would understand.
    pub fn serialize(&self, id: TimeZoneId) -> &'data str { ... }
    
    /// Performs IANA normalization.
    ///
    /// ❗ Normalization is probably not what you are looking for.
    ///
    /// Both `europe/london` and `europe/belfast` behave the same way in both time zone
    /// calculations (they are aliases in TZDB) and time zone display names (they parse
    /// into the same CLDR [`TimeZoneId`], "UK Time"). There are very few use cases that
    /// require distinguishing them, while still requiring them to be IANA-normalized.
    ///
    /// Consider using [`Self::parse_serialize`] to "canonicalize" the string instead.
    pub fn normalize(&self, iana: &str) -> Option<&'data str> { ... }

    /// A more efficient combination of [`Self::parse`] and [`Self::serialize`].
    pub fn parse_serialize(&self, id: &str) -> (TimeZoneId, &'data str) { ... }

    /// A more efficient combination of [`Self::parse`] and [`Self::normalize`].
    pub fn parse_normalize(&self, iana: &str) -> (TimeZoneId, Option<&'data str>) { ... }
}
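
The three-step priority in the `serialize` docs can be sketched as follows. This is a toy model under stated assumptions: `Tzdb`, `pick_serialized`, and `example` are hypothetical names, each release is reduced to its `Zone` and `Link` names, and the caller is assumed to already know every current IANA ID for a given CLDR zone.

```rust
use std::collections::HashSet;

/// One TZDB release, reduced to the two facts the rule needs (toy model).
struct Tzdb {
    /// IDs declared with `Zone` (canonical in this release).
    zones: HashSet<&'static str>,
    /// IDs declared with `Link` (aliases in this release).
    links: HashSet<&'static str>,
}

impl Tzdb {
    fn is_canonical(&self, id: &str) -> bool {
        self.zones.contains(id)
    }

    fn exists(&self, id: &str) -> bool {
        self.zones.contains(id) || self.links.contains(id)
    }
}

/// Picks the ID to serialize for one CLDR zone, given all of its current IANA IDs.
fn pick_serialized<'a>(ids: &[&'a str], current: &Tzdb, ga: &Tzdb) -> Option<&'a str> {
    ids.iter()
        .copied()
        // (1) canonical now, and already known to systems on the GA release
        .find(|id| current.is_canonical(id) && ga.exists(id))
        // (2) canonical in the GA release
        .or_else(|| ids.iter().copied().find(|id| ga.is_canonical(id)))
        // (3) canonical now, with no ID that GA systems would understand
        .or_else(|| ids.iter().copied().find(|id| current.is_canonical(id)))
}

/// The 2020d (GA) and 2022g (current) sample data from the doc comment.
fn example() -> (Tzdb, Tzdb) {
    let ga = Tzdb {
        zones: HashSet::from(["Europe/Kiev", "Europe/London"]),
        links: HashSet::from(["Europe/Belfast"]),
    };
    let current = Tzdb {
        zones: HashSet::from(["Europe/Kyiv", "Europe/London", "America/Ciudad_Juarez"]),
        links: HashSet::from(["Europe/Belfast", "Europe/Kiev"]),
    };
    (ga, current)
}
```

With the sample data, the London/Belfast zone resolves to `Europe/London` by rule (1), the Kyiv/Kiev zone to `Europe/Kiev` by rule (2), and `America/Ciudad_Juarez` to itself by rule (3), matching the examples in the doc comment.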

robertbastian avatar Feb 03 '25 14:02 robertbastian

Mitigate the risk by removing canonicalization behavior from APIs, so that ECMAScript will never replace user-inputted IDs with a canonical ID. If a user provides Europe/Kiev or Europe/Kyiv to an API, then they'd get Europe/Kiev or Europe/Kyiv back, respectively.

This is where we (me, ICU4X) seem to disagree. We don't consider transforming Europe/Kiev into Europe/Kyiv an issue. Can you provide some detail why ECMA thinks it is? I can see why programmers complain about obsolete IDs, but who's complaining about non-obsolete IDs?

Good question. Thanks for helping to narrow down the gap to discuss. I'll try to summarize the reasoning on the TC39 side. Let me know if this is a convincing explanation or if we should keep iterating.

The intent in retaining non-canonical IDs is to reduce the backwards compatibility issues when a city rename happens, including the long-delayed updates to Chrome and Safari to start returning Asia/Kolkata, Asia/Ho_Chi_Minh, Europe/Kyiv, etc. By reducing backwards-compatibility risk, it makes Google and Apple more comfortable with rolling out renames instead of freezing the web on decades-old names. And it makes it easier to align all ECMAScript engines on the same spec. Historically only Firefox follows the spec for timezone IDs while Chrome and Safari don't, and Shane and I and others have been working for the last few years to try to align them all.

Here's a realistic example: An automated test that creates a Temporal.ZonedDateTime instance using Europe/Kiev and then validates that all properties of that instance are as expected, including timeZoneId === 'Europe/Kiev'. If a new Chrome release breaks this test, then developers will reasonably blame Google for the break. Google is super-sensitive to these kinds of breaks, especially with CLDR data because there have been several recent examples where changes in CLDR data broke popular web apps.

If we couldn't have guaranteed that existing IDs would be stable, then Google (and likely Apple too) would never have agreed to roll out the changes to finally start surfacing Asia/Kolkata, Asia/Ho_Chi_Minh, Europe/Kyiv, etc. For this reason, the ability to round-trip IDs (modulo case normalization) is a hard requirement for ECMAScript. If it's not available in ICU4X then ECMAScript engines will have to roll their own normalization support. This is exactly what Firefox has had to do: in order to follow the ECMAScript spec, they have custom build steps that digest raw IANA files (not through ICU) in order to handle IDs, and FF only uses ICU for calculations not IDs. This seems both wasteful and risky because the two data sources can get out of sync.

OTOH, if user inputs are never canonicalized, then it reduces the number of apps (and especially automated tests) whose behavior will change when a rename happens. It doesn't eliminate the risk because the output of APIs like Intl.supportedValuesOf('timeZone') and Temporal.Now.timeZoneId() will still change, but at least we'll be able to guarantee to users that if they've stored an existing ID then its round-tripping behavior won't change.

Does this explanation seem reasonable?

I'll follow up with feedback on the interface definition above in a later comment.

justingrant avatar Feb 03 '25 22:02 justingrant

A few questions and feedback about the proposed API:

General questions:

Is there a different type that will provide an enumeration of all IANA IDs (both canonical and non-canonical)?

Is there an API that will let you test two IanaParserExtended instances for having the same canonical ID? Or should users parse both and compare the resulting strings? I ask because in ECMAScript we removed all canonicalization APIs and replaced them with an equals API. Not sure if that would be appropriate in Rust though.

Normalization

/// ❗ Normalization is probably not what you are looking for.
///
/// Both `europe/london` and `europe/belfast` behave the same way in both time zone
/// calculations (they are aliases in TZDB) and time zone display names (they parse
/// into the same CLDR [`TimeZoneId`], "UK Time"). There are very few use cases that
/// require distinguishing them, while still requiring them to be IANA-normalized.
///
/// Consider using [`Self::parse_serialize`] to "canonicalize" the string instead.
pub fn normalize(&self, iana: &str) -> Option<&'data str> { ... }

I'm very glad to see that ICU4X will offer this API, because ECMAScript needs it.

IMO the comment should be revised. Instead of a strongly-opinionated warning about using normalization, I think it'd be better to explain the two operations and what would be the reasons to use one or the other. I'll leave some ideas below for how this documentation could look, but feel free to ignore this if some other text would be better.

The first thing that the docs should do is to clarify exactly what the API actually does and clarify what "normalize" means. Something like:

Returns this identifier in the letter case used in the IANA Time Zone Database.

Next, I'd explain when you'd want to normalize. Something like this:

Normalization is used when you want to round-trip the user's input without changing anything beyond fixing letter case typos. For example if the user inputs "Europe/Belfast" or "europe/belfast" then they'll get "Europe/Belfast" back. Normalization is used for user-inputted IDs when it's important that the resulting program should not change its behavior if time zones are deprecated or renamed in a future IANA Time Zone Database version.

Then maybe note why you'd want to canonicalize instead?

Canonicalization, as provided by (API name here if we change it), not only normalizes letter case but also may change the identifier if the input identifier is deprecated. For example, "Asia/Calcutta" or "asia/calcutta" will be canonicalized to "Asia/Kolkata". Unlike normalization, canonicalization results may differ between IANA Time Zone Database versions so should only be used when stability of the resulting output is not required.

Maybe one more use case?

Note that (API name here) will always return a canonical identifier when converted to an IANA ID, so if you want to compare a caller-inputted IANA ID to the current time zone of the system, you should first parse the caller-provided ID into a TimeZoneId and compare that to the system time zone.
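A rough sketch of that comparison, assuming a hypothetical `ZoneClass` type that stands in for the parsed time zone ID and a toy lookup table. None of these names are the real ICU4X API; the point is only that two different IANA strings compare equal once both are parsed into the same equivalence class.

```rust
/// Hypothetical stand-in for a parsed time zone ID (a BCP-47 ID such as "inccu").
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ZoneClass(pub &'static str);

/// Toy table of (IANA ID, BCP-47 ID). Assumption: illustrative sample only.
const TABLE: &[(&str, &str)] = &[
    ("Asia/Calcutta", "inccu"),
    ("Asia/Kolkata", "inccu"),
    ("Europe/Kiev", "uaiev"),
    ("Europe/Kyiv", "uaiev"),
];

/// Parse an IANA string (any letter case) into its equivalence class.
pub fn parse(iana: &str) -> Option<ZoneClass> {
    TABLE
        .iter()
        .find(|(name, _)| name.eq_ignore_ascii_case(iana))
        .map(|(_, id)| ZoneClass(*id))
}

/// Compare a caller-provided ID with the system ID by class, not by string.
/// Unknown IDs never compare equal.
pub fn same_zone(caller_id: &str, system_id: &str) -> bool {
    match (parse(caller_id), parse(system_id)) {
        (Some(a), Some(b)) => a == b,
        _ => false,
    }
}
```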

And then maybe mention other related APIs, something like this?

You can use (insert API here) to get the full list of ~600 IDs available, which can be helpful if you want to more efficiently store each string as an index or reference into the full list.

Canonicalization

I would add a line somewhere in this method's docs to explain when you'd want to use normalization instead.

/// This method aims to return canonical IANA IDs. However, as the TZDB may change the canonical ID
/// for a zone across releases, this method's behavior cannot be stable across TZDB versions.
/// If you require stable output across TZDB versions, use `normalize` instead.

re: the 2-year waiting period, maybe we could explain it in a bit more plain-English wording? Maybe something like this?

Note that if a new identifier is added to the IANA Time Zone Database to replace an older name, like 2022's introduction of Europe/Kyiv to replace Europe/Kiev, the old name will continue to be returned as canonical for about two years to reduce the chance of sending unrecognized IDs to external systems that have not yet been updated to the latest IANA Time Zone Database. Thankfully these renames are rare, happening on average to only one zone every two years.

pub fn serialize(&self, id: TimeZoneId) -> &'data str { ... }

Will this API match the output of icu::TimeZone::getIanaID()?

IMO the default serialization should be the normalized string not the canonical one. If the user wants to canonicalize the input (which can be unstable over time) then I think this should be an opt-in choice of the programmer rather than a default. FWIW, anyone who's used canonicalization by default in a public API (I'm thinking of JS and Java here) seems to regret that choice today.

justingrant avatar Feb 04 '25 00:02 justingrant

Instead of a strongly-opinionated warning about using normalization, I think it'd be better to explain the two operations and what would be the reasons to use one or the other.

All of the problems that you have outlined come from weird JS design decisions, like using raw strings as time zones, or exposing the time zone identifier (normalized, but not canonicalized for some reason) on the zoned datetime. I don't see a single reason why a Rust user would need to use normalization.

IMO the default serialization should be the normalized string not the canonical one.

Again, in ICU4X we store canonicalization equivalence classes. There is no way for us to make Europe/Belfast roundtrip, and that is by design.

anyone who's used canonicalization by default in a public API (I'm thinking of JS and Java here) seems to regret that choice today.

All those APIs presumably passed them around as raw strings that users would inevitably compare?

robertbastian avatar Feb 04 '25 09:02 robertbastian

FWIW, anyone who's used canonicalization by default in a public API (I'm thinking of JS and Java here) seems to regret that choice today.

Can you provide source of that claim? It seems a very strong one and divergent from my experience... FWIW.

I see this as a fundamental tradeoff between executing the role of i18n correctly and an ambiguous argument about "people expect it." The core design principle of i18n is that outputs cannot be stable for testing or persistence, and I don't think we have strong evidence that developers truly expect round-tripping, outside of tests where such an expectation is clearly incorrect. And the reasons for which the output may not match input go beyond name - it can be availability, matching algorithm selecting alternative zone for a given input, etc. The contract is "give me your inputs and i'll give you the best possible output, and that may differ between systems, environments, and over time".

If we start catering to the idea that "in this case, we should preserve the input," we open the door for more cases where people assume i18n outputs are stable. This erodes the ability of i18n systems to function as intended. Either we uphold non-stable i18n outputs, allowing internationalization to work properly, or we shift towards stable outputs—which fundamentally undermines the purpose of i18n. I don't see a real middle ground here.

zbraniecki avatar Feb 04 '25 10:02 zbraniecki

The proposed .parse_normalize() function returns the two things an ECMAScript implementation needs: a TimeZone for formatting, based on the CLDR equivalence class, and a &'data str, an interned, normalized string from data. An implementation that wishes to round-trip the input IANA can use this API.

On the other hand, an implementation that simply needs to format time zones can use the .parse() function.
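As a sketch of the shape being described, the following uses a hypothetical `Parser` that borrows a toy table for the lifetime `'data`. The type names, the table layout, and the linear search are assumptions made for illustration; the real implementation uses its own data structures.

```rust
/// Hypothetical stand-in for the ICU4X time zone type (wraps a BCP-47 ID).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TimeZone<'data>(pub &'data str);

/// Hypothetical parser borrowing its data for `'data`.
pub struct Parser<'data> {
    /// Rows of (IANA ID in TZDB letter case, BCP-47 ID).
    entries: &'data [(&'data str, &'data str)],
}

impl<'data> Parser<'data> {
    pub fn new(entries: &'data [(&'data str, &'data str)]) -> Self {
        Self { entries }
    }

    /// Enough for formatting: only the equivalence class is returned.
    pub fn parse(&self, iana: &str) -> Option<TimeZone<'data>> {
        self.parse_normalize(iana).map(|(tz, _)| tz)
    }

    /// For round-tripping: also returns the normalized string, borrowed
    /// from the data rather than allocated.
    pub fn parse_normalize(&self, iana: &str) -> Option<(TimeZone<'data>, &'data str)> {
        self.entries
            .iter()
            .find(|(name, _)| name.eq_ignore_ascii_case(iana))
            .map(|(name, bcp47)| (TimeZone(*bcp47), *name))
    }
}
```

Because the returned string borrows from the data, a caller can keep it alongside the `TimeZone` without storing its own copy of the input.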

One use case of serialize (turning an ICU4X TimeZone back into an IANA string) would be in IXDTF serialization. Is it bad to canonicalize the time zone ID when creating an IXDTF string?

Are there any other use cases of serialize? If IXDTF is the only use case, we could load the canonical name internally and then just remove the serialize function since it seems like it might be a footgun.

sffc avatar Feb 04 '25 16:02 sffc

The proposed .parse_normalize() function returns the two things an ECMAScript implementation needs: a TimeZone for formatting, based on the CLDR equivalence class, and a &'data str, an interned, normalized string from data. An implementation that wishes to round-trip the input IANA can use this API.

This sounds great.

One use case of serialize (turning an ICU4X TimeZone back into an IANA string) would be in IXDTF serialization. Is it bad to canonicalize the time zone ID when creating an IXDTF string?

I guess it depends? If the starting point is a BCP 47 ID then the canonical ID is the only sensible choice.

If your starting point was an IANA ID, then normalization seems like a better approach for a few reasons:

  • Behavior won't change when TZDB changes
  • More consistent with JS Temporal, Java, Kotlin, and Swift, none of which canonicalize time zone inputs.

Are there any other use cases of serialize? If IXDTF is the only use case, we could load the canonical name internally and then just remove the serialize function since it seems like it might be a footgun.

I assume that implementing supportedValuesOf('timeZone') or Temporal.Now.timeZoneId() will require the ability to convert a BCP47-based ID to its canonicalized IANA equivalent. So some kind of BCP47=>IANA conversion is needed. Is there some other way to do this besides serialize?
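For illustration, here is a minimal sketch of that BCP-47 to canonical IANA direction, assuming the data is available as a map from BCP-47 ID to canonical IANA ID. The function names `serialize` and `supported_time_zones` and the map layout are hypothetical, not the ICU4X API.

```rust
use std::collections::BTreeMap;

/// Look up the canonical IANA ID for a BCP-47 ID.
/// Assumption: the map holds exactly one canonical IANA ID per BCP-47 ID.
pub fn serialize(
    bcp47_to_canonical: &BTreeMap<&str, &'static str>,
    bcp47: &str,
) -> Option<&'static str> {
    bcp47_to_canonical.get(bcp47).copied()
}

/// Build a sorted, de-duplicated list of canonical IANA IDs, roughly what a
/// `supportedValuesOf('timeZone')` implementation would need to return.
pub fn supported_time_zones(
    bcp47_to_canonical: &BTreeMap<&str, &'static str>,
) -> Vec<&'static str> {
    let mut ids: Vec<&'static str> = bcp47_to_canonical.values().copied().collect();
    // The map is ordered by BCP-47 ID, so re-sort by IANA ID.
    ids.sort_unstable();
    ids.dedup();
    ids
}
```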

I do think that a name like canonicalize might be clearer than serialize, to help callers pick the correct operation for their use case. And the docs should be clearer than "Normalization is probably not what you are looking for" to explain what each operation does and why callers might want to use one or the other.

Zibi and Robert raise good points above and I'll try to respond tomorrow or Thursday. (Sorry, busy week at work because my company is Canadian and the tariff mess has thrown us for a loop.)

justingrant avatar Feb 05 '25 05:02 justingrant

FYI, @sffc @robertbastian and I had an impromptu chat last week about the questions above and I think we made good progress. Do you guys want to post a summary?

@zbraniecki Sorry it's taken so long to respond, but here's a few notes that may be helpful. Overall I agree with you!

I see this as a fundamental tradeoff between executing the role of i18n correctly and an ambiguous argument about "people expect it." The core design principle of i18n is that outputs cannot be stable for testing or persistence

I agree.

I would suggest that shrinking the scope of what's considered "output" will help, not hurt, our shared goal to get programmers and platforms to embrace this core i18n design principle. In pre-Temporal JS, programmers must (ab)use i18n APIs for tasks like outputting ISO 8601 dates for log files, to perform time zone-aware arithmetic, or to get IANA IDs for persistence of time zone data.

By providing APIs like Temporal.ZonedDateTime.prototype.timeZoneId that don't change behavior when underlying i18n data is updated, I hope we'll make it less likely that apps will break when i18n data is revised, and thus more likely that Google and Apple will be comfortable to update i18n data.

Isn't this similar to why we're introducing the stable formatting proposal: to provide a place where programmers can get stable output so they'll be less likely to expect stability from i18n APIs that generate localized output?

I don't think we have strong evidence that developers truly expect round-tripping, outside of tests where such an expectation is clearly incorrect.

I agree. What developers *do* expect is a solution to these problems:

  1. Many developers think that ICU's current out-of-date time zone IDs are buggy at best, offensive at worst. Here's another example from this week: https://unicode-org.atlassian.net/browse/CLDR-18299.

  2. Google and Apple have not fixed (1) because they are concerned that changing canonical IDs will break apps.

  3. Because of (2), Firefox has diverged from Chrome/Safari behavior which makes it harder to build cross-browser apps.

Round-tripping of user-provided IDs isn't a panacea, it's just the least-bad idea we found to help fix these intertwined problems.

FWIW, anyone who's used canonicalization by default in a public API (I'm thinking of JS and Java here) seems to regret that choice today.

Can you provide source of that claim? It seems a very strong one and divergent from my experience... FWIW.

Given that we've removed ID canonicalization from ECMA-402 after years of TG2 and then TG1 discussions, I assume no further source is needed for JS?

For Java, https://blog.joda.org/2021/09/big-problems-at-timezone-database.html is a good starting point. Here's an excerpt.

The situation with Joda-Time is even worse. With Joda-Time, aliases (also known as Links) are actively resolved. Before the proposed change this test case passes, after the proposed change the test case fails:

assertEquals("Europe/Oslo", DateTimeZone.forID("Europe/Oslo").getID());

In other words, it will be impossible for a Joda-Time user to hold the ID "Europe/Oslo" in memory. This could be pretty catastrophic for systems that rely on timezone management, particularly ones that end up storing that data in a database.

I actually thought that Java's built-in APIs also canonicalized, but I was mistaken; instead it was Joda-Time (which Java's built-in APIs evolved from) where the champion regretted their choice to canonicalize.

Java, Kotlin, Swift, and Python all have an IANA time zone API and none of them canonicalize IDs. This isn't evidence that they actively rejected canonicalization, but the fact that they don't canonicalize is perhaps helpful as we figure out what new APIs should do.

// Swift
import Foundation
let timeZone1 = TimeZone(identifier: "Asia/Calcutta")
print("Asia/Calcutta => \(timeZone1!.identifier)")
let timeZone2 = TimeZone(identifier: "Asia/Kolkata")
print("Asia/Kolkata => \(timeZone2!.identifier)")

// Asia/Calcutta => Asia/Calcutta
// Asia/Kolkata => Asia/Kolkata

// Kotlin
import java.util.TimeZone
fun main() {
  val timeZone1 = TimeZone.getTimeZone("Asia/Calcutta")
  println("Asia/Calcutta => ${timeZone1.id}")
  val timeZone2 = TimeZone.getTimeZone("Asia/Kolkata")
  println("Asia/Kolkata => ${timeZone2.id}")
}

// Asia/Calcutta => Asia/Calcutta
// Asia/Kolkata => Asia/Kolkata

// Java
import java.time.ZoneId;
import java.util.TimeZone;

class Main {
  public static void main(String[] args) {

    ZoneId timeZoneId1 = ZoneId.of("Asia/Calcutta");
    System.out.println("Asia/Calcutta => " + timeZoneId1.toString());
    ZoneId timeZoneId2 = ZoneId.of("Asia/Kolkata");
    System.out.println("Asia/Kolkata => " + timeZoneId2.toString());

    // Asia/Calcutta => Asia/Calcutta
    // Asia/Kolkata => Asia/Kolkata

    TimeZone timeZone1 = TimeZone.getTimeZone("Asia/Calcutta");
    System.out.println("Asia/Calcutta => " + timeZone1.getID());
    TimeZone timeZone2 = TimeZone.getTimeZone("Asia/Kolkata");
    System.out.println("Asia/Kolkata => " + timeZone2.getID());

    // Asia/Calcutta => Asia/Calcutta
    // Asia/Kolkata => Asia/Kolkata
  }
}

# Python
from zoneinfo import ZoneInfo
print(ZoneInfo('Asia/Calcutta').key)
print(ZoneInfo('Asia/Kolkata').key)

# => Asia/Calcutta
# => Asia/Kolkata

justingrant avatar Feb 11 '25 14:02 justingrant

FYI, @sffc @robertbastian and I had an impromptu chat last week about the questions above and I think we made good progress. Do you guys want to post a summary?

We decided:

  • IanaParserExtended::parse returns a struct containing 3 fields: the ICU4X TimeZone, the normalized IANA string reference, and the canonicalized IANA string reference
  • IanaParserExtended also has two functions returning iterators, both returning an iterator of data strings:
    • iter_all()
    • iter_canonical()
  • Optionally add a function IanaParserExtended::stringify(&self, time_zone) which returns the canonical IANA for the ICU4X TimeZone (we have not agreed on the name for this)
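A rough sketch of the agreed shape, to make the discussion easier to follow. Everything here is hypothetical: `ToyParserExtended`, `Parsed`, the field names, and the row layout are assumptions for illustration, and the real `IanaParserExtended` may differ in names, types, and data model.

```rust
/// Hypothetical stand-in for the ICU4X time zone type (wraps a BCP-47 ID).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TimeZone<'data>(pub &'data str);

/// The three fields returned by `parse` (field names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Parsed<'data> {
    pub time_zone: TimeZone<'data>,
    pub normalized: &'data str,
    pub canonical: &'data str,
}

pub struct ToyParserExtended<'data> {
    /// Rows of (IANA ID, BCP-47 ID, canonical IANA ID).
    rows: &'data [(&'data str, &'data str, &'data str)],
}

impl<'data> ToyParserExtended<'data> {
    pub fn new(rows: &'data [(&'data str, &'data str, &'data str)]) -> Self {
        Self { rows }
    }

    /// Case-insensitive lookup returning all three pieces at once.
    pub fn parse(&self, iana: &str) -> Option<Parsed<'data>> {
        self.rows
            .iter()
            .find(|(name, _, _)| name.eq_ignore_ascii_case(iana))
            .map(|(name, bcp47, canonical)| Parsed {
                time_zone: TimeZone(*bcp47),
                normalized: *name,
                canonical: *canonical,
            })
    }

    /// Every IANA ID in the data, aliases included.
    pub fn iter_all(&self) -> impl Iterator<Item = &'data str> {
        self.rows.iter().map(|(name, _, _)| *name)
    }

    /// Only the IDs that are the canonical member of their class.
    pub fn iter_canonical(&self) -> impl Iterator<Item = &'data str> {
        self.rows
            .iter()
            .filter(|(name, _, canonical)| name == canonical)
            .map(|(name, _, _)| *name)
    }
}
```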

sffc avatar Feb 14 '25 13:02 sffc

IanaParserExtended also has two functions returning iterators, both returning an iterator of data strings:

  • iter_all()
  • iter_canonical()

Should we also have an iterator for "struct containing 3 fields: the ICU4X TimeZone, the normalized IANA string reference, and the canonicalized IANA string reference"?

Also, will those iterators be returned in a particular sort order? It might be nice to spare ECMAScript callers from having to re-sort the results in lexicographic order.

  • Optionally add a function IanaParserExtended::stringify(&self, time_zone) which returns the canonical IANA for the ICU4X TimeZone (we have not agreed on the name for this)

Does this add enough value over the planned parse method, given that both of them will return the canonical IANA?

justingrant avatar Feb 18 '25 21:02 justingrant

IanaParserExtended also has two functions returning iterators, both returning an iterator of data strings:

  • iter_all()
  • iter_canonical()

Should we also have an iterator for "struct containing 3 fields: the ICU4X TimeZone, the normalized IANA string reference, and the canonicalized IANA string reference"?

We could potentially change iter_all() to iterate over the 3-tuples.

Also, will those iterators be returned in a particular sort order? It might be nice to spare ECMAScript callers from having to re-sort the results in lexicographic order.

Depends on the data model in #6061. My current favorite model is to group them by canonical class, sorted by BCP-47 ID, and then alphabetical by IANA within the canonical class. This would result in a fairly arbitrary-looking order to clients. We could change the data model to return in alphabetical order, but it would make certain other operations less efficient (such as the proposed stringify function).
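To show why that order looks arbitrary to clients, here is a small sketch of the two orderings over toy rows. The `Row` type and both function names are hypothetical, and the actual data model in #6061 is not reproduced here.

```rust
/// One row of toy data: an IANA ID and the BCP-47 ID of its canonical class.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Row {
    pub iana: &'static str,
    pub bcp47: &'static str,
}

/// Group by canonical class (ordered by BCP-47 ID), then alphabetical by
/// IANA ID within each class.
pub fn sort_by_class(rows: &mut [Row]) {
    rows.sort_by_key(|r| (r.bcp47, r.iana));
}

/// Plain alphabetical order by IANA ID, ignoring the class.
pub fn sort_alphabetically(rows: &mut [Row]) {
    rows.sort_by_key(|r| r.iana);
}
```

With rows for `gblon`, `inccu`, and `uaiev`, the class-grouped order puts the `Europe/Belfast` and `Europe/London` pair ahead of `Asia/Calcutta`, because `gblon` sorts before `inccu`.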

  • Optionally add a function IanaParserExtended::stringify(&self, time_zone) which returns the canonical IANA for the ICU4X TimeZone (we have not agreed on the name for this)

Does this add enough value over the planned parse method, given that both of them will return the canonical IANA?

That's why it says "optionally": we agreed on the general shape and we can add it when someone needs it

sffc avatar Feb 18 '25 21:02 sffc

We could potentially change iter_all() to iterate over the 3-tuples.

This seems like a good idea. Having all three fields unlocks more use cases, and converting from the 3-tuples to a flat list is a trivial exercise for the caller. The reverse is harder: you'd need to first iter_all() and then call parse on each iteration, which seems like a much larger perf cost than going the other way around.
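A small sketch of that asymmetry, using a hypothetical `Triple` alias for the three fields. The linear search in `rebuild` is only a stand-in for a per-item `parse` call (a real parser would use its own lookup structure); it is there to show that the reverse direction needs one lookup per string.

```rust
/// Hypothetical 3-tuple: (BCP-47 ID, normalized IANA ID, canonical IANA ID).
pub type Triple = (&'static str, &'static str, &'static str);

/// Triples to flat list: a single `map`, no lookups needed.
pub fn flatten(triples: impl Iterator<Item = Triple>) -> Vec<&'static str> {
    triples.map(|(_, normalized, _)| normalized).collect()
}

/// Flat list back to triples: one lookup per name. The search over `table`
/// stands in for calling a parse function on each string.
pub fn rebuild(names: impl Iterator<Item = &'static str>, table: &[Triple]) -> Vec<Triple> {
    names
        .filter_map(|name| table.iter().copied().find(|(_, n, _)| *n == name))
        .collect()
}
```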

Depends on the data model in #6061. My current favorite model is to group them by canonical class, sorted by BCP-47 ID, and then alphabetical by IANA within the canonical class.

IIRC, ECMA-402 requires Intl.supportedValuesOf('timeZone') to be sorted lexicographically, so ideally iter_canonical() would be sorted by IANA canonical ID, not BCP-47 ID. Unless there's some other value to sorting by BCP-47 ID?

For iter_all, I don't have a strong preference for sort order. I'd probably have a weak preference for grouping first by canonical IANA, and for the second level returning the canonical ID first and then all the aliases for that ID in lexicographic order. This order might make it more efficient for callers to assemble various 2-level-deep data structures.
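The ordering described above can be sketched as follows, assuming the input is a list of (IANA ID, canonical IANA ID) pairs. The function name and input shape are hypothetical; this only illustrates the proposed output order, not how ICU4X would store or produce it.

```rust
use std::collections::BTreeMap;

/// Order IDs by canonical IANA ID, emitting each canonical ID first and then
/// its aliases in lexicographic order.
/// Input rows are (IANA ID, canonical IANA ID) pairs in any order.
pub fn group_by_canonical(rows: &[(&'static str, &'static str)]) -> Vec<&'static str> {
    // BTreeMap keeps the canonical IDs in lexicographic order.
    let mut groups: BTreeMap<&'static str, Vec<&'static str>> = BTreeMap::new();
    for &(iana, canonical) in rows {
        let aliases = groups.entry(canonical).or_default();
        if iana != canonical {
            aliases.push(iana);
        }
    }

    let mut out = Vec::new();
    for (canonical, mut aliases) in groups {
        aliases.sort_unstable();
        out.push(canonical);
        out.extend(aliases);
    }
    out
}
```

Each canonical ID marks the start of a group, so a caller can build a two-level structure in one pass if it also knows which entries are canonical.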

justingrant avatar Feb 18 '25 23:02 justingrant