icu4x

Parse ICU files into ICU4X data formats

Open nordzilla opened this issue 4 years ago • 9 comments

One of the formats from which we should be able to generate ICU4X TimeZone data is the ICU format, specifically the zoneinfo64.txt files.

We will have to write our own parser for this.

This will then correspond to the combination of --tz-src-format icu and --tz-src-path in the datagen crate.

For more context see

  • #1008

Depends on:

  • #998
  • #999

nordzilla avatar Aug 25 '21 22:08 nordzilla

I think this is more of an enhancement right now than a core feature. I'll change the label. (Same with #1002)

sffc avatar Aug 25 '21 23:08 sffc

Fuchsia currently ships ICU time zone files; @filmil says:

We currently use the ICU time zone format because everything else does, so we can ship one physical copy for all the components. Makes a difference on an embedded device with limited storage.

sffc avatar Jan 25 '23 21:01 sffc

An important question to answer is whether we want to read from zoneinfo64.res files efficiently at runtime. I think the answer is YES, because of @filmil's comment above, which is something I've heard elsewhere, too.

There are two general ways to do this that I can think of:

  1. Make a trait, like TimeZoneOffsetProvider, that can be implemented on top of either zoneinfo64.res or the ICU4X data model
  2. Make the ICU4X data model closely mirror zoneinfo64.res so that we can directly create a data struct from zerovecs that we load zero-copy from zoneinfo64.res

Some tradeoffs:

  1. Sharing the data model means that it would be easy to either read zoneinfo64.res at runtime or transform it at datagen time
  2. But, zoneinfo64's data model is very resb-specific; we could get a data model better for ICU4X by designing our own
  3. But, zoneinfo64 is widely available and this is maybe not the place where we want to reinvent the data layout completely. Consider Collator, Segmenter, and CaseMapper, which all consume ICU data layouts but make slightly different decisions about how closely they adhere to the exact layout coming from ICU
  4. If we make the trait, we still probably want a way to transform from zoneinfo64 into the ICU4X data model at datagen time.
  5. According to @leftmostcat, the zoneinfo64 data model requires more runtime processing than Erik's ICU4X data model proposed in #2913. (Can you elaborate?)

CC @robertbastian

sffc avatar Nov 14 '23 23:11 sffc

5. According to @leftmostcat, the zoneinfo64 data model requires more runtime processing than Erik's ICU4X data model proposed in [TZDB Datagen #2913](https://github.com/unicode-org/icu4x/pull/2913). (Can you elaborate?)

While Erik's PR represents historic transitions as i64, resb does not have explicit support for anything other than 32-bit (or 28-bit) integers. However, it still needs to represent transitions that may occur outside of the range of an i32, so it contains three lists of explicit transitions: transPre32, trans, and transPost32. trans is simply a series of i32, but both transPre32 and transPost32 are pairs of 32-bit integers which at runtime need to be shifted and masked into an i64. For example, uslax's initial transition to UTC-8 in the data is 1883-11-18T20:00:00Z. While the generated data can store this as -2717640000, zoneinfo64.res instead stores this as [-1, 1577327296] (see https://github.com/unicode-org/icu/blob/main/icu4c/source/data/misc/zoneinfo64.txt#L683) and ICU transforms as needed.
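To make the shift-and-mask step concrete, here is a minimal sketch in Rust. The function names are hypothetical, and it assumes each pair is laid out as consecutive high and low words, as in the [-1, 1577327296] example above. It is not ICU's or ICU4X's actual code.

```rust
/// Combine a (high, low) pair of 32-bit integers, as stored in `transPre32`
/// and `transPost32`, into one 64-bit count of seconds since the Unix epoch.
/// Hypothetical helper for illustration only.
fn combine_transition(high: i32, low: i32) -> i64 {
    // Sign-extend the high word, then move it into the upper 32 bits.
    let high = i64::from(high) << 32;
    // Reinterpret the low word as unsigned so its bits are not sign-extended.
    let low = i64::from(low as u32);
    high | low
}

/// Decode a flat list of alternating high/low words into i64 transitions.
/// Assumption: the list has an even length; a trailing odd word is ignored.
fn decode_transitions(words: &[i32]) -> Vec<i64> {
    words
        .chunks_exact(2)
        .map(|pair| combine_transition(pair[0], pair[1]))
        .collect()
}
```

With the uslax values, `combine_transition(-1, 1577327296)` yields -2717640000, which matches 1883-11-18T20:00:00Z.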

Time zone lookup by ID requires a search through the (UTF-16) list of IANA IDs, noting the index at which the desired ID occurs, and then accessing zone data at the same index in the zones list. Looking up the current transition rule by ID requires the above process, followed by conversion of the UTF-16 rule name to UTF-8 and lookup of that name in the rules map.
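A rough sketch of that two-step lookup follows. `ZoneTable`, its fields, and `rule_name_to_utf8` are hypothetical stand-ins for the parallel lists described above, not types from ICU or ICU4X, and a linear scan is used only for clarity.

```rust
/// Hypothetical in-memory view of the two parallel lists.
struct ZoneTable<Z> {
    /// IANA IDs stored as UTF-16 code units, one entry per zone.
    names: Vec<Vec<u16>>,
    /// Zone payloads; `zones[i]` belongs to `names[i]`.
    zones: Vec<Z>,
}

impl<Z> ZoneTable<Z> {
    /// Find the zone data for an IANA ID by locating its index in `names`.
    fn lookup(&self, iana_id: &str) -> Option<&Z> {
        // Encode the query once so each comparison is a plain slice compare.
        let needle: Vec<u16> = iana_id.encode_utf16().collect();
        let index = self.names.iter().position(|name| *name == needle)?;
        self.zones.get(index)
    }
}

/// Convert a UTF-16 rule name to UTF-8 so it can be used as a key in a
/// rules map. Returns `None` if the input is not valid UTF-16.
fn rule_name_to_utf8(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}
```

Both the query encoding and the rule-name conversion allocate here, which is the kind of per-lookup cost the paragraph above is pointing at.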

leftmostcat avatar Nov 15 '23 00:11 leftmostcat

I'm not too concerned about having to do some bit-shifts to get numbers in the right form. That's a constant-time, alloc-free operation. We can explore ways to model this in the data struct.

I'm a little bit concerned about the impact of zoneinfo64 using IANA IDs instead of BCP-47 IDs. If we went with the data struct approach, I think we can solve this problem though by doing some pre-processing after loading zoneinfo64. It means that the zoneinfo64 constructor would not be completely free, but at least we can make it mostly zero-copy except for the time zone index table. If we went with the trait approach, we could do the BCP-47-to-IANA transformation at runtime (we can make it cheap) and avoid allocating memory.
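For illustration, a cheap, allocation-free BCP-47-to-IANA lookup could be a binary search over a sorted static table. The table contents and the function name below are assumptions made for the sketch; a real table would be generated from CLDR rather than written by hand.

```rust
/// Illustrative pairs only, sorted by BCP-47 ID so binary search is valid.
const BCP47_TO_IANA: &[(&str, &str)] = &[
    ("frpar", "Europe/Paris"),
    ("jptyo", "Asia/Tokyo"),
    ("uslax", "America/Los_Angeles"),
];

/// Map a BCP-47 time zone ID to an IANA ID without allocating.
/// Hypothetical helper; not an ICU4X API.
fn iana_for_bcp47(bcp47: &str) -> Option<&'static str> {
    BCP47_TO_IANA
        .binary_search_by(|entry| entry.0.cmp(bcp47))
        .ok()
        .map(|index| BCP47_TO_IANA[index].1)
}
```

The returned IANA ID could then be fed into the zoneinfo64 name search, so the only per-call cost is the two lookups.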

Is this the entirety of the characterization of the diff between Erik's model and zoneinfo64's model?

sffc avatar Nov 15 '23 18:11 sffc

It's not the entirety, but those are the places I know of where we'd need data processing before we could use anything.

Both essentially boil down to lists of time zone transitions at fixed dates and then a specification of a recurring rule to take over after the fixed dates. What we mostly want to do is figure out which transition applies to a certain date and either the transition before or transition after. I think we can develop a trait that covers those operations on top of either model.
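As a sketch of what such a trait might look like, assuming each backing model can expose its fixed transitions as a sorted slice: all names here (`Transition`, `TransitionSource`, `FixedList`) are hypothetical, and the recurring rule that takes over after the fixed dates is deliberately left out.

```rust
/// One fixed transition: the instant it takes effect and the resulting offset.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Transition {
    at_epoch_seconds: i64,
    utc_offset_seconds: i32,
}

/// Hypothetical trait that either data model could implement.
trait TransitionSource {
    /// Fixed transitions, sorted by ascending `at_epoch_seconds`.
    fn transitions(&self) -> &[Transition];

    /// The transition in effect at time `t`, if any precedes or equals it.
    fn active_at(&self, t: i64) -> Option<Transition> {
        let list = self.transitions();
        // Number of transitions that happened at or before `t`.
        let count = list.partition_point(|tr| tr.at_epoch_seconds <= t);
        count.checked_sub(1).map(|index| list[index])
    }

    /// The first transition strictly after time `t`, if any.
    fn next_after(&self, t: i64) -> Option<Transition> {
        let list = self.transitions();
        let count = list.partition_point(|tr| tr.at_epoch_seconds <= t);
        list.get(count).copied()
    }
}

/// Simplest possible implementor: an owned, already-decoded list.
struct FixedList(Vec<Transition>);

impl TransitionSource for FixedList {
    fn transitions(&self) -> &[Transition] {
        &self.0
    }
}
```

A zoneinfo64-backed implementor would need to decode its three transition lists into this shape first, or override the default methods to search them in place.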

leftmostcat avatar Nov 16 '23 00:11 leftmostcat

Just read up on this conversation.

I agree with the sentiment of doing some amount of pre-processing at (or before) construction time.

That aligns with ICU4X's typical strategy of expensive constructors/data processing and cheap runtime.

I agree with Shane that it would be important to map to the BCP47 identifiers.

The pair of i32 situation sounds like a strange choice to me on the surface, but I'm sure there's a reason they chose to do that.

Shifting is cheap, but if we're going to be pre-processing anyway, we could probably just store i64s in the final type, right?

nordzilla avatar Nov 18 '23 16:11 nordzilla

A couple thoughts:

  • I would consider using platform-provided resb files a power user feature, similar to how non-compiled data is a power user feature these days. Hence this can go through unstable constructors.
  • Using platform resb is always trading performance for space: it requires loading and deserialization, which compiled data does not. IIUC these will be files loaded at runtime, so I don't think an allocation is the end of the world if we're already reading a file from disk (we can then deallocate the whole file and keep a more tailored representation in memory).
  • The resb files are not endian-agnostic, whereas our representation will be.
  • We should isolate the resb conversion costs to a special ResbBackedTimezoneDataProvider or something, which can use caching, but returns nice ICU4X data structs and doesn't restrict the data struct design.
  • I'd rather use a provider than a special trait. Our data provider infrastructure is tried and tested.
  • Datagen would use this same provider to precompute the data structs.

robertbastian avatar Nov 21 '23 10:11 robertbastian

Chatted with @leftmostcat. The direction we plan to take in the immediate term is to plumb zoneinfo64 into an experimental module in icu_timezone via the trait so that we have something running end-to-end. The trait is useful because clients including Mozilla would like to be able to implement it to read from a variety of other time zone sources, like the operating system. However, I would also like to see zoneinfo64 able to be used as a data source in datagen, which means we still likely will want the actual data struct mapping as well. In terms of timeline, @leftmostcat thinks we can get the trait and zoneinfo64 machinery landed in Q4. Then, in Q1, we can refactor things as needed, which will be easier to do with a baseline implementation that works end to end.

sffc avatar Nov 28 '23 23:11 sffc