
Decide on data file versioning policy

Open sffc opened this issue 2 years ago • 3 comments

One of the value propositions driving ICU4X is the ability to share one data file across multiple ICU4X instances.

Design doc: https://docs.google.com/document/d/1yg_2l5FFo0aAuNi4jpgcIhIYjHqJyUoJWtMduyQ0vR8/edit#

Seeking feedback from:

  • [x] @nciric
  • [x] @Manishearth
  • [x] @markusicu

Leaving comments in the doc is fine. Thanks!

sffc avatar Jan 06 '22 05:01 sffc

Working in the doc (leaving comments etc.). LGTM on overall approach.

nciric avatar Jan 06 '22 21:01 nciric

Good goals, plausible ideas.

I just want to say that it is sometimes very desirable to dramatically redesign a data structure in order to support a smaller size, better performance, or new requirements. And when that happens, it can be very expensive (in amount of code, and in time spent converting while loading) to compute a new data structure from an old one.

For example, my first implementation of Unicode Normalization supported what most Unicode implementers know as the standard forms: NFC, NFD, NFKC, NFKD. I used a code point trie with a 32-bit value width, plus some additional arrays, mostly indexed by bit fields from the trie value. Years later, there was a call for additional normalization "forms", defined explicitly (NFKC_Casefold) or implicitly (UTS 46) by Unicode, or with truly custom data (Google); then HarfBuzz wanted access to more normalization "properties"; then I revised the encoding to do more pre-computation in the builder, for simpler/faster runtime code; and then I improved the code point trie structure (e.g., making it more UTF-8-friendly). Some of these were possible as minor-version updates that old code could ignore, but several of these changes were dramatic.
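(Roughly, in a Rust sketch: a per-code-point value whose bit fields index into side arrays. The field layout, the names, and the flat array standing in for the trie are illustrative assumptions, not ICU's actual encoding.)

```rust
// Illustrative sketch only: a 32-bit trie value whose bit fields index
// into side arrays, in the spirit of the normalization data described
// above. Field layout and names are assumptions, not ICU's encoding.
struct NormData {
    // A real implementation uses a code point trie; a flat array of
    // per-code-point values stands in for it here.
    trie_values: Vec<u32>,
    // Side array of decomposition data, indexed by a trie-value bit field.
    decompositions: Vec<u16>,
}

impl NormData {
    fn lookup(&self, cp: char) -> Option<u16> {
        let v = self.trie_values[cp as usize];
        if (v & 0x1) == 0 {
            return None; // flag bit: no decomposition for this code point
        }
        let index = ((v >> 1) & 0xFFFF) as usize; // bit field into side array
        Some(self.decompositions[index])
    }
}
```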

For the dramatic changes here, old-data-new-code would require taking apart a 30-60kB structured blob, duplicating significant parts of the new-data builder into the runtime code, and building the new data on the fly. There would be a fair bit of new code for this conversion, code that would not otherwise be necessary.

If old-data-new-code had been necessary, I might have opted for retaining the old runtime code rather than writing new code to convert old data to new data. That might have yielded a smaller overall code size and much less development work. However, I would have had to retain two or more versions of the runtime code, fix important bugs in each version, and decide at runtime which version to use based on the available data.
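For concreteness, that version-dispatch option might be shaped like the following Rust sketch; the type and function names are illustrative assumptions, not actual ICU or ICU4X API:

```rust
// Illustrative sketch of retaining multiple runtime implementations and
// dispatching on the version of the loaded data. Names are assumptions,
// not actual ICU or ICU4X APIs.
struct V1Data { /* old layout */ }
struct V2Data { /* redesigned layout */ }

enum NormalizerData {
    V1(V1Data),
    V2(V2Data),
}

fn normalize(data: &NormalizerData, input: &str) -> String {
    match data {
        // Old runtime code retained so old data blobs keep working;
        // important bug fixes must land in both branches.
        NormalizerData::V1(d) => normalize_with_v1(d, input),
        NormalizerData::V2(d) => normalize_with_v2(d, input),
    }
}

fn normalize_with_v1(_data: &V1Data, input: &str) -> String {
    input.to_owned() // placeholder for the old implementation
}

fn normalize_with_v2(_data: &V2Data, input: &str) -> String {
    input.to_owned() // placeholder for the new implementation
}
```

The old-data-new-code option would instead run a single conversion step at load time (say, a hypothetical upgrade_v1_to_v2), at the cost of carrying significant parts of the builder in the runtime code.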

An even more dramatic data structure redesign happened for collation, and the files there are larger (and the structure more complex) than for normalization. And the ICU collation data is not yet using the latest code point trie structure.

I suggest you add to the design doc some discussion of when it is expensive to "map the old data struct to the new data struct", and consider the option of retaining multiple parallel runtime implementations, with pros & cons.

markusicu avatar Jan 11 '22 19:01 markusicu

Action: @sffc to document this.

sffc avatar Jul 28 '22 17:07 sffc