Experiment with ICU4X

Open jasonwilliams opened this issue 4 years ago • 5 comments

https://github.com/unicode-org/icu4x will be useful for implementing i18n-sensitive operations and future proposals like Temporal.

jasonwilliams, Mar 18 '21 15:03

Ok, details I found while investigating ICU4X:

  • It requires a DataProvider for every operation, and one is not trivial to obtain.
  • A DataProvider can be generated with the icu4x-datagen crate, but it is not published on crates.io, so we would need to import the repo as a submodule if we want to automate it.
  • We can use a StaticDataProvider to embed a DataProvider in the binary with include_bytes!.
  • We can also obtain the data from http://unicode.org/Public/cldr/, but we would need to write a parser that converts it into a BlobSchema so that it is easily embeddable as a StaticDataProvider.
  • We can use a build.rs script to avoid having to do these things by hand.
  • The collator we require is a WIP: https://github.com/unicode-org/icu4x/issues/971
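
To make the include_bytes! idea above concrete, here is a minimal sketch of the embedding pattern only. It does not use the ICU4X API: `EmbeddedBlob`, `LOCALE_DATA`, `embedded_provider`, the NUL-separated layout, and the file name in the comment are all hypothetical stand-ins, and a byte-string literal replaces include_bytes! so the snippet compiles without an external data file.

```rust
/// Hypothetical stand-in for a provider that borrows data baked into the binary.
/// This is NOT ICU4X's `StaticDataProvider`; it only illustrates the pattern.
pub struct EmbeddedBlob {
    bytes: &'static [u8],
}

// In a real project this would be something along the lines of
// `include_bytes!(concat!(env!("OUT_DIR"), "/icu4x_data.blob"))` (assumed file
// name), with the file generated ahead of time by the datagen tool.
// A literal is used here so the sketch needs no external file.
static LOCALE_DATA: &[u8] = b"en\0es\0ja\0";

impl EmbeddedBlob {
    /// Wraps a byte slice that lives for the whole program.
    pub const fn new(bytes: &'static [u8]) -> Self {
        Self { bytes }
    }

    /// Size of the embedded data in bytes.
    pub fn size_in_bytes(&self) -> usize {
        self.bytes.len()
    }

    /// Toy lookup: the blob is assumed to hold NUL-terminated locale tags.
    pub fn contains_locale(&self, tag: &str) -> bool {
        self.bytes
            .split(|b| *b == 0)
            .any(|entry| !entry.is_empty() && entry == tag.as_bytes())
    }
}

/// Builds the provider from the data compiled into the binary.
pub fn embedded_provider() -> EmbeddedBlob {
    EmbeddedBlob::new(LOCALE_DATA)
}
```

Because the data is a `&'static [u8]`, nothing is read from disk at runtime; the trade-off is a larger binary and a rebuild whenever the data changes.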

jedel1043, Oct 04 '21 02:10

CC @sffc

jasonwilliams, Oct 04 '21 12:10

Hi there! I just saw this today.

Here are instructions on how to generate the ICU4X data file:

https://crates.io/crates/icu_datagen

Specific replies inline:

  • A DataProvider can be generated with the icu4x-datagen crate, but it is not published on crates.io, so we would need to import the repo as a submodule if we want to automate it.

It is on crates.io; see link above.

  • We can use a StaticDataProvider to embed a DataProvider in the binary with include_bytes!.

Correct. This is the easiest way to include data.

  • We can also obtain the data from http://unicode.org/Public/cldr/, but we would need to write a parser that converts it into a BlobSchema so that it is easily embeddable as a StaticDataProvider.

You should use icu4x-datagen to generate the data. You need the CLDR data available at build time.

  • We can use a build.rs script to avoid having to do these things by hand.

We have an issue to track this: https://github.com/unicode-org/icu4x/issues/1188

@hsivonen has been working on the collator and can share more about the timeline for this feature.

sffc, Feb 15 '22 03:02

There's now an ICU4X PR that shows the status of the collator.

hsivonen, Mar 28 '22 12:03

There's now an ICU4X PR that shows the status of the collator.

Nice! I also saw that you're about to merge a PR with a datagen API for build.rs scripts (https://github.com/unicode-org/icu4x/pull/1819). I'll try to experiment with your branches in the meantime, and hopefully we'll be able to integrate ICU4X in our codebase on your next release!

jedel1043, Apr 28 '22 04:04