icu4x
icu4x copied to clipboard
Make a BlobSchema version with Index32
--all-keys --locales recommended does not fit in Index16. However, Index16 may still save space when fewer locales are being bundled, such as for language packs. Especially given that it may be disruptive to change the Blob2 layout, I think we should add Blob3 which is Blob2 but with Index32 on the locales table.
Thanks @zbraniecki for discovering this
Discussed:
- @robertbastian: should people even be making giant blobs? slice them!
- @manishearth: monolithic data bundle is often convenient from a versioning and distribution standpont
- @zbraniecki: it's what firefox does
- @robertbastian: Not opposed to blob3, don't want to change blob2 at this moment, might have users
- (As far as we know the only user is @sffc)
- @manish and @robertbastian: what we can do is have this be a purely internal change: BlobSchemaV2 gets a BlobSchemaV2Bigger sibling, and BlobSchema gets an additional variant. The variant to produce is selected at blob export time when running with
--format blob2(where it currently unwraps and panics). There is no new datagen blob format. This retains compatibility as well as the lower data sizes for blobs that are not full of locales.
Agreed on internal schema addition with no externally controllable format: @manishearth @robertbastian @zbraniecki
@sffc, wdyt?
Seems reasonable!
IIUC the proposal under the hood is similar to blob3 but instead of it being user-selectable, it is automatically selected from the blob2 CLI option.
Note that this means blob2 packages made in ICU4X 1.5 won't be readable in ICU4X 1.4, but I don't really care since we have blob which is readable all the way back.
Going to try and knock this out.
Hopefully the following patch is sufficient to repro:
diff --git a/tools/testdata-scripts/src/bin/make-testdata-legacy.rs b/tools/testdata-scripts/src/bin/make-testdata-legacy.rs
index 46685cacc..93a908feb 100644
--- a/tools/testdata-scripts/src/bin/make-testdata-legacy.rs
+++ b/tools/testdata-scripts/src/bin/make-testdata-legacy.rs
@@ -45,7 +45,7 @@ fn main() {
};
icu_datagen::datagen(
- Some(LOCALES),
+ None,
&icu_datagen::all_keys_with_experimental()
.into_iter()
.chain([icu_provider::hello_world::HelloWorldV1Marker::KEY])
That didn't work. Hm.
$ RUST_BACKTRACE=1 cargo run --bin icu4x-datagen -- --all-keys --locales recommended --format blob2 --out test.postcard
works
@sffc do you have a suggestion for a good way to structure a test for this?
We could have it generate some fake key for thousands of fake locales, or something similar. Not sure what the thresholds are.
The thing that runs out of space is the
pub locales: &'data VarZeroSlice<[u8]>,
I think you should hit the threshold if you export a key for all locales of the form abc-DE for all values of a, b, c, D, and E. So, you can make an IterableDataProvider that returns all of those locales, and use it to export HelloWorldV1 or something.
Yeah that's basically what I was wondering for "generate thousands of locales". abc-DE seems easy.
https://github.com/unicode-org/icu4x/pull/4856
Maybe try ab-DE first since that might hit the threshold and takes one order of magnitude off the slowness
Another way to guarantee hitting the threshold is to make a single locale with an aux key that is at least 2^16 bytes long, but that might be brittle when we rework the aux key infra in 2.0