icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

Make a BlobSchema version with Index32

Open sffc opened this issue 1 year ago • 3 comments

--all-keys --locales recommended does not fit in Index16. However, Index16 may still save space when fewer locales are being bundled, such as for language packs. Especially given that it may be disruptive to change the Blob2 layout, I think we should add Blob3 which is Blob2 but with Index32 on the locales table.

Thanks @zbraniecki for discovering this

sffc avatar Dec 13 '23 16:12 sffc

Discussed:

  • @robertbastian: should people even be making giant blobs? slice them!
  • @manishearth: monolithic data bundle is often convenient from a versioning and distribution standpont
  • @zbraniecki: it's what firefox does
  • @robertbastian: Not opposed to blob3, don't want to change blob2 at this moment, might have users
  • (As far as we know the only user is @sffc)
  • @manish and @robertbastian: what we can do is have this be a purely internal change: BlobSchemaV2 gets a BlobSchemaV2Bigger sibling, and BlobSchema gets an additional variant. The variant to produce is selected at blob export time when running with --format blob2 (where it currently unwraps and panics). There is no new datagen blob format. This retains compatibility as well as the lower data sizes for blobs that are not full of locales.

Agreed on internal schema addition with no externally controllable format: @manishearth @robertbastian @zbraniecki

@sffc, wdyt?

Manishearth avatar Jan 11 '24 17:01 Manishearth

Seems reasonable!

sffc avatar Jan 16 '24 22:01 sffc

IIUC the proposal under the hood is similar to blob3 but instead of it being user-selectable, it is automatically selected from the blob2 CLI option.

Note that this means blob2 packages made in ICU4X 1.5 won't be readable in ICU4X 1.4, but I don't really care since we have blob which is readable all the way back.

sffc avatar Jan 16 '24 22:01 sffc

Going to try and knock this out.

Hopefully the following patch is sufficient to repro:

diff --git a/tools/testdata-scripts/src/bin/make-testdata-legacy.rs b/tools/testdata-scripts/src/bin/make-testdata-legacy.rs
index 46685cacc..93a908feb 100644
--- a/tools/testdata-scripts/src/bin/make-testdata-legacy.rs
+++ b/tools/testdata-scripts/src/bin/make-testdata-legacy.rs
@@ -45,7 +45,7 @@ fn main() {
     };
 
     icu_datagen::datagen(
-        Some(LOCALES),
+        None,
         &icu_datagen::all_keys_with_experimental()
             .into_iter()
             .chain([icu_provider::hello_world::HelloWorldV1Marker::KEY])

Manishearth avatar Apr 30 '24 23:04 Manishearth

That didn't work. Hm.

Manishearth avatar Apr 30 '24 23:04 Manishearth

$ RUST_BACKTRACE=1 cargo run --bin icu4x-datagen -- --all-keys --locales recommended --format blob2 --out test.postcard 

works

Manishearth avatar Apr 30 '24 23:04 Manishearth

@sffc do you have a suggestion for a good way to structure a test for this?

We could have it generate some fake key for thousands of fake locales, or something similar. Not sure what the thresholds are.

Manishearth avatar May 01 '24 00:05 Manishearth

The thing that runs out of space is the

pub locales: &'data VarZeroSlice<[u8]>,

I think you should hit the threshold if you export a key for all locales of the form abc-DE for all values of a, b, c, D, and E. So, you can make an IterableDataProvider that returns all of those locales, and use it to export HelloWorldV1 or something.

sffc avatar May 01 '24 00:05 sffc

Yeah that's basically what I was wondering for "generate thousands of locales". abc-DE seems easy.

Manishearth avatar May 01 '24 00:05 Manishearth

https://github.com/unicode-org/icu4x/pull/4856

Manishearth avatar May 01 '24 00:05 Manishearth

Maybe try ab-DE first since that might hit the threshold and takes one order of magnitude off the slowness

sffc avatar May 01 '24 00:05 sffc

Another way to guarantee hitting the threshold is to make a single locale with an aux key that is at least 2^16 bytes long, but that might be brittle when we rework the aux key infra in 2.0

sffc avatar May 01 '24 00:05 sffc