icu4x
Baked data is bigger than postcard data
I computed fingerprints.csv based on both baked_size and postcard_size.
Baked is equal in size or bigger than postcard for every data marker. A selection of the biggest offenders by overall size or percentage:
| Marker Path | Postcard Size | Baked Size | Growth |
|---|---|---|---|
| list/or@2 | 3780B | 31520B | 734% |
| plurals/ranges@1 | 64B | 524B | 719% |
| percent/essentials@1 | 415B | 2945B | 610% |
| decimal/symbols@1 | 2040B | 8905B | 337% |
| relativetime/narrow/second@1 | 7678B | 26032B | 239% |
| units/displaynames@1 | 963404B | 2062721B | 114% |
| currency/extended@1 | 1908533B | 3134831B | 64.3% |
| displaynames/languages@1 | 1521724B | 1557050B | 2.32% |
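For reference, the Growth column is consistent with relative growth of baked over postcard size. A minimal sketch of that arithmetic, where the function name `growth_percent` is a hypothetical helper chosen for illustration and not something in the ICU4X tooling:

```rust
/// Relative growth of the baked size over the postcard size, in percent.
/// `growth_percent` is a hypothetical helper for this illustration only.
fn growth_percent(postcard_bytes: u64, baked_bytes: u64) -> f64 {
    // (baked - postcard) / postcard, expressed as a percentage.
    (baked_bytes as f64 - postcard_bytes as f64) / postcard_bytes as f64 * 100.0
}
```

For example, `growth_percent(3780, 31520)` rounds to the 734% shown for list/or@2.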
The good news is that many of these keys will be improved under #5230 or #5379.
Should we do anything?
@robertbastian @Manishearth
For list, the explanation is that the data struct contains 10 Cows (4 patterns of 1 Cow each, and two conditions of 3 Cows each), but usually only encodes tiny texts (", ", " and ", etc.). Unit and And data doesn't show up because the Spanish/Hebrew regexes equalise things between baked and postcard. So it's a similar problem to the decimal formatter.
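To illustrate the effect, here is a rough sketch using a hypothetical struct with the same shape (10 `Cow<str>` fields). `ListLikeData`, its field names, and the sample strings are assumptions made up for this example, not the real ICU4X list data struct. Each `Cow<str>` occupies a fixed amount of stack space (typically 24 bytes on a 64-bit target) regardless of how short the text is, so the struct overhead dwarfs the text:

```rust
use std::borrow::Cow;

/// Hypothetical stand-in with the shape described above:
/// 4 patterns of 1 Cow each, plus two conditions of 3 Cows each.
struct ListLikeData<'a> {
    patterns: [Cow<'a, str>; 4],
    conditions: [[Cow<'a, str>; 3]; 2],
}

/// Bytes of text actually referenced by the struct.
fn text_bytes(data: &ListLikeData<'_>) -> usize {
    data.patterns
        .iter()
        .chain(data.conditions.iter().flatten())
        .map(|cow| cow.len())
        .sum()
}

/// Made-up sample values: tiny separators and empty conditions.
fn sample() -> ListLikeData<'static> {
    ListLikeData {
        patterns: [
            Cow::Borrowed(", "),
            Cow::Borrowed(", "),
            Cow::Borrowed(" or "),
            Cow::Borrowed(" or "),
        ],
        conditions: [
            [Cow::Borrowed(""), Cow::Borrowed(""), Cow::Borrowed("")],
            [Cow::Borrowed(""), Cow::Borrowed(""), Cow::Borrowed("")],
        ],
    }
}
```

Comparing `std::mem::size_of::<ListLikeData<'static>>()` with `text_bytes(&sample())` shows the gap: the struct costs ten Cows' worth of stack space while the sample text is only 12 bytes.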
"baked size" here is the size of the .rs file, yes?
no, an estimate for in-memory size, ignoring &'static deduplication
in-memory size means actual PSS cost? Can we also calculate on-disk binary size impact?
If you tell me what PSS is I might be able to answer this
list/and@1 also showed up, but I didn't take the time to copy it into the table; I tried to include a representative cross-section in the OP.
"baked size" refers to the in-memory size based on the bake_size, which is core::mem::size_of plus borrows_size.
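As a rough sketch of that definition: the trait `BorrowsSize` and the function `baked_size` below are hypothetical names written for this comment, not the databake API, and like the estimate itself they ignore `&'static` deduplication.

```rust
use std::mem::size_of;

/// Hypothetical trait for this sketch; not the databake API.
trait BorrowsSize {
    /// Bytes reachable through borrows, excluding the value's own stack size.
    fn borrows_size(&self) -> usize;
}

impl BorrowsSize for &str {
    fn borrows_size(&self) -> usize {
        // The string bytes live behind the reference.
        self.len()
    }
}

impl<T: BorrowsSize> BorrowsSize for &[T] {
    fn borrows_size(&self) -> usize {
        // The elements live behind the borrow, so count their stack size
        // as well as whatever they borrow in turn.
        self.len() * size_of::<T>()
            + self.iter().map(|item| item.borrows_size()).sum::<usize>()
    }
}

/// Estimated in-memory size: the value's stack size plus everything it borrows.
/// Shared `&'static` data is counted once per use (no deduplication).
fn baked_size<T: BorrowsSize>(value: &T) -> usize {
    size_of::<T>() + value.borrows_size()
}
```

With this sketch, a `&str` pointing at `"and"` counts as `size_of::<&str>() + 3`, which is why short strings are dominated by the `size_of` term.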
These numbers are roughly reflective of what happens when I compile ICU4X with the compiled_data feature versus when I build Postcard data with icu4x-datagen. compiled_data produces a larger binary than no-default-features with postcard.
Note: I have some measurements in https://github.com/unicode-org/icu4x/issues/1317#issuecomment-2330283151 that illustrate the two types of binary size consumed by the baked data: the strings themselves (borrows_size) and the struct stacks (size_of).