Baked data is bigger than postcard data

Open sffc opened this issue 1 year ago • 6 comments
I computed fingerprints.csv based on both baked_size and postcard_size.

Baked is equal in size or bigger than postcard for every data marker. A selection of the biggest offenders by overall size or percentage:

| Marker Path | Postcard Size | Baked Size | Growth |
|---|---|---|---|
| list/or@2 | 3780B | 31520B | 734% |
| plurals/ranges@1 | 64B | 524B | 719% |
| percent/essentials@1 | 415B | 2945B | 610% |
| decimal/symbols@1 | 2040B | 8905B | 337% |
| relativetime/narrow/second@1 | 7678B | 26032B | 239% |
| units/displaynames@1 | 963404B | 2062721B | 114% |
| currency/extended@1 | 1908533B | 3134831B | 64.3% |
| displaynames/languages@1 | 1521724B | 1557050B | 2.32% |
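
For reference, the Growth column is consistent with the relative increase from postcard size to baked size. A minimal sketch of that arithmetic, where the function name `growth_percent` is made up for illustration and is not taken from the script that produced fingerprints.csv:

```rust
/// Percentage growth from the postcard size to the baked size.
/// Hypothetical helper, used only to illustrate the Growth column.
fn growth_percent(postcard_bytes: u64, baked_bytes: u64) -> f64 {
    (baked_bytes as f64 - postcard_bytes as f64) / postcard_bytes as f64 * 100.0
}

fn main() {
    // (marker, postcard size, baked size), copied from the table above.
    let rows: [(&str, u64, u64); 3] = [
        ("list/or@2", 3780, 31520),
        ("plurals/ranges@1", 64, 524),
        ("displaynames/languages@1", 1521724, 1557050),
    ];
    for (marker, postcard, baked) in rows {
        println!("{marker}: {:.2}%", growth_percent(postcard, baked));
    }
}
```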

The good news is that many of these keys will be improved under #5230 or #5379.

Should we do anything?

@robertbastian @Manishearth

sffc avatar Aug 21 '24 22:08 sffc

For list the explanation is that the data struct contains 10 `Cow`s (4 patterns of 1 `Cow` each, and two conditions of 3 `Cow`s each), but usually only encodes tiny strings (", ", " and ", etc.). Unit and And data don't show up because the Spanish/Hebrew regexes equalise things between baked and postcard. So it's a similar problem to the decimal formatter.
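
To illustrate the effect, here is a rough sketch with a stand-in struct. `ListPatterns`, its field layout, the sample strings, and the one-length-byte-per-string encoding are all assumptions for illustration; they are not the real ICU4X data struct or the exact postcard format. On a typical 64-bit target each `Cow<str>` takes about 24 bytes of struct space regardless of how short the text is.

```rust
use std::borrow::Cow;
use std::mem::size_of;

/// Hypothetical stand-in: 4 patterns of 1 Cow each, plus 2 conditions
/// of 3 Cows each, for 10 Cows in total.
struct ListPatterns {
    patterns: [Cow<'static, str>; 4],
    conditions: [[Cow<'static, str>; 3]; 2],
}

/// Rough size of a length-prefixed encoding: one length byte plus the
/// text bytes per string (assumes every string is under 128 bytes).
fn length_prefixed_size(p: &ListPatterns) -> usize {
    p.patterns
        .iter()
        .chain(p.conditions.iter().flatten())
        .map(|s| 1 + s.len())
        .sum()
}

/// Made-up sample data with tiny separators and empty conditions.
fn sample() -> ListPatterns {
    let empty = || Cow::Borrowed("");
    ListPatterns {
        patterns: [", ", ", ", " or ", " or "].map(Cow::Borrowed),
        conditions: [
            [empty(), empty(), empty()],
            [empty(), empty(), empty()],
        ],
    }
}

fn main() {
    let data = sample();
    println!("struct size (stack):  {} bytes", size_of::<ListPatterns>());
    println!("length-prefixed size: {} bytes", length_prefixed_size(&data));
}
```

The struct size is fixed by the number of `Cow`s, so when the text is this short the fixed part dominates.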

robertbastian avatar Aug 22 '24 15:08 robertbastian

"baked size" here is the size of the .rs file, yes?

Manishearth avatar Aug 22 '24 15:08 Manishearth

no, an estimate for in-memory size, ignoring &'static deduplication

robertbastian avatar Aug 22 '24 16:08 robertbastian

in-memory size means actual PSS cost? Can we also calculate on-disk binary size impact?

zbraniecki avatar Aug 22 '24 17:08 zbraniecki

If you tell me what PSS is I might be able to answer this

robertbastian avatar Aug 22 '24 18:08 robertbastian

list/and@1 also showed up, but I didn't take the time to copy it into the table; I tried to include a representative cross-section in the OP.

"baked size" refers to the in-memory size based on the bake_size, which is core::mem::size_of plus borrows_size.

These numbers are roughly reflective of what happens when I compile ICU4X with the compiled_data feature versus when I build Postcard data with icu4x-datagen. compiled_data produces a larger binary than no-default-features with postcard.
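
As a sketch of the size_of plus borrows_size estimate described above: the trait `BorrowsSize` and the function `bake_size` below are hypothetical illustrations and should not be read as the actual databake API. Like the estimate in this issue, the sketch ignores `&'static` deduplication.

```rust
use std::mem::size_of;

/// Hypothetical trait for illustration only.
trait BorrowsSize {
    /// Bytes of static data this value points to, beyond its own size.
    fn borrows_size(&self) -> usize;
}

impl BorrowsSize for &'static str {
    fn borrows_size(&self) -> usize {
        self.len()
    }
}

impl<T: BorrowsSize + 'static> BorrowsSize for &'static [T] {
    fn borrows_size(&self) -> usize {
        // The slice elements themselves, plus whatever they borrow in turn.
        self.len() * size_of::<T>()
            + self.iter().map(|x| x.borrows_size()).sum::<usize>()
    }
}

/// Estimated in-memory size: the value itself plus everything it borrows.
fn bake_size<T: BorrowsSize>(value: &T) -> usize {
    size_of::<T>() + value.borrows_size()
}

fn main() {
    let separators: &'static [&'static str] = &[", ", " or "];
    println!("estimated size: {} bytes", bake_size(&separators));
}
```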

sffc avatar Aug 22 '24 20:08 sffc

Note: I have some measurements in https://github.com/unicode-org/icu4x/issues/1317#issuecomment-2330283151 that illustrate the two types of binary size consumed by the baked data: the strings themselves (borrows_size) and the struct stacks (size_of).

sffc avatar Dec 17 '24 02:12 sffc