ICU-21812 move Age/Block/Script to a separate trie
Move Age/Block/Script to a separate trie because:
- The properties vectors in uprops.icu/uprops.h are full, and making them longer is expensive.
- We are about to overflow the bits for the Age property.
- These properties seem correlated fairly well with each other but not with other properties.
Also:
- Move some bit-set-setting code from the emojipropsbuilder to toolutil so that it can be shared.
- Some code cleanup, such as distinguishing end vs. pvecEnd in the corepropsbuilder.
- Making a uprops.icu major-version change allows us to move the Script+Script_Extensions bits back together into a contiguous bit set.
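To illustrate why a contiguous bit set is easier to work with than a split one, here is a minimal sketch. The bit positions and widths are hypothetical (a 12-bit field stored either as 8 low bits plus 4 high bits elsewhere in the word, or as one contiguous run); they are not the actual uprops.h layout.

```python
# Hypothetical layouts for illustration only; not the real uprops.h masks.
LOW_MASK = 0xFF          # low 8 bits of the field, stored in bits 0..7
HIGH_SHIFT = 20          # high 4 bits of the field, stored in bits 20..23
HIGH_MASK = 0xF
CONTIGUOUS_MASK = 0xFFF  # the same 12-bit field stored in bits 0..11

def pack_split(value):
    """Store a 12-bit value as two separate bit groups in a 32-bit word."""
    return ((value >> 8) << HIGH_SHIFT) | (value & LOW_MASK)

def get_split(word):
    """Reassemble the field: two masks, two shifts, one OR."""
    return (((word >> HIGH_SHIFT) & HIGH_MASK) << 8) | (word & LOW_MASK)

def get_contiguous(word):
    """With a contiguous field, a single mask is enough."""
    return word & CONTIGUOUS_MASK

value = 0xABC
assert get_split(pack_split(value)) == value
assert get_contiguous(value) == value
```

Both layouts hold the same information; the contiguous one just needs fewer operations to read and is harder to get wrong.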
Problem: Pulling these three properties out makes uprops.icu significantly larger. Based on Unicode 15.1 data:
Original:
trie size in bytes: 46328
size in bytes of additional props trie:65584
number of additional props vectors: 2499
number of 32-bit words per vector: 3
number of 16-bit scriptExtensions: 298
data size: 142560
(The data size includes 64 bytes for the file header and the indexes array.)
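As a sanity check, the reported data size can be reproduced from the individual figures. This is a minimal sketch that assumes the parts are simply added together with no padding; the function name is made up for illustration.

```python
def uprops_data_size(main_trie, props_trie, num_vectors, words_per_vector,
                     num_scx, header_and_indexes=64):
    """Sum the parts of uprops.icu, in bytes (assumes no padding)."""
    vectors_bytes = num_vectors * words_per_vector * 4  # 32-bit words
    scx_bytes = num_scx * 2                             # 16-bit units
    return (header_and_indexes + main_trie + props_trie
            + vectors_bytes + scx_bytes)

# Figures from the "Original" listing.
original = uprops_data_size(46328, 65584, 2499, 3, 298)
print(original)  # 142560
```

The bytes of any additional trie (ABS, Age, Block, Script) are added on top of this sum for the variants that have one.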
Pulling Age/Block/Script into a separate trie (fast, 32-bit code point trie):
trie size in bytes: 46328
size in bytes of additional props trie:54760
number of additional props vectors: 673
number of 32-bit words per vector: 3
size in bytes of ABS trie: 86748
number of 16-bit scriptExtensions: 298
data size: 196572
Pulling them out into separate tries:
- Age: fast 8-bit
- Block: small 16-bit indexed by code point/16
- Script/scx: fast 16-bit
trie size in bytes: 46328
size in bytes of additional props trie:54760
number of additional props vectors: 673
number of 32-bit words per vector: 3
size in bytes of Age trie: 21596
size in bytes of Block trie: 7600
size in bytes of Script trie: 34420
number of 16-bit scriptExtensions: 298
data size: 173440
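The Block trie can be indexed by code point/16 because Unicode blocks start and end on 16-code-point boundaries, so all 16 code points in an aligned group share one Block value. A minimal sketch of the idea follows; it uses a flat array rather than a real code point trie, and the three ranges and their numeric IDs are only a small illustrative subset.

```python
from array import array

# (start, end, block id) for a few blocks; IDs are illustrative.
BLOCK_RANGES = [
    (0x0000, 0x007F, 1),  # Basic Latin
    (0x0080, 0x00FF, 2),  # Latin-1 Supplement
    (0x0100, 0x017F, 3),  # Latin Extended-A
]
NO_BLOCK = 0

def build_block_table(ranges, limit):
    """One 16-bit entry per group of 16 code points below limit."""
    table = array("H", [NO_BLOCK] * (limit >> 4))
    for start, end, block_id in ranges:
        # Block boundaries must be multiples of 16 for this to work.
        assert start % 16 == 0 and (end + 1) % 16 == 0
        for i in range(start >> 4, (end + 1) >> 4):
            table[i] = block_id
    return table

def get_block(table, cp):
    """Look up the block by code point/16; out of range means no block."""
    i = cp >> 4
    return table[i] if i < len(table) else NO_BLOCK

table = build_block_table(BLOCK_RANGES, 0x0200)
print(get_block(table, 0x41), get_block(table, 0xE9), get_block(table, 0x180))
# 1 2 0
```

Indexing by code point/16 shrinks the index space by a factor of 16, which is consistent with the Block trie being by far the smallest of the three.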
Same, but small tries for Age & Script:
trie size in bytes: 46328
size in bytes of additional props trie:54760
number of additional props vectors: 673
number of 32-bit words per vector: 3
size in bytes of Age trie: 17128
size in bytes of Block trie: 7600
size in bytes of Script trie: 25984
number of 16-bit scriptExtensions: 298
data size: 160536
Pulling only Block out into a separate trie:
trie size in bytes: 46328
size in bytes of additional props trie:62752
number of additional props vectors: 2026
number of 32-bit words per vector: 3
size in bytes of Block trie: 7600
number of 16-bit scriptExtensions: 298
data size: 141652
Only this last version makes uprops.icu smaller than the original, and only slightly (141652 vs. 142560 bytes).
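For a side-by-side view, the differences from the original data size can be computed directly from the reported totals. This is a small sketch; the variant labels are shorthand for the listings above.

```python
ORIGINAL = 142560  # bytes, Unicode 15.1 data

# Reported "data size" for each variant, in bytes.
VARIANTS = {
    "one ABS trie (fast, 32-bit)": 196572,
    "separate tries, fast Age & Script": 173440,
    "separate tries, small Age & Script": 160536,
    "Block only": 141652,
}

def size_deltas(original, variants):
    """Return the byte difference of each variant from the original."""
    return {name: size - original for name, size in variants.items()}

deltas = size_deltas(ORIGINAL, VARIANTS)
for name, delta in deltas.items():
    print(f"{name}: {delta:+d} bytes ({100 * delta / ORIGINAL:+.1f}%)")
```

The first three variants add between roughly 18 KB and 54 KB, while the Block-only variant saves 908 bytes.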
FYI @echeran
Checklist
- [x] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-21812
- [x] Required: The PR title must be prefixed with a JIRA Issue number.
- [x] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
- [x] Required: Each commit message must be prefixed with a JIRA Issue number.
- [x] Issue accepted (done by Technical Committee after discussion)
- [ ] Tests included, if applicable
- [ ] API docs and/or User Guide docs changed or added, if applicable