
ICU-21812 move Age/Block/Script to a separate trie

Open markusicu opened this issue 1 year ago • 0 comments

Move Age/Block/Script to a separate trie, because:

  • The properties vectors in uprops.icu/uprops.h are full, and making them longer is expensive.
  • We are about to overflow the bits for the Age property.
  • These properties seem correlated fairly well with each other but not with other properties.
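To make the Age overflow concrete, here is a minimal sketch. It assumes the Age value is packed as a 4-bit major version plus a 4-bit minor version; that layout is an assumption for illustration and is not stated in this description. The helper names `pack_age` and `unpack_age` are hypothetical. Under that assumption, Unicode 15.1 still fits, but major version 16 does not.

```python
# Assumed layout (not confirmed by this PR description):
# high nibble = major version, low nibble = minor version.
AGE_BITS_PER_FIELD = 4
AGE_FIELD_MAX = (1 << AGE_BITS_PER_FIELD) - 1  # 15


def pack_age(major: int, minor: int) -> int:
    """Pack a Unicode version into 8 bits, or raise if a field overflows."""
    if not (0 <= major <= AGE_FIELD_MAX and 0 <= minor <= AGE_FIELD_MAX):
        raise OverflowError(
            f"version {major}.{minor} does not fit in "
            f"{AGE_BITS_PER_FIELD}+{AGE_BITS_PER_FIELD} bits"
        )
    return (major << AGE_BITS_PER_FIELD) | minor


def unpack_age(packed: int) -> tuple[int, int]:
    """Inverse of pack_age: return (major, minor)."""
    return packed >> AGE_BITS_PER_FIELD, packed & AGE_FIELD_MAX


print(hex(pack_age(15, 1)))  # Unicode 15.1 uses the largest major value that fits
```

A separate 8-bit Age trie is free to use a different encoding, since its values no longer share a 32-bit word with other properties.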

Also:

  • Move some bit-set-setting code from the emojipropsbuilder to toolutil so that it can be shared.
  • Some code cleanup, such as distinguishing end vs. pvecEnd in the corepropsbuilder.
  • Making a uprops.icu major-version change allows us to move the Script+Script_Extensions bits back together into a contiguous bit set.

Problem: Pulling these three properties out makes uprops.icu significantly larger. Based on Unicode 15.1 data:

Original:

trie size in bytes:                    46328
size in bytes of additional props trie:65584
number of additional props vectors:     2499
number of 32-bit words per vector:         3
number of 16-bit scriptExtensions:       298
data size:                            142560

(The data size includes 64 bytes for the file header and the indexes array.)
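As a sanity check on the numbers above, the reported data size can be reproduced from its parts. This is a sketch under an assumed layout: the 64 header/index bytes, the two tries, the properties vectors as tightly packed 32-bit words, and the scriptExtensions as 16-bit units, with nothing else in the file. The function name `data_size` is hypothetical.

```python
HEADER_AND_INDEXES_BYTES = 64  # stated in the description


def data_size(trie_bytes: int, props_trie_bytes: int, num_vectors: int,
              words_per_vector: int, num_scx_units: int) -> int:
    """Sum the parts of uprops.icu under the assumed layout described above."""
    vectors_bytes = num_vectors * words_per_vector * 4  # 32-bit words
    scx_bytes = num_scx_units * 2                       # 16-bit units
    return (HEADER_AND_INDEXES_BYTES + trie_bytes + props_trie_bytes
            + vectors_bytes + scx_bytes)


original = data_size(trie_bytes=46328, props_trie_bytes=65584,
                     num_vectors=2499, words_per_vector=3, num_scx_units=298)
print(original)  # 142560, matching the reported data size
```

The same accounting applies to the variants below once the sizes of the extra tries are added to the sum.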

Pulling Age/Block/Script into a separate trie (fast, 32-bit code point trie):

trie size in bytes:                    46328
size in bytes of additional props trie:54760
number of additional props vectors:      673
number of 32-bit words per vector:         3
size in bytes of ABS trie:             86748
number of 16-bit scriptExtensions:       298
data size:                            196572

Pulling them out into separate tries:

  • Age: fast 8-bit
  • Block: small 16-bit indexed by code point/16
  • Script/scx: fast 16-bit
trie size in bytes:                    46328
size in bytes of additional props trie:54760
number of additional props vectors:      673
number of 32-bit words per vector:         3
size in bytes of Age trie:             21596
size in bytes of Block trie:            7600
size in bytes of Script trie:          34420
number of 16-bit scriptExtensions:       298
data size:                            173440
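The Block trie in this variant can be indexed by code point/16 because Unicode block boundaries always fall on multiples of 16, so all 16 code points that share an index belong to the same block. The following sketch illustrates the indexing idea with a flat list rather than a real code point trie; the helper names are hypothetical and only three blocks are included.

```python
SHIFT = 4  # index = code point >> 4, i.e. code point / 16

# (start, end, name) for the first three Unicode blocks.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
]


def build_block_table(blocks):
    """Build a list with one entry per run of 16 code points."""
    limit = max(end for _, end, _ in blocks) + 1
    table = [None] * (limit >> SHIFT)
    for start, end, name in blocks:
        # The scheme only works if every block is aligned to 16 code points.
        assert start % 16 == 0 and (end + 1) % 16 == 0
        for i in range(start >> SHIFT, (end + 1) >> SHIFT):
            table[i] = name
    return table


def block_of(table, cp: int):
    """Look up the block of a code point; None if outside the table."""
    i = cp >> SHIFT
    return table[i] if i < len(table) else None


table = build_block_table(BLOCKS)
print(block_of(table, 0x00E9))  # Latin-1 Supplement
```

Indexing by code point/16 shrinks the index space by a factor of 16, which is consistent with the Block trie being much smaller than the Age and Script tries.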

Same, but small tries for Age & Script:

trie size in bytes:                    46328
size in bytes of additional props trie:54760
number of additional props vectors:      673
number of 32-bit words per vector:         3
size in bytes of Age trie:             17128
size in bytes of Block trie:            7600
size in bytes of Script trie:          25984
number of 16-bit scriptExtensions:       298
data size:                            160536

Pulling only Block out into a separate trie:

trie size in bytes:                    46328
size in bytes of additional props trie:62752
number of additional props vectors:     2026
number of 32-bit words per vector:         3
size in bytes of Block trie:            7600
number of 16-bit scriptExtensions:       298
data size:                            141652

Only this last version actually makes uprops.icu slightly smaller than the original (141652 vs. 142560 bytes, a saving of 908 bytes).
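For a side-by-side view, this short sketch computes the size change of each variant relative to the original, using only the data sizes reported above. The variant labels and the `deltas` helper are made up for this illustration.

```python
ORIGINAL = 142560  # reported data size of the original uprops.icu

# Reported data sizes of the four variants, in the order they appear above.
VARIANTS = {
    "Age/Block/Script in one fast 32-bit trie": 196572,
    "separate tries, fast Age and Script": 173440,
    "separate tries, small Age and Script": 160536,
    "Block only": 141652,
}


def deltas(original: int, variants: dict) -> dict:
    """Return the size difference in bytes of each variant vs. the original."""
    return {name: size - original for name, size in variants.items()}


for name, delta in deltas(ORIGINAL, VARIANTS).items():
    print(f"{name}: {delta:+d} bytes ({delta / ORIGINAL:+.1%})")
```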

FYI @echeran

Checklist
  • [x] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-21812
  • [x] Required: The PR title must be prefixed with a JIRA Issue number.
  • [x] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • [x] Required: Each commit message must be prefixed with a JIRA Issue number.
  • [x] Issue accepted (done by Technical Committee after discussion)
  • [ ] Tests included, if applicable
  • [ ] API docs and/or User Guide docs changed or added, if applicable

markusicu avatar Mar 26 '24 21:03 markusicu