icu4x

char16trie converter

Open makotokato opened this issue 1 year ago • 1 comments

The segmenter now uses char16trie for the dictionary segmenter. Dictionaries for other East Asian languages can be removed or replaced with LSTM, but Chinese and Japanese still use the dictionary.

Currently, the data is generated by ICU4C's tools, and the binary output of those tools is then converted to a TOML file. So I think it would be better to add a tool that generates char16trie data directly from ICU4C's dictionary text files.
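As a rough illustration of the first stage of such a tool, here is a minimal sketch that reads dictionary text into sorted UTF-16 entries, which is the kind of input a trie builder would consume. It does not serialize a char16trie. The names `DictEntry`, `parse_dictionary`, and `DEFAULT_VALUE` are hypothetical, and the assumed line format (a word, optionally followed by whitespace and an integer value, with `#` starting a comment) should be checked against the actual ICU4C dictionary files.

```rust
/// Value assigned to words that carry no explicit value (an assumption of this sketch).
pub const DEFAULT_VALUE: i32 = 0;

/// One dictionary entry: the word as UTF-16 code units plus its integer value.
/// Hypothetical type for illustration only; not part of ICU4X.
#[derive(Debug, PartialEq)]
pub struct DictEntry {
    pub key: Vec<u16>,
    pub value: i32,
}

/// Parses dictionary text into entries sorted by UTF-16 code units.
/// Assumed format: `word[<whitespace>value]`, with `#` starting a comment.
pub fn parse_dictionary(text: &str) -> Vec<DictEntry> {
    // Drop a leading byte order mark if the file has one.
    let text = text.trim_start_matches('\u{feff}');
    let mut entries: Vec<DictEntry> = Vec::new();

    for line in text.lines() {
        // Cut off everything from the first '#' onward.
        let line = match line.find('#') {
            Some(pos) => &line[..pos],
            None => line,
        };
        let mut fields = line.split_whitespace();
        let word = match fields.next() {
            Some(w) => w,
            None => continue, // blank or comment-only line
        };
        // Fall back to DEFAULT_VALUE when the value is missing or not an integer.
        let value = fields
            .next()
            .and_then(|v| v.parse::<i32>().ok())
            .unwrap_or(DEFAULT_VALUE);
        entries.push(DictEntry {
            key: word.encode_utf16().collect(),
            value,
        });
    }

    // A trie builder typically wants sorted, unique keys. The sort is stable,
    // so for duplicate words the first occurrence in the file is kept.
    entries.sort_by(|a, b| a.key.cmp(&b.key));
    entries.dedup_by(|a, b| a.key == b.key);
    entries
}
```

The sorted entries would then be handed to whichever builder produces the trie bytes, whether that is a native implementation or the ICU4C builder.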

makotokato avatar Sep 06 '22 01:09 makotokato

Consider doing what we did for the CodePointTrieBuilder: rather than writing the code ourselves, we compile the ICU4C builder code into a WASM file and ship that in our repo.

sffc avatar Sep 06 '22 04:09 sffc

@makotokato Does this block any other issues? Can you set an assignee (or "help wanted") and a milestone (or "backlog")?

sffc avatar Oct 17 '22 21:10 sffc

Not a blocker.

makotokato avatar Oct 18 '22 23:10 makotokato