icu4x
Improve performance of icu_codepointtrie_builder
Based on some preliminary testing, it appears that the slowest component of datagen when run in release mode is the interface between Rust and WASM in icu_codepointtrie_builder.
The command line we're trying to optimize is:
$ time cargo run --release --bin=icu4x-datagen --features=experimental,bin -- --uprops-root=provider/testdata/data/uprops --syntax=postcard --out=/tmp/icu4x_data --overwrite --keys=segmenter/word@1 --all-locales
On my machine, this command currently reports:
real 0m2.050s
user 0m2.050s
sys 0m0.068s
At least 75% of that time seems to be spent in the I/O between Rust and WASM.
We currently have the following abstraction:
pub enum CodePointTrieBuilderData<'a, T> {
    /// A list of values for each code point, starting from code point 0.
    ///
    /// For example, the value for U+0020 (space) should be at index 32 in the slice.
    /// Index 0 sets the value for the U+0000 (NUL).
    ValuesByCodePoint(&'a [T]),
}
Then, all of that data is piped into the STDIN of the WASM binary, one value per code point. It would likely be much more efficient if we were to add a mode that consumed data in a range-based format such as:
/// A list of code point ranges and the values of each range.
ValuesByRange(&'a [CodePointMapRange<T>]),
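As a rough illustration of how much smaller the range form can be, here is a minimal sketch that collapses a per-code-point slice into runs of equal values. `ValueRange` and `values_to_ranges` are hypothetical names made up for this example; `ValueRange` is only a local stand-in and is not meant to match the real `CodePointMapRange<T>` definition.

```rust
/// Hypothetical stand-in for a "range plus value" entry (not the real
/// `CodePointMapRange<T>`); both bounds are inclusive.
#[derive(Debug, Clone, PartialEq)]
pub struct ValueRange<T> {
    pub start: u32,
    pub end: u32,
    pub value: T,
}

/// Collapses a slice indexed by code point into runs of identical values.
/// Index `i` in `values` is treated as code point `i`.
pub fn values_to_ranges<T: Copy + PartialEq>(values: &[T]) -> Vec<ValueRange<T>> {
    let mut ranges: Vec<ValueRange<T>> = Vec::new();
    for (i, &value) in values.iter().enumerate() {
        let cp = i as u32;
        // Extend the previous run if the value is unchanged.
        if let Some(last) = ranges.last_mut() {
            if last.value == value {
                last.end = cp;
                continue;
            }
        }
        // Otherwise start a new run at this code point.
        ranges.push(ValueRange { start: cp, end: cp, value });
    }
    ranges
}
```

For property data where long stretches of code points share one value, the number of ranges should be far smaller than the number of code points, which is the saving this issue is after.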
We also need a corresponding flag in list_to_ucptrie.cpp to accept data in a similar, more compact format. For example, instead of accepting one value per line, the format could be "V N", which means "set value V for the next N code points".
Instead of adding another entry into CodePointTrieBuilderData (or perhaps in addition to that), we should consider transforming the ValuesByCodePoint slice into the more optimal form when piping it into C++.
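A minimal sketch of that transformation, assuming the values have already been converted to `u32` and that the compact wire format is one "value, space, run length" pair per line. The function name `write_run_lengths` and the exact line format are assumptions for illustration; the C++ side would need a matching parser.

```rust
use std::io::{self, Write};

/// Run-length encodes `values` (indexed by code point) and writes one
/// "<value> <count>" line per run to `out`. Writes nothing for an empty slice.
pub fn write_run_lengths<W: Write>(values: &[u32], out: &mut W) -> io::Result<()> {
    let mut iter = values.iter().copied();
    let mut current = match iter.next() {
        Some(v) => v,
        None => return Ok(()),
    };
    let mut count: u32 = 1;
    for v in iter {
        if v == current {
            count += 1;
        } else {
            // The run ended: flush it and start counting the new value.
            writeln!(out, "{} {}", current, count)?;
            current = v;
            count = 1;
        }
    }
    // Flush the final run.
    writeln!(out, "{} {}", current, count)
}
```

Because this works on the existing `ValuesByCodePoint` slice, callers would not need to change; only the bytes sent over STDIN would shrink. In practice the writer should probably be buffered (for example with `std::io::BufWriter`) so that each line does not become a separate write.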
I'd like to see us reduce the runtime to less than 0.5s overall (down from 2.0s).