
Improve performance of icu_codepointtrie_builder

Open sffc opened this issue 3 years ago • 0 comments

Based on some preliminary testing, it appears that the slowest component of datagen when run in release mode is the interface between Rust and WASM in icu_codepointtrie_builder.

The command line we're trying to optimize is:

$ time cargo run --release --bin=icu4x-datagen --features=experimental,bin -- --uprops-root=provider/testdata/data/uprops --syntax=postcard --out=/tmp/icu4x_data --overwrite --keys=segmenter/word@1 --all-locales

On my machine, this command currently reports:

real	0m2.050s
user	0m2.050s
sys	0m0.068s

At least 75% of that time seems to be spent in the I/O between Rust and WASM.

We currently have the following abstraction:

pub enum CodePointTrieBuilderData<'a, T> {
    /// A list of values for each code point, starting from code point 0.
    ///
    /// For example, the value for U+0020 (space) should be at index 32 in the slice.
    /// Index 0 sets the value for the U+0000 (NUL).
    ValuesByCodePoint(&'a [T]),
}

Then, all of that data is piped into the STDIN of the WASM binary, one value per line. It would likely be much more efficient to add a mode that consumes data in a range-based format such as:

    /// A list of code point ranges and the values of each range.
    ValuesByRange(&'a [CodePointMapRange<T>]),
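To make the idea concrete, here is a minimal sketch of collapsing a per-code-point slice into ranges. `ValueRange` is a hypothetical local stand-in for `CodePointMapRange<T>` (its field names are assumptions), and `to_ranges` is an illustrative helper, not an existing API in the crate.

```rust
use std::ops::RangeInclusive;

/// Hypothetical stand-in for `CodePointMapRange<T>`; field names are assumptions.
#[derive(Debug, PartialEq)]
pub struct ValueRange<T> {
    pub range: RangeInclusive<u32>,
    pub value: T,
}

/// Collapse a per-code-point slice (index = code point) into maximal runs
/// of consecutive code points that share the same value.
pub fn to_ranges<T: Copy + PartialEq>(values: &[T]) -> Vec<ValueRange<T>> {
    let mut out: Vec<ValueRange<T>> = Vec::new();
    for (index, &value) in values.iter().enumerate() {
        let cp = index as u32;
        if let Some(last) = out.last_mut() {
            if last.value == value {
                // Same value as the previous code point: extend the current run.
                let start = *last.range.start();
                last.range = start..=cp;
                continue;
            }
        }
        // Value changed (or this is the first entry): start a new run.
        out.push(ValueRange { range: cp..=cp, value });
    }
    out
}
```

For example, `to_ranges(&[7u8, 7, 7, 2, 2, 9])` yields three ranges: `0..=2` with value 7, `3..=4` with value 2, and `5..=5` with value 9.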

We also need a corresponding flag in list_to_ucptrie.cpp to accept data in a similar, more compact format. For example, instead of accepting one value per line, each line could be "V N", meaning "set value V for the next N code points".

Instead of adding another entry to CodePointTrieBuilderData (or perhaps in addition to that), we should consider transforming the ValuesByCodePoint slice into the more compact form at the point where we pipe it into C++.
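A rough sketch of that transformation on the Rust side, assuming the "value, then run length" line format proposed above. The function name `write_runs` is hypothetical, and the C++ side does not accept this format today; this only illustrates how little data would need to cross the Rust/WASM boundary.

```rust
use std::fmt::Display;
use std::io::{self, Write};

/// Write `values` (index = code point) as run-length-encoded lines,
/// each of the form "<value> <count>".
/// The line format is the one proposed in this issue, not an existing one.
pub fn write_runs<T, W>(values: &[T], mut out: W) -> io::Result<()>
where
    T: PartialEq + Display,
    W: Write,
{
    let mut i = 0;
    while i < values.len() {
        // Count how many consecutive entries share the value at index `i`.
        let mut run = 1;
        while i + run < values.len() && values[i + run] == values[i] {
            run += 1;
        }
        writeln!(out, "{} {}", values[i], run)?;
        i += run;
    }
    Ok(())
}
```

Writing `[0, 0, 0, 5, 5, 0]` this way produces three lines (`0 3`, `5 2`, `0 1`) instead of six. Any `Write` implementation works as the sink, so the same function could in principle feed a `Vec<u8>` buffer or the WASM module's STDIN pipe.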

I'd like to see us reduce the runtime to less than 0.5s overall (down from 2.0s).

sffc, May 13 '22 17:05