nlprule

Improve loading speed (of regex?) - cli usecase

Open drahnr opened this issue 3 years ago • 13 comments

The biggest issue using this library currently is the fact that on each startup a lot of regular expressions are compiled.

If regex (or whatever crate is being used) implemented serialization of the compiled regex, this could be avoided entirely and shifted to build time / done once up front. The current blocker is the regex crate itself, which does not implement serde serialization.

In the meantime, parallel compilation of the parsed regexps would probably speed up the initial loading by a factor of $cores.

drahnr avatar Apr 01 '21 08:04 drahnr

The biggest issue using this library currently is the fact that on each startup a lot of regular expressions are compiled.

That is quite subjective :) But I agree that loading times should be improved / tracked more closely.

If regex (or whatever crate is being used) implemented serialization of the compiled regex, this could be avoided entirely

This would be really nice. I believe regex-automata already implements this kind of serialization but fancy-regex builds on regex, not regex-automata.

In the meantime, parallel compilation of the parsed regexps would probably speed up the initial loading by a factor of $cores.

Regular expressions are already compiled lazily, and regex execution happens partially in parallel, so I don't think there's much potential for improvement there:

https://github.com/bminixhofer/nlprule/blob/afc2d3ef43de900c500891b5db8d6f26fdaa15fb/nlprule/src/utils/regex.rs#L70-L80
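(For context, lazy compilation here means roughly the following - a minimal sketch assuming a once_cell-based wrapper, not the actual code behind the link:)

use once_cell::sync::OnceCell;
use regex::Regex;

// Sketch only: the pattern is stored as a string and compiled on first use,
// so startup does not pay the compilation cost for patterns that never run.
pub struct LazyRegex {
    pattern: String,
    compiled: OnceCell<Regex>,
}

impl LazyRegex {
    pub fn new(pattern: String) -> Self {
        LazyRegex { pattern, compiled: OnceCell::new() }
    }

    pub fn is_match(&self, text: &str) -> bool {
        self.compiled
            .get_or_init(|| Regex::new(&self.pattern).expect("pattern was validated at build time"))
            .is_match(text)
    }
}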

Are you sure that regex compilation is actually the bottleneck? I think the deserialization of the tagger from an FST could be the issue. Happens here:

https://github.com/bminixhofer/nlprule/blob/afc2d3ef43de900c500891b5db8d6f26fdaa15fb/nlprule/src/tokenizer/tag.rs#L102

So the first step here would be to add a criterion benchmark for loading the tokenizer and rules, and then check where the speed issues are.
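A minimal sketch of such a bench could look like this (the binary file names are placeholders, and Tokenizer::new / Rules::new taking a path is assumed from the public API):

use criterion::{criterion_group, criterion_main, Criterion};
use nlprule::{Rules, Tokenizer};

// Measure only the load path; the placeholder paths would point at the
// distributed binaries.
fn load_benchmark(c: &mut Criterion) {
    c.bench_function("load tokenizer", |b| {
        b.iter(|| Tokenizer::new("en_tokenizer.bin").unwrap())
    });
    c.bench_function("load rules", |b| {
        b.iter(|| Rules::new("en_rules.bin").unwrap())
    });
}

criterion_group!(benches, load_benchmark);
criterion_main!(benches);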

I want to focus on modularization for now, but I'm open to contributions, and if there's some identifiable bottleneck which isn't too hard to fix we can do that now. For example, the word_store in the tagger is currently a map from String -> u32; I think that could be replaced with a set of Strings, computing hashes at runtime instead of using a fixed u32 id. That would certainly improve memory usage, but I'm not sure if there'd be any impact on loading time.
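To make the word_store idea concrete, a purely hypothetical sketch (names invented here; whether hash-based ids work for the rest of the code is exactly the open question):

use std::collections::HashSet;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical: instead of a String -> u32 map, keep only a set of strings
// and derive an id by hashing on demand.
struct WordStore {
    words: HashSet<String>,
}

impl WordStore {
    fn id(word: &str) -> u64 {
        let mut hasher = DefaultHasher::new();
        word.hash(&mut hasher);
        hasher.finish()
    }

    fn contains(&self, word: &str) -> bool {
        self.words.contains(word)
    }
}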

bminixhofer avatar Apr 01 '21 09:04 bminixhofer

The biggest issue using this library currently is the fact that on each startup a lot of regular expressions are compiled.

That is quite subjective :) But I agree that loading times should be improved / tracked more closely.

Note the constraint: cli - and there the load time is very significant for the end user.

If regex (or whatever crate is being used) implemented serialization of the compiled regex, this could be avoided entirely

This would be really nice. I believe regex-automata already implements this kind of serialization but fancy-regex builds on regex, not regex-automata.

I've had a quick peek into the source of regex, and that task is actually non-trivial, unfortunately.

In the meantime, parallel compilation of the parsed regexps would probably speed up the initial loading by a factor of $cores.

Regular expressions are already compiled lazily, and regex execution happens partially in parallel, so I don't think there's much potential for improvement there:

https://github.com/bminixhofer/nlprule/blob/afc2d3ef43de900c500891b5db8d6f26fdaa15fb/nlprule/src/utils/regex.rs#L70-L80

Are you sure that regex compilation is actually the bottleneck? I think the deserialization of the tagger from an FST could be the issue. Happens here:

https://github.com/bminixhofer/nlprule/blob/afc2d3ef43de900c500891b5db8d6f26fdaa15fb/nlprule/src/tokenizer/tag.rs#L102

So the first step here would be to add a criterion benchmark for loading the tokenizer and rules, and then check where the speed issues are.

Yeah, I'll add some criterion benches sometime next week.

I want to focus on modularization for now, but I'm open to contributions, and if there's some identifiable bottleneck which isn't too hard to fix we can do that now. For example, the word_store in the tagger is currently a map from String -> u32; I think that could be replaced with a set of Strings, computing hashes at runtime instead of using a fixed u32 id. That would certainly improve memory usage, but I'm not sure if there'd be any impact on loading time.

Just making sure I am not investing time + effort for the :wastebasket: :wink:

drahnr avatar Apr 01 '21 09:04 drahnr

Note the constraint: cli - and there the load time is very significant for the end user.

Right, for CLI that is definitely true.

Yeah, I'll add some criterion benches sometime next week.

That would be great! A very simple benchmark + cargo-flamegraph should be enough to track down the biggest contributors; at least that's how I did it for the core.

bminixhofer avatar Apr 01 '21 11:04 bminixhofer

I took the lazy path; below is the current flamegraph of cargo-spellcheck with 0.5.3. You are correct, most time is spent in deserialization, which is quite unfortunate.

flamegraph.svg

Not quite sure how to approach this. I'd like to just do all the hard work in build.rs, serialize to something that is zero-copy, and deserialize that in the actual application, e.g. https://lib.rs/crates/rkyv or https://capnproto.org/ - both of which are clearly out of scope for this crate.

So my question here is how much can be exposed for deserialization purposes to make that easier to swap out as needed. I'll think a bit about this at some point this week.

drahnr avatar Apr 07 '21 08:04 drahnr

Thanks for the graph! It seems almost all of the work is indeed done in the Tagger deserialization. In general, deserializing from an FST is necessary since otherwise the size of the binary blows up, but it makes sense that it's slow.

Here's the problematic code:

https://github.com/bminixhofer/nlprule/blob/12bd52fd7cd5b95b94205625ebb274dbd66a6d8b/nlprule/src/tokenizer/tag.rs#L102-L153

I can give parallelizing this to some degree a go; it's not trivial here, but it should help.

(Also, I just noticed the into_iter + map + collect in L111 above can easily be avoided, so that should already measurably improve things, but it seems the bulk of the work is done in the loop starting at L119.)

bminixhofer avatar Apr 07 '21 09:04 bminixhofer

tl;dr need a good test case to improve this for real.


My measurements really were just a few runs of cargo-spellcheck on the same file with different impls, comparing the total runtime percentage of the Tagger::from impl. So take it with a grain of salt.


The above code snippet looks slow due to the additional allocations; in reality the difference is insignificant, and it is actually slightly faster as is (probably due to some cache locality) compared to:

        // Build the word store directly from the FST, inserting entries as we
        // stream them instead of collecting into intermediate Vecs first.
        let word_store_fst = Map::new(data.word_store_fst).unwrap();
        let mut word_store = FastBiMap::<String, WordIdInt>::with_capacity_and_hashers(
            word_store_fst.len(),
            Default::default(),
            Default::default(),
        );
        let mut stream = word_store_fst.into_stream();
        while let Some((key, value)) = stream.next() {
            // Keys in the FST are UTF-8 word strings; the value is the word id.
            if let Ok(key) = std::str::from_utf8(key) {
                word_store.insert(key.to_owned(), WordIdInt(value as u32));
            }
        }

The fnv hasher is supposed to be pretty fast for small keys, which is really all we have here, but it appears hashbrown beats fnv, maybe due to the SIMD lookup that is advertised.
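For reference, the comparison was roughly between a std HashMap with an fnv hasher and hashbrown's default map - a sketch, assuming the fnv and hashbrown crates:

use fnv::FnvBuildHasher;
use std::collections::HashMap;

// Same map contents, two hashers.
type FnvMap = HashMap<String, u32, FnvBuildHasher>;

fn build_fnv(words: &[(String, u32)]) -> FnvMap {
    let mut map = FnvMap::with_capacity_and_hasher(words.len(), FnvBuildHasher::default());
    map.extend(words.iter().cloned());
    map
}

fn build_hashbrown(words: &[(String, u32)]) -> hashbrown::HashMap<String, u32> {
    // hashbrown's default hasher plus its SIMD probing came out slightly ahead in my runs.
    let mut map = hashbrown::HashMap::with_capacity(words.len());
    map.extend(words.iter().cloned());
    map
}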

drahnr avatar Apr 07 '21 15:04 drahnr

Some more digging revealed that the data really is very sparse. We do 2 allocations for the structure and the intermediate combinator key, where everything holds only 1 element in 99% of all cases. So at this point I am tempted to say that using a Vec for the inner 2 layers should already give a pretty good speedup. Another approach would be to eliminate the intermediate layer,

                .entry(inflection_id)

which is also barely used, except in get_raw AFAIU. Using a Vec<(WordIdInt, PosIdInt)> should yield significantly better perf given our data sizes of mostly 1 and at most 3 items.
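Roughly what I mean, with the id types stubbed out for illustration (not the actual nlprule definitions):

// Stub id types just for this sketch.
#[derive(Clone, Copy, PartialEq)]
struct WordIdInt(u32);
#[derive(Clone, Copy, PartialEq)]
struct PosIdInt(u16);

// Before (roughly): IndexMap<WordIdInt, Vec<PosIdInt>>.
// After: one flat Vec; with mostly 1 and at most ~3 entries per word,
// a linear scan is cheaper than the nested allocations.
struct TagGroup {
    entries: Vec<(WordIdInt, PosIdInt)>,
}

impl TagGroup {
    fn tags_for(&self, word: WordIdInt) -> impl Iterator<Item = PosIdInt> + '_ {
        self.entries
            .iter()
            .filter(move |(w, _)| *w == word)
            .map(|(_, p)| *p)
    }
}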

drahnr avatar Apr 07 '21 16:04 drahnr

Thanks for looking into this!

Vec for the inner 2 layers should already give a pretty good speedup.

Another approach would be to eliminate the intermediate layer

Sorry, what's the difference between these?

Using a Vec<(WordIdInt, PosIdInt)>

IIRC the reason for IndexMap<WordIdInt, Vec<PosIdInt>> was better binary size; I should've documented that... We can try Vec<(WordIdInt, PosIdInt)>, maybe the size difference is negligible.

Also, what do you think about e.g. https://docs.rs/flurry/0.3.1/flurry/ for parallelization?

I'd like to just do all the hard work in build.rs

If the above is not fast enough, I think the best approach in this direction would be splitting the tag store into multiple FSTs + deserializing in parallel. That should speed things up roughly by a factor of the number n of FSTs, assuming the concurrent hashmap is fast. n would have to be one by default and settable by nlprule-build (which would have to do deserialization + update n + serialization) but that would be significantly more work and add quite a bit of complexity. But the size difference is actually not too bad - using two FSTs instead of one adds ~1MB.
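A rough sketch of how deserializing several FSTs in parallel could look (hypothetical function and shape - splitting and later merging the id spaces is glossed over here):

use fst::{IntoStreamer, Map, Streamer};
use rayon::prelude::*;

// Each chunk is a separately serialized FST; deserialize them on the rayon
// thread pool and merge the partial word lists afterwards.
fn load_word_store(fst_chunks: Vec<Vec<u8>>) -> Vec<(String, u64)> {
    fst_chunks
        .into_par_iter()
        .map(|bytes| {
            let map = Map::new(bytes).expect("valid FST");
            let mut entries = Vec::with_capacity(map.len());
            let mut stream = map.into_stream();
            while let Some((key, value)) = stream.next() {
                entries.push((String::from_utf8_lossy(key).into_owned(), value));
            }
            entries
        })
        .reduce(
            || Vec::new(),
            |mut a, mut b| {
                a.append(&mut b);
                a
            },
        )
}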

bminixhofer avatar Apr 07 '21 16:04 bminixhofer

Thanks for looking into this!

Vec for the inner 2 layers should already give a pretty good speedup.

Another approach would be to eliminate the intermediate layer

Sorry, what's the difference between these?

One retains the current 2-layer inner nesting structure, the other flattens the IndexMap<_, Vec<_>> to a single Vec.

Using a Vec<(WordIdInt, PosIdInt)>

IIRC the reason for IndexMap<WordIdInt, Vec<PosIdInt>> was better binary size; I should've documented that... We can try Vec<(WordIdInt, PosIdInt)>, maybe the size difference is negligible.

Also, what do you think about e.g. https://docs.rs/flurry/0.3.1/flurry/ for parallelization?

That could work, but the most significant part is not the insertion but just calling next() on the Stream - I tried to write a safe wrapper as an iterator so I could use rayon with a parallel iterator adapter, but that suffers from the same issue.

I'd like to just do all the hard work in build.rs

If the above is not fast enough, I think the best approach in this direction would be splitting the tag store into multiple FSTs + deserializing in parallel. That should speed things up roughly by a factor of the number n of FSTs, assuming the concurrent hashmap is fast. n would have to be one by default and settable by nlprule-build (which would have to do deserialization + update n + serialization) but that would be significantly more work and add quite a bit of complexity. But the size difference is actually not too bad - using two FSTs instead of one adds ~1MB.

I think fixing the topology of the Tagger should be the first step. I am not very deep into the code yet, so I don't know whether that already implies adding multiple FSTs or not.

drahnr avatar Apr 07 '21 16:04 drahnr

Please rebase this onto main; there's now a simple bench for rules and tokenizer loading speed, and a CI fix.

The Tagger is not well documented at the moment; I'll try to get around to better documentation a bit later so you know roughly what's going on when you're working on it :)

bminixhofer avatar Apr 08 '21 07:04 bminixhofer

That'd be much appreciated, I'll have some time tomorrow night to look into this further.

drahnr avatar Apr 08 '21 07:04 drahnr

Alright, there's better doc comments for the tagger on main now. I think I covered the important parts, let me know if anything is unclear!

bminixhofer avatar Apr 08 '21 11:04 bminixhofer

This is somewhat addressed by #66 and #70. I'll keep this open because there are currently (at least) two more things worth investigating in this area:

  1. Track loading speed (and possibly some other metrics) over time using e.g. https://github.com/rhysd/github-action-benchmark to prevent regressions.
  2. Investigate parallelization of deserialization across structs. For example, the Tagger and the three Chunker models are all quite expensive to deserialize, and all of that happens sequentially at the moment although it could be done in parallel (a rough sketch follows below). I'm not sure how difficult this would be; there's some related discussion in https://github.com/bincode-org/bincode/issues/139 but I don't think this will land in serde anytime soon.
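For point 2, the shape would be roughly the following - a sketch over generic blobs, assuming the structs were serialized as separate bincode slices, which is not how the binaries are currently laid out:

use serde::de::DeserializeOwned;

// Deserialize two independent blobs on separate threads; nesting rayon::join
// would extend this to the tagger plus the three chunker models.
fn load_pair<A, B>(a_bytes: &[u8], b_bytes: &[u8]) -> (A, B)
where
    A: DeserializeOwned + Send,
    B: DeserializeOwned + Send,
{
    rayon::join(
        || bincode::deserialize(a_bytes).expect("valid blob"),
        || bincode::deserialize(b_bytes).expect("valid blob"),
    )
}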

bminixhofer avatar Apr 28 '21 12:04 bminixhofer