lingua-rs
Load models slightly more eagerly and reuse for all ngrams during detection.
The library puts emphasis on lazy loading in order to be efficient in certain serverless environments. My use case requires maximum performance, and I can sacrifice slower startup for better runtime performance. The `LanguageDetectorBuilder` has a flag, `with_preloaded_language_models`, that forces eager loading of models. However, during detection the `LanguageDetector` loops through every ngram and calls the function `load_language_models`, which has to take a read lock to check whether the model is loaded. Read locks are cheap, but not free. As an experiment I completely eliminated lazy loading and stored the models directly in the `LanguageDetector`. The single-threaded benchmark improved by 1.2x-2x and the multi-threaded one by 2x-4x, depending on the system.
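To make the cost concrete, here is a minimal sketch of the lazy-loading pattern described above. The names (`LazyModels`, `probability`) and the model representation are hypothetical stand-ins, not the library's actual API; the point is that the fast path pays for a read lock on every single ngram lookup.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical stand-in for an ngram probability model.
type Model = HashMap<String, f64>;

// Sketch of per-ngram lazy loading: every lookup first takes a read
// lock to check whether the model is present, upgrading to a write
// lock on a miss.
struct LazyModels {
    models: RwLock<HashMap<&'static str, Model>>,
}

impl LazyModels {
    fn new() -> Self {
        Self { models: RwLock::new(HashMap::new()) }
    }

    fn probability(&self, language: &'static str, ngram: &str) -> f64 {
        // Fast path: a read lock is taken on *every* ngram lookup.
        {
            let guard = self.models.read().unwrap();
            if let Some(model) = guard.get(language) {
                return model.get(ngram).copied().unwrap_or(0.0);
            }
        }
        // Slow path: load the model under a write lock.
        // In the real library this would deserialize the model from disk.
        let mut guard = self.models.write().unwrap();
        let model = guard
            .entry(language)
            .or_insert_with(|| HashMap::from([("ab".to_string(), 0.25)]));
        model.get(ngram).copied().unwrap_or(0.0)
    }
}

fn main() {
    let models = LazyModels::new();
    // The first call loads the model; later calls still pay for the read lock.
    assert_eq!(models.probability("eng", "ab"), 0.25);
    assert_eq!(models.probability("eng", "zz"), 0.0);
    println!("ok");
}
```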
Patch
This patch works within the lazy-loading design. Instead of lazy-loading the model for every ngram, I load models slightly more eagerly in `compute_sum_of_ngram_probabilities` and in `count_unigrams` for the specified language, and reuse the models for all ngrams in the loop. For this to work I had to use `Arc` instead of `Box` in `BoxedLanguageModel`.
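The idea can be sketched as follows, under the same assumptions as above (the struct and method names here are illustrative, not the library's actual code): clone an `Arc` to the language's model out of the cache once per call, then iterate over the ngrams without touching the lock again.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for an ngram probability model.
type Model = HashMap<String, f64>;

struct SharedModels {
    // Arc instead of Box, so a reference to the model can be cloned
    // out of the cache and outlive the lock guard.
    models: RwLock<HashMap<&'static str, Arc<Model>>>,
}

impl SharedModels {
    // One lock acquisition per call to fetch (or load) the model.
    fn get_model(&self, language: &'static str) -> Arc<Model> {
        if let Some(m) = self.models.read().unwrap().get(language) {
            return Arc::clone(m);
        }
        let mut guard = self.models.write().unwrap();
        Arc::clone(guard.entry(language).or_insert_with(|| {
            // The real library deserializes the model from disk here.
            Arc::new(HashMap::from([("ab".to_string(), 0.25)]))
        }))
    }

    // Analogue of compute_sum_of_ngram_probabilities: the lock is
    // taken once per call, not once per ngram.
    fn sum_of_probabilities(&self, language: &'static str, ngrams: &[&str]) -> f64 {
        let model = self.get_model(language);
        ngrams.iter().map(|n| model.get(*n).copied().unwrap_or(0.0)).sum()
    }
}

fn main() {
    let cache = SharedModels { models: RwLock::new(HashMap::new()) };
    let sum = cache.sum_of_probabilities("eng", &["ab", "ab", "zz"]);
    assert!((sum - 0.5).abs() < 1e-9);
    println!("ok");
}
```

Cloning an `Arc` is a single atomic increment, so hoisting the lookup out of the ngram loop replaces many read-lock acquisitions with one clone per language.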
Benchmark
For the benchmark I used the accuracy reports test data. The benchmark code is here. I tested both single-threaded and multi-threaded/parallel mode.
Results
I tested the patch on two machines
- Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz on Linux
- M1 Max on Mac
The numbers in the columns Before and After are throughput as detections per second.
Single-threaded benchmark
```
cargo run --release --bin bench -- --max-examples 30000
```
| Machine | Before | After | Change |
|---|---|---|---|
| i7-6800K | 1162 | 1294 | 1.11x |
| M1 Max | 1592 | 1844 | 1.15x |
Multi-threaded benchmark
```
cargo run --release --bin bench -- --max-examples 50000 --parallel
```
| Machine | Before | After | Change |
|---|---|---|---|
| i7-6800K | 4226 | 8033 | 1.9x |
| M1 Max | 1383 | 5858 | 4.23x |
I am really surprised by the numbers on the M1 chip. Before the patch, the single-threaded benchmark ran faster than the multi-threaded one, which suggests that contended `RwLock`s are slow on the M1 Mac.
I ran the accuracy reports and checked them against the current main branch:

```
diff -r accuracy-reports/lingua/ ../lingua-rs-main/accuracy-reports/lingua/
```

The command found no differences.
Hi @serega, thank you for this PR. This looks interesting. I'm quite busy currently but will definitely look into your changes as soon as I find the time.
Hi @pemistahl. I have two more relatively simple patches for performance improvements that cut processing time by another half. I use these patches in production, but I would prefer to use your original project. I'm wondering if I should submit my patches.
Hi @serega, I'm sorry for not having merged your changes yet. I was busy with other projects in the last months. But I plan to start working on the Rust implementation of Lingua again. I maintain four implementations of Lingua (Rust, Go, Kotlin, Python). The Go and Python ones are the newest as far as new features and improvements are concerned. The goal is to adapt those features and improvements to the Rust and Kotlin implementations as well.
Can you please update this PR with your new patches? I will then do a new review. Thank you.
Hi @pemistahl. I know about the other implementations too. Thanks 👍. We have a team using the Kotlin version, although I know they are having issues with memory usage. Having multiple unrelated changes in the same pull request would be too challenging to review, I think. Pull request #148 contains one of the two other patches I use.
I have now implemented this concept in c0f7e71, so this PR can be closed.