
Load models slightly more eagerly and reuse for all ngrams during detection.

Open · serega opened this issue 1 year ago · 1 comment

The library puts emphasis on lazy loading in order to be efficient in certain serverless environments. My use case requires maximum performance, and I can accept a slower startup in exchange for better runtime performance. The LanguageDetectorBuilder has a flag with_preloaded_language_models that forces eager loading of models. However, during detection the LanguageDetector loops through every ngram and calls the function load_language_models, which has to take a read lock to check whether the model is loaded. Read locks are cheap, but not free. As an experiment I completely eliminated lazy loading and stored the models directly in the LanguageDetector. The single-threaded benchmark improved by 1.2x-2x and the multi-threaded one by 2x-4x, depending on the system.
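The access pattern described above can be sketched as follows. This is not lingua's actual code; `ModelCache`, `LanguageModel`, and `get_model` are illustrative stand-ins for the cache guarded by a RwLock:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for a per-language ngram model.
struct LanguageModel {
    #[allow(dead_code)]
    probabilities: HashMap<String, f64>,
}

// Lazily populated cache: every lookup takes at least a read
// lock, even after the model has already been loaded.
struct ModelCache {
    models: RwLock<HashMap<String, Arc<LanguageModel>>>,
}

impl ModelCache {
    fn new() -> Self {
        ModelCache {
            models: RwLock::new(HashMap::new()),
        }
    }

    // Called once per ngram in the hot loop: the read lock is
    // cheap, but not free, and it is contended when many
    // threads run detection in parallel.
    fn get_model(&self, language: &str) -> Arc<LanguageModel> {
        if let Some(model) = self.models.read().unwrap().get(language) {
            return Arc::clone(model);
        }
        // Slow path: load the model (here: an empty placeholder)
        // under the write lock.
        let mut models = self.models.write().unwrap();
        Arc::clone(models.entry(language.to_string()).or_insert_with(|| {
            Arc::new(LanguageModel {
                probabilities: HashMap::new(),
            })
        }))
    }
}
```

Even the fast path pays for the read lock on every call, which is exactly the per-ngram cost the patch below avoids.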

Patch

This patch works within the lazy loading design. Instead of lazy-loading the model for every ngram, I load models slightly more eagerly in compute_sum_of_ngram_probabilities and in count_unigrams for the specified language, and reuse the models for all ngrams in the loop. For this to work I had to use Arc instead of Box in BoxedLanguageModel.
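The core idea can be sketched like this, with simplified stand-in types (`Model`, `sum_lazy`, `sum_eager` are illustrative names, not lingua's API): take the lock once per language instead of once per ngram, and hold the model through an `Arc` clone for the rest of the loop.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

struct Model {
    probabilities: HashMap<String, f64>,
}

impl Model {
    fn probability(&self, ngram: &str) -> f64 {
        *self.probabilities.get(ngram).unwrap_or(&0.0)
    }
}

type Cache = RwLock<HashMap<String, Arc<Model>>>;

// Before: a cache lookup (and a read lock) for every single ngram.
fn sum_lazy(cache: &Cache, language: &str, ngrams: &[&str]) -> f64 {
    ngrams
        .iter()
        .map(|ngram| {
            let model = Arc::clone(cache.read().unwrap().get(language).unwrap());
            model.probability(ngram)
        })
        .sum()
}

// After: take the lock once per language, then reuse the model.
// Storing Arc (rather than Box) in the cache makes this cheap:
// the clone is a reference-count bump, taken outside the loop.
fn sum_eager(cache: &Cache, language: &str, ngrams: &[&str]) -> f64 {
    let model = Arc::clone(cache.read().unwrap().get(language).unwrap());
    ngrams.iter().map(|ngram| model.probability(ngram)).sum()
}
```

Both functions compute the same sum; only the number of lock acquisitions differs.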

Benchmark

For the benchmark I used the accuracy reports test data. The benchmark code is here. I tested both single-threaded and multi-threaded (parallel) modes.
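The throughput measurement reported below can be reproduced with a harness along these lines. This is a sketch, not the actual benchmark code; `detections_per_second` is a hypothetical helper and the closure stands in for the real detector call:

```rust
use std::time::Instant;

// Measure throughput as detections per second over a corpus.
// `detect` is a placeholder for the real detection call,
// e.g. LanguageDetector::detect_language_of.
fn detections_per_second<F: Fn(&str)>(texts: &[&str], detect: F) -> f64 {
    let start = Instant::now();
    for text in texts {
        detect(text);
    }
    texts.len() as f64 / start.elapsed().as_secs_f64()
}
```

For the parallel mode, the same loop would be split across threads (e.g. with rayon) and the combined count divided by the wall-clock time.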

Results

I tested the patch on two machines:

  • Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz on Linux
  • M1 Max on Mac

The numbers in the columns Before and After are throughput as detections per second.

Single threaded benchmark

```
cargo run --release --bin bench -- --max-examples 30000
```

| Machine  | Before | After | Change |
|----------|--------|-------|--------|
| i7-6800K | 1162   | 1294  | 1.11x  |
| M1 Max   | 1592   | 1844  | 1.15x  |

Multi threaded benchmark

```
cargo run --release --bin bench -- --max-examples 50000 --parallel
```

| Machine  | Before | After | Change |
|----------|--------|-------|--------|
| i7-6800K | 4226   | 8033  | 1.9x   |
| M1 Max   | 1383   | 5858  | 4.23x  |

I am really surprised by the numbers on the M1 chip. Before the patch, the single-threaded benchmark ran faster than the multi-threaded one, which suggests that RwLocks are slow on the M1 Mac.
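The contended pattern behind that observation can be sketched as follows (a minimal illustration with `std::thread`, not lingua's code): many reader threads each take the read lock per lookup, so the lock's internal state bounces between cores, and on some platforms that overhead can exceed the single-threaded cost.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

// Several reader threads hammering one RwLock-guarded cache.
// Each lookup acquires and releases the read lock, which is
// the contention the benchmark above exposes.
fn parallel_lookups(
    cache: Arc<RwLock<HashMap<String, f64>>>,
    threads: usize,
    lookups_per_thread: usize,
) -> usize {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || {
                let mut hits = 0usize;
                for _ in 0..lookups_per_thread {
                    if cache.read().unwrap().contains_key("en") {
                        hits += 1;
                    }
                }
                hits
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

With the patch, each thread clones the `Arc` once per language and then reads the model without touching the lock again, which is why the multi-threaded numbers improve the most.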

I ran the accuracy reports and checked them against the current main branch with `diff -r accuracy-reports/lingua/ ../lingua-rs-main/accuracy-reports/lingua/`. The command found no differences.

serega avatar Jul 04 '22 03:07 serega

Hi @serega, thank you for this PR. This looks interesting. I'm quite busy currently but will definitely look into your changes as soon as I find the time.

pemistahl avatar Jul 07 '22 09:07 pemistahl

Hi @pemistahl. I have two more relatively simple patches for performance improvements that cut processing time by another half. I use these patches in production, but I would like to use your original project. I'm wondering whether I should submit my patches.

serega avatar Jan 19 '23 15:01 serega

Hi @serega, I'm sorry for not having merged your changes yet. I was busy with other projects in the last months. But I plan to start working on the Rust implementation of Lingua again. I maintain four implementations of Lingua (Rust, Go, Kotlin, Python). The Go and Python ones are the newest as far as new features and improvements are concerned. The goal is to adapt those features and improvements to the Rust and Kotlin implementations as well.

Can you please update this PR with your new patches? I will then do a new review. Thank you.

pemistahl avatar Jan 20 '23 10:01 pemistahl

Hi @pemistahl. I know about the other implementations too. Thanks 👍. We have a team using the Kotlin version, although I know they are having issues with memory usage. I think having multiple unrelated changes in the same pull request would be too challenging to review. Pull request #148 contains one of the two other patches I use.

serega avatar Jan 21 '23 12:01 serega

I have now implemented this concept in c0f7e71, so this PR can be closed.

pemistahl avatar May 24 '23 19:05 pemistahl