Question regarding benchmark Lingua comparison

Open Marcono1234 opened this issue 1 year ago • 4 comments

Hello, in the benchmark in your README, Lingua shows pretty bad performance. How exactly do you execute Lingua? Lingua uses quite large models which have to be loaded once (or lazily during usage), but afterwards detection should be quite fast, I think, as long as you keep reusing the same detector (which is the intended usage). However, if you create a new detector instance for every detection, performance will be rather bad. Also, Lingua requires a lot of memory at runtime, so if you are running it in a memory-constrained environment, its performance may suffer as well.
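
The difference between the two usage patterns can be sketched with a stand-in detector. `FakeDetector` and its sleep-based load cost are hypothetical placeholders, not Lingua's API; the only point is that the loading cost is paid once when the instance is reused and once per text when it is rebuilt.

```python
import time


class FakeDetector:
    """Hypothetical stand-in for a detector with expensive model loading."""

    loads = 0  # counts how many times the "models" were loaded

    def __init__(self, load_seconds=0.01):
        time.sleep(load_seconds)  # simulated model loading cost
        FakeDetector.loads += 1

    def detect(self, text):
        # Toy rule for illustration only, not a real language detector.
        return "es" if "hola" in text.lower() else "en"


def detect_reusing(texts):
    detector = FakeDetector()  # models are loaded once
    return [detector.detect(t) for t in texts]


def detect_rebuilding(texts):
    # Anti-pattern: pays the loading cost again for every single text.
    return [FakeDetector().detect(t) for t in texts]


def timed(fn, texts):
    start = time.perf_counter()
    result = fn(texts)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    texts = ["Hello world", "Hola mundo"] * 10
    _, reuse_s = timed(detect_reusing, texts)
    _, rebuild_s = timed(detect_rebuilding, texts)
    print(f"reuse: {reuse_s:.3f}s, rebuild: {rebuild_s:.3f}s")
```

With a real detector the gap would depend on how large the models are and whether they are loaded eagerly or lazily.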

Have you tried Lingua version 2 as well[^1]? It is based on the Rust implementation and its performance will likely be better. For measuring performance it might also be useful to:

Thanks for doing this benchmark in the first place though!

[^1]: That version might also cover more than the 54 languages you mention in the README.

Marcono1234, Jul 07 '24 16:07

Soon I'm going to redo all benchmarks, for an ELD v3, so it is a good opportunity to fix anything that might be incorrect.

For Lingua I use the same detector for each line, so that is not the problem. I did the benchmarks on a 16 GB machine, now I have 32 GB. I don't see any problem with memory, it uses ~400 MB, not too much really. This is on Windows 10. I was surprised at how slow it was and tried different things, but I also saw that others had the same problem.

Have you tried running Lingua < 2.0 against any of the other detectors I tested, to see if the performance difference matches?

I have not tried Lingua v2, I guess I will for the new benchmarks.

nitotm, Jul 08 '24 11:07

> I did the benchmarks on a 16 GB machine, now I have 32 GB. I don't see any problem with memory, it uses ~400 MB, not too much really.

Yes you are right, that should be more than enough.

> Have you tried running Lingua < 2.0 against any of the other detectors I tested, to see if the performance difference matches?

Sorry, I hadn't actually tried Lingua < 2.0 yet. But I have compared Lingua 1.3.5 and 2.0.2 now:

| Lingua version | Loading all models[^1] | Detection[^2] |
| --- | --- | --- |
| 1.3.5 | 29.02s | 233.68s |
| 2.0.2 | 8.43s | 21.97s |

So it seems you are right, the performance of Lingua < 2.0 is really not that great. It would really be worth giving Lingua 2 a try.

[^1]: Using `LanguageDetectorBuilder.from_all_languages().with_preloaded_language_models()`

[^2]: I was running detection 1000 times on 16 sentences in different languages; the absolute time values might not be that interesting here, but rather the ratio between the Lingua versions.
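
A measurement like the one described in the footnotes could be structured roughly as below: time the detector construction separately from the detection loop, then compare versions by ratio rather than by absolute time. `build_detector` is a hypothetical callable supplied by the caller (for Lingua it would wrap the builder call from footnote 1 and return the detector's detection function); the toy detector only keeps the sketch runnable without any library installed.

```python
import time


def benchmark(build_detector, sentences, repeats=1000):
    """Return (load_seconds, detect_seconds) for one detector."""
    start = time.perf_counter()
    detect = build_detector()  # model loading happens here
    load_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(repeats):
        for sentence in sentences:
            detect(sentence)
    detect_seconds = time.perf_counter() - start
    return load_seconds, detect_seconds


def speedup(old_seconds, new_seconds):
    """How many times faster the new measurement is than the old one."""
    return old_seconds / new_seconds


def build_toy_detector():
    # Placeholder so the sketch runs without a detection library.
    return lambda text: "en"


if __name__ == "__main__":
    load_s, detect_s = benchmark(build_toy_detector, ["some text"] * 16, repeats=10)
    print(f"load {load_s:.4f}s, detect {detect_s:.4f}s")
    # Ratios computed from the timings reported in the table (1.3.5 vs 2.0.2):
    print(f"loading:   {speedup(29.02, 8.43):.2f}x")    # about 3.44x
    print(f"detection: {speedup(233.68, 21.97):.2f}x")  # about 10.64x
```

On the reported numbers, loading is roughly 3.4 times faster and detection roughly 10.6 times faster in 2.0.2.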

Marcono1234, Jul 14 '24 21:07

I'm redoing the benchmarks for v3 and trying Lingua 2.0.2. What a difference, really: compared with my installation of 1.3.2 it is much faster. I'm also using `with_preloaded_language_models()` and it is reasonably fast now. I will close the issue when I publish v3.

nitotm, Aug 16 '24 15:08

I uploaded ELD v3-beta with the new benchmarks, now Lingua is reasonably fast.

I still find discrepancies in their benchmarks. According to them, Lingua-low is 2x slower than fasttext, which is fine; I measured 2x to 5x depending on the benchmark. But in their test CLD2 is very similar in speed to fasttext, and I think CLD2 should be at least 2x faster than fasttext.
(Also, their benchmark for CLD2 is unfair, as they are not using `bestEffort = True`, which would improve its accuracy considerably.)

Discussion for v3-beta at: https://github.com/nitotm/efficient-language-detector/discussions/10

nitotm, Sep 05 '24 14:09