Annif
Use Lingua instead of pycld3 for language detection
This draft PR fixes #593 by switching from the pycld3 language detection library to Lingua (by @pemistahl).
Lingua is used in its low accuracy mode, which is much faster than the high accuracy mode and needs much less memory. I tested the high accuracy mode very briefly, but the startup overhead alone was so high (tens of seconds) that I considered it a non-starter.
I did a little benchmarking using the Annif tutorial yso-nlf data set and two project configurations used in the tutorial with two backend algorithms, MLLM and Omikuji Parabel. I compared current master (which uses pycld3) to this PR branch which uses Lingua 1.1.1. As a baseline, I also used project configurations with no language filtering. Here are the project configurations:
```ini
[yso-mllm-en-filter]
name=YSO MLLM project
language=en
backend=mllm
vocab=yso-en
analyzer=snowball(english)
transform=limit(10000),filter_lang,limit(5000)

[yso-omikuji-parabel-en-filter]
name=Omikuji Parabel English
language=en
backend=omikuji
analyzer=snowball(english)
vocab=yso-en
transform=limit(10000),filter_lang,limit(5000)
```
For the unfiltered baseline I used `transform=limit(5000)` instead.
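Conceptually, the `filter_lang` transform drops the parts of a document whose detected language differs from the project language. Here is a minimal sketch of the idea (this is not Annif's actual implementation; `detect_language` stands in for whichever detector backend is plugged in, and the toy detector below is purely illustrative):

```python
def filter_lang(text, project_language, detect_language):
    """Keep only lines whose detected language matches the project
    language (or could not be detected). Sketch only, not Annif's code."""
    kept = []
    for sentence in text.split("\n"):
        detected = detect_language(sentence)  # e.g. "en", "fi", or None
        if detected is None or detected == project_language:
            kept.append(sentence)
    return "\n".join(kept)

# Toy detector: pretend anything containing "ä" or "ö" is Finnish.
toy_detect = lambda s: "fi" if set(s) & set("äö") else "en"

text = "This sentence is English.\nTämä lause on suomeksi."
print(filter_lang(text, "en", toy_detect))  # keeps only the English line
```

In the project configurations above, the surrounding `limit` transforms mean the filter only ever sees the first 10000 characters, and the result is then truncated to 5000 characters.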
Here are some performance stats (total user time over all CPU cores and maximum resident set size) that I measured using `/usr/bin/time -v`:
operation | notes | no filter time (s) | no filter mem (kB) | pycld3 time (s) | pycld3 mem (kB) | lingua-low time (s) | lingua-low mem (kB)
---|---|---|---|---|---|---|---
pytest | optionals: dev,omikuji,pycld3/lingua | - | - | 76 | 1302904 | 77 | 1299056
loadvoc yso | yso-skos.ttl | 121 | 2468144 | - | - | 120 | 2466232
train mllm | -d 2000 -j 8 | 646 | 1876336 | 608 | 1879700 | 2176 | 1893320
suggest mllm | 2017-D-52518.txt | 7 | 283548 | 7 | 278576 | 14 | 291960
eval mllm | -j 8 | 128 | 360836 | 131 | 361688 | 624 | 359360
train omikuji | -j 8 yso-finna-small.tsv.gz | 124 | 663368 | 125 | 682708 | 131 | 684188
suggest omikuji | 2017-D-52518.txt | 5 | 400784 | 6 | 396836 | 14 | 410744
eval omikuji | -j 8 | 22 | 483988 | 29 | 483704 | 513 | 481508
Here are the evaluation results (running `annif eval` on the 300 documents in the test set and measuring F1@5 and nDCG scores; higher is better):
Project type | no filter f1@5 | no filter ndcg | pycld3 f1@5 | pycld3 ndcg | lingua-low f1@5 | lingua-low ndcg |
---|---|---|---|---|---|---|
mllm | 0.3276 | 0.4334 | 0.3236 | 0.4282 | 0.3183 | 0.4228 |
omikuji | 0.2562 | 0.3543 | 0.2435 | 0.3287 | 0.2524 | 0.3385 |
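For reference, F1@5 scores the overlap between the top five suggested subjects and the gold-standard subjects of a document. A minimal per-document sketch (Annif's own evaluation averages such scores over the whole collection; the subject identifiers below are made up):

```python
def f1_at_k(suggested, gold, k=5):
    """Harmonic mean of precision@k and recall, for one document."""
    top_k = suggested[:k]
    hits = len(set(top_k) & set(gold))
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)   # share of suggestions that are correct
    recall = hits / len(gold)       # share of gold subjects that were found
    return 2 * precision * recall / (precision + recall)

# 2 of the top-5 suggestions are correct, out of 3 gold subjects:
print(f1_at_k(["s1", "s2", "s3", "s4", "s5"], ["s2", "s5", "s9"]))  # 0.5
```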
The good news:
- Lingua starts up quickly (in low-accuracy mode)
- Lingua doesn't use any more memory than pycld3 (in low-accuracy mode)
The bad news:
- Lingua is still a lot slower than pycld3 in the grunt work of filtering long documents sentence by sentence. For example, when training MLLM with 2000 documents (truncated to max 10000 characters each by the limit filter), the user time increased from ~600 to ~2100 seconds. Likewise, evaluation time on 300 documents increased by ~500 seconds. For suggest operations on a single document, the increase was ~7 seconds (but this most likely includes some initialization overhead, so the next document would have been processed faster).
- In general, this experiment didn't show any benefit of language filtering. The evaluation results were actually best for the baseline experiment with no filtering; using either pycld3 or Lingua just made the results worse. For other data sets and languages, the situation could be different.
- Maybe this is nitpicking, but it was surprisingly hard to get a standard lowercase ISO 639-1 language code out of Lingua. Eventually I found that an expression like `result.iso_code_639_1.name.lower()` does the trick. pycld3 returns language codes directly, which makes its API easier to use.
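For context, Lingua exposes ISO codes as Python `Enum` members rather than strings, which is why the `.name.lower()` dance is needed. A mock of the pattern (the classes below only imitate Lingua's API shape; they are not its actual definitions):

```python
from enum import Enum

class IsoCode639_1(Enum):
    """Mock of an ISO 639-1 enum: member names are the uppercase
    two-letter codes, so .name yields e.g. "EN"."""
    EN = 1
    FI = 2

class DetectedLanguage:
    """Stand-in for the object a detector returns."""
    def __init__(self, iso_code_639_1):
        self.iso_code_639_1 = iso_code_639_1

result = DetectedLanguage(IsoCode639_1.EN)
# The expression from the text: enum member name, lowercased.
print(result.iso_code_639_1.name.lower())  # "en"
```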
I think the take home message is that if Lingua could be made faster still for the detection process, then we could consider switching to it. Right now it seems that the performance cost is quite high. It would also be nice to identify a data set where the language filtering actually improves results; we could then measure whether Lingua does this better than pycld3 or not. This data set was not a good choice in that respect.
Codecov Report
Base: 99.61% // Head: 99.59% // Decreases project coverage by 0.02% :warning:
Coverage data is based on head (3bbe813) compared to base (ec10014). Patch coverage: 100.00% of modified lines in pull request are covered.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master     #615      +/-  ##
==========================================
- Coverage   99.61%   99.59%   -0.03%
==========================================
  Files          87       87
  Lines        6038     5946      -92
==========================================
- Hits         6015     5922      -93
- Misses         23       24       +1
```
Impacted Files | Coverage Δ
---|---
annif/transform/__init__.py | 100.00% <ø> (ø)
tests/test_transform_langfilter.py | 100.00% <ø> (ø)
annif/transform/langfilter.py | 96.42% <100.00%> (-3.58%) :arrow_down:
annif/cli.py | 99.67% <0.00%> (-0.02%) :arrow_down:
tests/test_cli.py | 100.00% <0.00%> (ø)
Thank you @osma for adding my library to your evaluation. :)
It is not surprising that CLD3 is faster than Lingua. CLD3 is implemented in C++, whereas my library is pure Python (with the exception of the internally used NumPy arrays). You said that you would favor a pure Python library for language detection. Such a library will always be slower than one implemented in a low-level language, so there will always be compromises you have to make. As soon as PyO3 supports exporting Rust enums as Python enums, I will create Python bindings for my Rust implementation of Lingua, which will be significantly faster than the pure Python version.
It seems that you mainly want to classify large documents consisting of multiple sentences. For that kind of textual input, the high accuracy mode does not provide much benefit; it is better suited for short texts such as tweets. So the advantages of Lingua compared to other language detectors do not pay off for you. That's OK. I think it is better, then, if you stick with CLD3 to benefit from its faster detection.
> It is not surprising that CLD3 is faster than Lingua. CLD3 is implemented in C++, whereas my library is pure Python (with the exception of the internally used NumPy arrays). You said that you would favor a pure Python library for language detection. Such a library will always be slower than one implemented in a low-level language, so there will always be compromises you have to make. As soon as PyO3 supports https://github.com/PyO3/pyo3/issues/417, I will create Python bindings for my Rust implementation of Lingua, which will be significantly faster than the pure Python version.
Understood. But I think the current Lingua implementation (with NumPy vectors) is slower than it needs to be because of the O(log(n)) lookups - having to do binary searches in big sorted arrays. This is not a question of implementation language but of algorithmic efficiency. Even pure Python (or in this case, helped along by NumPy) can be quite fast. I wrote some ideas about further optimization of Lingua in this discussion.
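To illustrate the algorithmic point with a toy comparison (this is not Lingua's actual data layout): an n-gram frequency table stored as parallel sorted NumPy arrays requires an O(log n) binary search via `np.searchsorted` for each lookup, whereas a hash-based dict answers the same query in O(1) on average:

```python
import numpy as np

ngrams = np.array(["aa", "ab", "ba", "bb"])   # sorted n-gram keys
freqs = np.array([0.1, 0.4, 0.3, 0.2])        # parallel frequency values

def lookup_sorted(key):
    """O(log n) binary search in the sorted key array."""
    i = np.searchsorted(ngrams, key)
    if i < len(ngrams) and ngrams[i] == key:
        return float(freqs[i])
    return 0.0

# The same table as a hash map: O(1) average per lookup.
table = dict(zip(ngrams.tolist(), freqs.tolist()))

print(lookup_sorted("ba"))       # 0.3
print(table.get("zz", 0.0))      # 0.0 (missing key)
```

Both return identical results; the difference only shows up as a per-lookup cost multiplied over the millions of n-gram lookups a long document triggers.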
> It seems that you mainly want to classify large documents consisting of multiple sentences. For that kind of textual input, the high accuracy mode does not provide much benefit; it is better suited for short texts such as tweets. So the advantages of Lingua compared to other language detectors do not pay off for you. That's OK. I think it is better, then, if you stick with CLD3 to benefit from its faster detection.
The problem here is that sticking to CLD3 is not a good option, as explained in the OP of #593 - its most active Python binding library (pycld3) appears to not be actively maintained anymore, and the other ones (cld3, gcld3) are even older. pycld3 doesn't work with Python 3.10. So unless someone starts maintaining it again, we will need to switch to something else.
Hi @osma,
I have just released Lingua 1.1.2, which removes the most significant performance problems of the previous version. The language models are now stored on disk as serialized NumPy arrays instead of JSON. This reduces the preloading time of the language models significantly (between 1 and 2 seconds for all models on my machine). I have also removed a bottleneck in the language detection code, which makes language detection approximately 40% faster.
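The speedup described above comes from deserialization cost: loading a serialized NumPy array is essentially a bulk binary read, while JSON must be parsed value by value. A small illustration of the technique in generic NumPy terms (the file names and array contents are made up; this is not Lingua's actual model format):

```python
import json
import os
import tempfile

import numpy as np

# Stand-in "language model": a large array of probabilities.
model = np.random.rand(100_000).astype(np.float32)

tmpdir = tempfile.mkdtemp()
npy_path = os.path.join(tmpdir, "model.npy")
json_path = os.path.join(tmpdir, "model.json")

np.save(npy_path, model)              # binary dump of the raw array
with open(json_path, "w") as f:
    json.dump(model.tolist(), f)      # every float serialized as text

loaded = np.load(npy_path)            # bulk read, no per-value parsing
assert np.array_equal(loaded, model)

# The binary file is also far smaller than the JSON text.
print(os.path.getsize(npy_path) < os.path.getsize(json_path))  # True
```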
Can you please do your evaluation again with the new version? Would you now consider switching to my library?
Thanks. :)
Thanks @pemistahl for the update, that is great news!
I will try to do a new round of experiments soon, comparing language filtering with pycld3, Lingua, or the recently added language detection functionality in Simplemma. This time I will use a data set that should actually benefit from the filtering; the tutorial data set I used above was a bit disappointing in this respect.
Rebased this PR branch on current `master` and force-pushed. Also upgraded to Lingua 1.1.2.
Kudos, SonarCloud Quality Gate passed!

- 0 Bugs
- 0 Vulnerabilities
- 0 Security Hotspots
- 0 Code Smells
- No Coverage information
- 0.0% Duplication