
Use Lingua instead of pycld3 for language detection

Open osma opened this issue 1 year ago • 7 comments

This draft PR fixes #593 by switching from the pycld3 language detection library to Lingua (by @pemistahl).

Lingua is used in low-accuracy mode, because it is much faster than high-accuracy mode and needs far less memory. I tested high-accuracy mode very briefly, but the startup overhead alone (tens of seconds) made it a non-starter.

I did a little benchmarking using the Annif tutorial yso-nlf data set and two project configurations used in the tutorial with two backend algorithms, MLLM and Omikuji Parabel. I compared current master (which uses pycld3) to this PR branch which uses Lingua 1.1.1. As a baseline, I also used project configurations with no language filtering. Here are the project configurations:

[yso-mllm-en-filter]
name=YSO MLLM project
language=en
backend=mllm
vocab=yso-en
analyzer=snowball(english)
transform=limit(10000),filter_lang,limit(5000)

[yso-omikuji-parabel-en-filter]
name=Omikuji Parabel English
language=en
backend=omikuji
analyzer=snowball(english)
vocab=yso-en
transform=limit(10000),filter_lang,limit(5000)

For the unfiltered baseline I used transform=limit(5000) instead.
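The transform setting is a comma-separated pipeline applied left to right: truncate to 10000 characters, filter by language, truncate again to 5000 characters. A minimal sketch of that chaining idea (not Annif's actual implementation; the `detect` callable here is a hypothetical stand-in for a per-sentence language detector):

```python
# Sketch of a limit(10000),filter_lang,limit(5000)-style transform chain.
# Not Annif's real code; `detect` is a hypothetical stand-in that returns
# an ISO 639-1 code for each line of text.
def limit(n):
    def transform(text):
        return text[:n]
    return transform

def filter_lang(detect, wanted):
    def transform(text):
        kept = [line for line in text.split("\n") if detect(line) == wanted]
        return "\n".join(kept)
    return transform

def apply_chain(text, transforms):
    # each transform receives the previous transform's output
    for t in transforms:
        text = t(text)
    return text

chain = [limit(10000),
         filter_lang(lambda line: "en", "en"),  # keep only English lines
         limit(5000)]
print(apply_chain("a line of English text", chain))
```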

Here are some performance stats (total user time in seconds over all CPU cores, and maximum resident set size in kilobytes) that I measured using /usr/bin/time -v:

| operation | notes | no filter time | no filter mem | pycld3 time | pycld3 mem | lingua-low time | lingua-low mem |
|---|---|---|---|---|---|---|---|
| pytest | optionals: dev,omikuji,pycld3/lingua | - | - | 76 | 1302904 | 77 | 1299056 |
| loadvoc yso | yso-skos.ttl | 121 | 2468144 | - | - | 120 | 2466232 |
| train mllm | -d 2000 -j 8 | 646 | 1876336 | 608 | 1879700 | 2176 | 1893320 |
| suggest mllm | 2017-D-52518.txt | 7 | 283548 | 7 | 278576 | 14 | 291960 |
| eval mllm | -j 8 | 128 | 360836 | 131 | 361688 | 624 | 359360 |
| train omikuji | -j 8 yso-finna-small.tsv.gz | 124 | 663368 | 125 | 682708 | 131 | 684188 |
| suggest omikuji | 2017-D-52518.txt | 5 | 400784 | 6 | 396836 | 14 | 410744 |
| eval omikuji | -j 8 | 22 | 483988 | 29 | 483704 | 513 | 481508 |

Here are the evaluation results (running annif eval on the 300 documents in the test set and measuring F1@5 and nDCG scores - higher is better):

| Project type | no filter F1@5 | no filter nDCG | pycld3 F1@5 | pycld3 nDCG | lingua-low F1@5 | lingua-low nDCG |
|---|---|---|---|---|---|---|
| mllm | 0.3276 | 0.4334 | 0.3236 | 0.4282 | 0.3183 | 0.4228 |
| omikuji | 0.2562 | 0.3543 | 0.2435 | 0.3287 | 0.2524 | 0.3385 |
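For reference, F1@5 compares the top-5 suggested subjects against a document's gold-standard subjects and is averaged over all documents. A minimal per-document sketch (illustrative only; the annif eval command computes this internally):

```python
def f1_at_k(suggested, gold, k=5):
    """F1 score between the top-k suggestions and the gold-standard set."""
    topk = set(suggested[:k])
    gold = set(gold)
    tp = len(topk & gold)  # true positives: suggestions that hit the gold set
    if tp == 0:
        return 0.0
    precision = tp / len(topk)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# e.g. 2 of the top 5 suggestions hit a 3-subject gold standard
print(f1_at_k(["a", "b", "c", "d", "e"], ["a", "b", "x"]))  # -> 0.5
```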

The good news:

  • Lingua starts up quickly (in low-accuracy mode)
  • Lingua doesn't use any more memory than pycld3 (in low-accuracy mode)

The bad news:

  • Lingua is still a lot slower than pycld3 in the grunt work of filtering long documents sentence by sentence. For example, when training MLLM with 2000 documents (truncated to at most 10000 characters each by the limit filter), the user time increased from ~600 to ~2200 seconds. Likewise, evaluation time on 300 documents increased by ~500 seconds. For suggest operations on a single document, the increase was ~7 seconds (but this most likely includes some initialization overhead, so subsequent documents would be processed faster).
  • In general, this experiment didn't show any benefit of language filtering. The evaluation results were actually best for the baseline experiment with no filtering; using either pycld3 or Lingua just made the results worse. For other data sets and languages, the situation could be different.
  • Maybe this is nitpicking, but it was surprisingly hard to get a standard lowercase ISO 639-1 language code out of Lingua. Eventually I found out that an expression like `result.iso_code_639_1.name.lower()` does the trick. pycld3 returns language codes directly, which makes its API easier to use.
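To illustrate the last point: Lingua's detection result is an enum whose ISO codes are themselves enum members, so the lowercase string has to be extracted via `.name.lower()`. A self-contained sketch using stand-in enums (the real `Language` and `IsoCode639_1` classes live in the `lingua` package):

```python
from enum import Enum

# Stand-ins for Lingua's Language and IsoCode639_1 enums; member names
# are uppercase ISO 639-1 codes, as in the real library.
class IsoCode639_1(Enum):
    EN = 1
    FI = 2

class Language(Enum):
    ENGLISH = IsoCode639_1.EN
    FINNISH = IsoCode639_1.FI

    @property
    def iso_code_639_1(self):
        return self.value

def lang_code(result):
    # the detection result is an enum member, not a plain string, so the
    # lowercase ISO 639-1 code is dug out via .name.lower()
    return result.iso_code_639_1.name.lower()

print(lang_code(Language.ENGLISH))  # -> en
```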

I think the take-home message is that if Lingua could be made faster still in the detection process, we could consider switching to it; right now the performance cost is quite high. It would also be nice to identify a data set where language filtering actually improves results; we could then measure whether Lingua does this better than pycld3 or not. This data set was not a good choice in that respect.

osma avatar Aug 26 '22 14:08 osma

Codecov Report

Base: 99.61% // Head: 99.59% // Decreases project coverage by -0.02% :warning:

Coverage data is based on head (3bbe813) compared to base (ec10014). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #615      +/-   ##
==========================================
- Coverage   99.61%   99.59%   -0.03%     
==========================================
  Files          87       87              
  Lines        6038     5946      -92     
==========================================
- Hits         6015     5922      -93     
- Misses         23       24       +1     
| Impacted Files | Coverage Δ |
|---|---|
| annif/transform/__init__.py | 100.00% <ø> (ø) |
| tests/test_transform_langfilter.py | 100.00% <ø> (ø) |
| annif/transform/langfilter.py | 96.42% <100.00%> (-3.58%) :arrow_down: |
| annif/cli.py | 99.67% <0.00%> (-0.02%) :arrow_down: |
| tests/test_cli.py | 100.00% <0.00%> (ø) |


codecov[bot] avatar Aug 26 '22 14:08 codecov[bot]

Thank you @osma for adding my library to your evaluation. :)

It is not surprising that CLD3 is faster than Lingua. CLD3 is implemented in C++, whereas my library is pure Python (with the exception of the internally used NumPy arrays). You said that you would favor a pure Python library for language detection. Such a library will always be slower than one implemented in a low-level language, so there will always be compromises to make. As soon as PyO3 supports exporting Rust enums as Python enums, I will create Python bindings for my Rust implementation of Lingua. This will be significantly faster than the pure Python version.

It seems that you mainly want to classify large documents consisting of multiple sentences. For this kind of input, the high accuracy mode does not provide much benefit; it is better suited for short texts such as tweets. So the advantages of Lingua compared to other language detectors do not pay off for you. That's ok. It's probably better, then, if you stick with CLD3 to benefit from its higher detection speed.

pemistahl avatar Aug 26 '22 15:08 pemistahl

> It is not surprising that CLD3 is faster than Lingua. CLD3 is implemented in C++, whereas my library is pure Python (with the exception of the internally used NumPy arrays). You said that you would favor a pure Python library for language detection. Such a library will always be slower than one implemented in a low-level language, so there will always be compromises to make. As soon as PyO3 supports https://github.com/PyO3/pyo3/issues/417, I will create Python bindings for my Rust implementation of Lingua. This will be significantly faster than the pure Python version.

Understood. But I think the current Lingua implementation (with NumPy vectors) is slower than it needs to be because of the O(log n) lookups: having to do binary searches in big sorted arrays. This is not a question of implementation language but of algorithmic efficiency. Even pure Python (or, in this case, Python helped along by NumPy) can be quite fast. I wrote some ideas about further optimization of Lingua in this discussion.
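The algorithmic point can be illustrated with stdlib tools (this is not Lingua's actual code): looking up an n-gram's frequency by binary search in a sorted array costs O(log n) comparisons per query, while a hash-table lookup is O(1) on average:

```python
import bisect

# Illustrative only: a tiny n-gram frequency table, stored two ways.
ngrams = ["an", "fi", "he", "in", "ng", "th"]   # kept sorted for bisect
freqs  = [0.01, 0.02, 0.03, 0.04, 0.02, 0.05]  # parallel to ngrams

def freq_binary(ngram):
    # O(log n) binary search in the sorted array, as with np.searchsorted
    i = bisect.bisect_left(ngrams, ngram)
    if i < len(ngrams) and ngrams[i] == ngram:
        return freqs[i]
    return 0.0

freq_hash = dict(zip(ngrams, freqs))  # O(1) average-case alternative

print(freq_binary("th"), freq_hash["th"])  # -> 0.05 0.05
```

With millions of lookups per document corpus, the per-query constant and the log factor both add up.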

> It seems that you mainly want to classify large documents consisting of multiple sentences. For this kind of input, the high accuracy mode does not provide much benefit; it is better suited for short texts such as tweets. So the advantages of Lingua compared to other language detectors do not pay off for you. That's ok. It's probably better, then, if you stick with CLD3 to benefit from its higher detection speed.

The problem here is that sticking with CLD3 is not a good option, as explained in the OP of #593: its most active Python binding library (pycld3) no longer appears to be actively maintained, and the other ones (cld3, gcld3) are even older. pycld3 doesn't work with Python 3.10. So unless someone starts maintaining it again, we will need to switch to something else.

osma avatar Aug 26 '22 18:08 osma

Hi @osma,

I have just released Lingua 1.1.2, which removes the most significant performance problems of the previous version. The language models are now stored on disk as serialized NumPy arrays instead of JSON. This reduces the preloading time of the language models significantly (between 1 and 2 seconds for all models on my machine). I have also removed a bottleneck in the language detection code, which makes detection approximately 40% faster.
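The effect of the serialization change can be illustrated with stdlib serializers (Lingua itself uses serialized NumPy arrays; this sketch only approximates the idea that a binary dump avoids per-number text parsing):

```python
import json
import pickle
import time

# Toy "model": a large numeric frequency table, as a language model might hold.
data = [i / 1000 for i in range(200_000)]

json_blob = json.dumps(data)      # text format, parsed number by number
pickle_blob = pickle.dumps(data)  # binary dump, loaded in bulk

t0 = time.perf_counter()
from_json = json.loads(json_blob)
t_json = time.perf_counter() - t0

t0 = time.perf_counter()
from_pickle = pickle.loads(pickle_blob)
t_pickle = time.perf_counter() - t0

# Identical contents either way; the binary load is typically much faster.
assert from_json == from_pickle == data
print(f"json: {t_json:.4f}s  binary: {t_pickle:.4f}s")
```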

Can you please do your evaluation again with the new version? Would you now consider switching to my library?

Thanks. :)

pemistahl avatar Sep 06 '22 20:09 pemistahl

Thanks @pemistahl for the update, that is great news!

I will try to do a new round of experiments soon, comparing language filtering with either pycld3, Lingua or the recently added language detection functionality in Simplemma. This time I will use a dataset that actually should benefit from the filtering - the tutorial data set I used above was a bit disappointing in this respect.

osma avatar Sep 23 '22 11:09 osma

Rebased this PR branch on current master and force-pushed. Also upgraded to Lingua 1.1.2.

osma avatar Sep 23 '22 13:09 osma

Kudos, SonarCloud Quality Gate passed!

  • Bugs: 0 (rating A)
  • Vulnerabilities: 0 (rating A)
  • Security Hotspots: 0 (rating A)
  • Code Smells: 0 (rating A)
  • Coverage: no information
  • Duplication: 0.0%

sonarcloud[bot] avatar Sep 30 '22 06:09 sonarcloud[bot]