
enhancement: evaluate language detection packages to see which is best for detecting short text

Open Coniferish opened this issue 1 year ago • 2 comments

Stemming from conversations here and here, it would be worthwhile to run our own comparison of language detection packages to see which is best at detecting the language of short text. The speed and size of each package should also be considered. Packages of interest include langdetect (which we currently use), fasttext (if it is compatible with Python 3.11), langid, and lingua. We also currently use a regex pattern and an arbitrary text-length limit to default to "eng", so that approach should be reconsidered as well.

See detect_languages in lang.py
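
A rough sketch of what such a comparison could look like is below. Everything in it is an assumption made for illustration: the sample texts, the `evaluate` helper, and the `toy_detector` stand-in are hypothetical and are not part of lang.py or of any of the packages listed. The harness accepts any callable that maps a string to a language code and reports accuracy, average time per call, and whether repeated calls on the same text agree.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical labeled short texts: (text, expected ISO 639-1 code).
SAMPLES: List[Tuple[str, str]] = [
    ("Hello, how are you?", "en"),
    ("Bonjour tout le monde", "fr"),
    ("Guten Morgen", "de"),
    ("Muchas gracias", "es"),
]


def toy_detector(text: str) -> str:
    """Stand-in detector for demonstration only; not a real package."""
    hints = {"bonjour": "fr", "guten": "de", "gracias": "es"}
    lowered = text.lower()
    for word, lang in hints.items():
        if word in lowered:
            return lang
    return "en"


def evaluate(
    detect: Callable[[str], str],
    samples: List[Tuple[str, str]],
    repeats: int = 3,
) -> Dict[str, object]:
    """Run `detect` over `samples`, calling it `repeats` times per text."""
    correct = 0
    deterministic = True
    start = time.perf_counter()
    for text, expected in samples:
        results = [detect(text) for _ in range(repeats)]
        # Disagreement between repeated calls means non-deterministic output.
        if len(set(results)) > 1:
            deterministic = False
        if results[0] == expected:
            correct += 1
    elapsed = time.perf_counter() - start
    return {
        "accuracy": correct / len(samples),
        "deterministic": deterministic,
        "seconds_per_call": elapsed / (len(samples) * repeats),
    }


if __name__ == "__main__":
    print(evaluate(toy_detector, SAMPLES))
```

Real candidates could be plugged in by wrapping them to the same signature, for example `langdetect.detect` directly, or `lambda t: langid.classify(t)[0]` since `langid.classify` returns a `(language, score)` tuple. Both return two-letter codes, so a mapping step would likely be needed before comparing against the three-letter codes such as "eng" that we use.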

Coniferish, Oct 06 '23 01:10

Notes/research from 10/2023

| Package | URL | Stars | Last updated | License | Python compatibility | Detects multiple languages in the same text | Deterministic? |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Langdetect | https://github.com/Mimino666/langdetect | 1.5k | 2021 | | | | |
| Polyglot | https://github.com/aboSamoor/polyglot | 2.2k | 2020 | GPLv3 | | | |
| ~~python-polyglot~~ | https://github.com/lainq/polyglot | | | | | | |
| cld3 | https://github.com/google/cld3 | 706 | 2022 | Apache 2.0 | | | |
| pycld3 | https://github.com/bsolomon1124/pycld3 | 135 | 2021 | Apache 2.0 | | | |
| pycld2 | https://github.com/aboSamoor/pycld2 | 147 | 2022 | Apache 2.0 | | | |
| Fasttext | https://github.com/facebookresearch/fastText | 25.1k | 2023 | MIT | | | |
| ~~spaCy~~ | https://github.com/explosion/spaCy | | | | | | |
| lingua | https://github.com/wichert/lingua | 44 | 2022 | | | | |
| lingua-py | https://github.com/pemistahl/lingua-py | 635 | 2023 | Apache 2.0 | | | |
| ~~googletrans~~ | https://github.com/ssut/py-googletrans | | | | | | |
| ~~textblob~~ | https://github.com/sloria/TextBlob | 8.7k | 2023 | | | | |
| langid | https://github.com/saffsd/langid.py | 2.2k | 2017 | BSD-2-Clause | | | |
| py3langid | https://github.com/adbar/py3langid | 26 | 2022 | BSD 3-Clause | Python >= 3.6 | | not documented |

Coniferish, Oct 16 '23 16:10

Thanks for that table. It seems Polyglot is actually licensed under GPLv3; I'm not sure whether that changed recently.

jbne, May 09 '24 17:05