unstructured
enhancement: evaluate language detection packages to see which is best for detecting short text
Stemming from conversations here and here, it would be worthwhile to run our own comparison of language detection packages to see which is best at detecting the language of short text. The speed and size of each package should also be considered. Packages of interest include langdetect (which we currently use), fasttext (if it is compatible with Python 3.11), langid, and lingua. We also currently use a regex pattern and an arbitrary text-length limit to default to "eng", so that fallback should be reconsidered as well.
See `detect_languages` in `lang.py`.
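To make the proposed comparison concrete, here is a minimal sketch of a harness that scores any number of detectors on short labelled texts and times them. Everything in it is an assumption for illustration: the names `ascii_default_detector`, `benchmark`, `SAMPLES`, and `SHORT_TEXT_LIMIT` are hypothetical, and the regex and length limit only approximate the idea of the current fallback rather than reproduce what `detect_languages` actually does.

```python
import re
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for the short-text fallback described above.
# The real pattern and limit in detect_languages may differ.
ASCII_ONLY = re.compile(r"^[\x00-\x7F]+$")
SHORT_TEXT_LIMIT = 20  # assumed value, for illustration only


def ascii_default_detector(text: str) -> str:
    """Return "eng" for short ASCII-only text, otherwise "unknown"."""
    if len(text) < SHORT_TEXT_LIMIT and ASCII_ONLY.match(text):
        return "eng"
    return "unknown"


def benchmark(
    detectors: Dict[str, Callable[[str], str]],
    samples: List[Tuple[str, str]],
) -> Dict[str, Tuple[float, float]]:
    """Return {detector name: (accuracy, elapsed seconds)} over the samples."""
    results = {}
    for name, detect in detectors.items():
        correct = 0
        start = time.perf_counter()
        for text, expected in samples:
            try:
                predicted = detect(text)
            except Exception:
                # Some packages raise on very short or empty input.
                predicted = "error"
            correct += predicted == expected
        elapsed = time.perf_counter() - start
        results[name] = (correct / len(samples), elapsed)
    return results


# Tiny hand-labelled sample set; a real comparison needs far more data.
SAMPLES = [
    ("Hello there", "eng"),
    ("Good morning", "eng"),
    ("Bonjour", "fra"),
    ("Guten Tag", "deu"),
    ("¿Qué tal?", "spa"),
]

if __name__ == "__main__":
    detectors = {"ascii_default": ascii_default_detector}
    # Real packages can be plugged in the same way, for example
    # langdetect.detect or lambda t: langid.classify(t)[0]. Both return
    # two-letter codes, so they would need mapping to codes like "eng".
    for name, (accuracy, seconds) in benchmark(detectors, SAMPLES).items():
        print(f"{name}: accuracy={accuracy:.2f} time={seconds:.6f}s")
```

Wrapping each package in a plain `str -> str` callable keeps the harness independent of any one library, so packages that fail to install on a given Python version can simply be left out of the `detectors` dict.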
Notes/research from 10/2023
Package | URL | Stars | Last updated | License | Python compatibility | Detects multiple languages in the same text | Deterministic? |
---|---|---|---|---|---|---|---|
Langdetect | https://github.com/Mimino666/langdetect | 1.5k | 2021 | ||||
Polyglot | https://github.com/aboSamoor/polyglot | 2.2k | 2020 | GPLv3 | |||
~~python-polyglot~~ | https://github.com/lainq/polyglot | ||||||
cld3 | https://github.com/google/cld3 | 706 | 2022 | Apache 2.0 | |||
pycld3 | https://github.com/bsolomon1124/pycld3 | 135 | 2021 | Apache 2.0 | |||
pycld2 | https://github.com/aboSamoor/pycld2 | 147 | 2022 | Apache 2.0 | |||
Fasttext | https://github.com/facebookresearch/fastText | 25.1k | 2023 | MIT | |||
~~spaCy~~ | https://github.com/explosion/spaCy | ||||||
lingua | https://github.com/wichert/lingua | 44 | 2022 | ||||
lingua-py | https://github.com/pemistahl/lingua-py | 635 | 2023 | Apache 2.0 | |||
~~googletrans~~ | https://github.com/ssut/py-googletrans | ||||||
~~textblob~~ | https://github.com/sloria/TextBlob | 8.7k | 2023 | ||||
langid | https://github.com/saffsd/langid.py | 2.2k | 2017 | BSD-2-Clause | |||
py3langid | https://github.com/adbar/py3langid | 26 | 2022 | BSD 3-Clause | Python >= 3.6 | not documented | |
Thanks for that table. It seems Polyglot is actually GPLv3; I'm not sure whether that changed recently.