
enhancement: evaluate language detection packages to see which is best for detecting short text

Open Coniferish opened this issue 10 months ago • 2 comments

Stemming from conversations here and here, it would be worthwhile to run our own comparison of language detection packages to see which is best at detecting the language of short text. Package speed and size should also be considered. Packages of interest include langdetect (which we currently use), fasttext (if it is compatible with Python 3.11), langid, and lingua. We also currently use a regex pattern and an arbitrary text-length limit to default to "eng", so that fallback should be reconsidered as part of the comparison.

See detect_languages in lang.py
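
A rough sketch of what such a comparison harness could look like, measuring accuracy and wall-clock time per package on labeled short strings. The names `benchmark` and `load_optional_detectors` are hypothetical and not part of the existing code, and the labeled samples would need to be supplied. Only langdetect and langid are wired up here, using `langdetect.detect` and `langid.classify`; both return two-letter codes such as "en", so the expected labels are assumed to use the same scheme rather than "eng".

```python
import time
from typing import Callable, Dict, List, Tuple

# A detector takes a text and returns a language code (assumed two-letter, e.g. "en").
Detector = Callable[[str], str]


def benchmark(
    detectors: Dict[str, Detector],
    samples: List[Tuple[str, str]],
) -> Dict[str, Dict[str, float]]:
    """Return accuracy and elapsed seconds for each detector on (text, expected) pairs."""
    if not samples:
        raise ValueError("samples must not be empty")
    results: Dict[str, Dict[str, float]] = {}
    for name, detect in detectors.items():
        correct = 0
        start = time.perf_counter()
        for text, expected in samples:
            try:
                predicted = detect(text)
            except Exception:
                # Some packages raise on very short or featureless input;
                # count that as a miss instead of aborting the run.
                predicted = ""
            if predicted == expected:
                correct += 1
        elapsed = time.perf_counter() - start
        results[name] = {"accuracy": correct / len(samples), "seconds": elapsed}
    return results


def load_optional_detectors() -> Dict[str, Detector]:
    """Collect detectors for whichever candidate packages are installed."""
    detectors: Dict[str, Detector] = {}
    try:
        from langdetect import detect as langdetect_detect

        detectors["langdetect"] = langdetect_detect
    except ImportError:
        pass
    try:
        import langid

        # langid.classify returns a (language, score) tuple.
        detectors["langid"] = lambda text: langid.classify(text)[0]
    except ImportError:
        pass
    return detectors
```

Calling `benchmark(load_optional_detectors(), samples)` with a list of short labeled strings would give a first comparison; fasttext and lingua could be added to the dictionary in the same way, and package size would need to be measured separately.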

Coniferish • Oct 06 '23 01:10