py3langid
Incorrect detection
Hi guys! I use py3langid==0.2.2 and found that in some cases Chinese gets a higher probability than it should. For example:
from py3langid.langid import LanguageIdentifier, MODEL_FILE
identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
identifier.rank("Al furjan")
outputs: [('zh', 0.24405981600284576), ('fi', 0.16715779900550842), ('mt', 0.1392195224761963), ('et', 0.10675894469022751), ('sl', 0.07787516713142395), ('en', 0.05285739526152611)......]
I understand that the text is quite short and it may return languages other than English, but Chinese?
The original model is error-prone on short texts, but as you say, this is clearly a bug.
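In the meantime, one possible way to guard against low-confidence results on short inputs is to discard the top-ranked language when its normalized probability falls below a cutoff. This is only a minimal sketch that assumes the list-of-tuples shape shown in the output above; the helper name `top_if_confident` and the 0.5 cutoff are hypothetical and not part of py3langid.

```python
from typing import List, Optional, Tuple


def top_if_confident(
    ranked: List[Tuple[str, float]], threshold: float = 0.5
) -> Optional[str]:
    """Return the most probable language, or None if it is below the cutoff.

    `ranked` is assumed to be a list of (language, normalized probability)
    pairs, like the output of identifier.rank() with norm_probs=True.
    The function name and default threshold are illustrative assumptions.
    """
    if not ranked:
        return None
    # Pick the pair with the highest probability, regardless of input order.
    lang, prob = max(ranked, key=lambda pair: pair[1])
    return lang if prob >= threshold else None


# First three entries of the ranking reported in this issue.
ranked = [
    ("zh", 0.24405981600284576),
    ("fi", 0.16715779900550842),
    ("mt", 0.1392195224761963),
]
print(top_if_confident(ranked))  # None: 0.244 is below the 0.5 cutoff
```

With a cutoff like this, the "Al furjan" example would be reported as undetermined instead of Chinese. The right threshold depends on the data and would need tuning.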