Incorrect detection

Open debuggio opened this issue 1 year ago • 1 comments

Hi guys! I use py3langid==0.2.2 and I found that in some cases Chinese gets a higher probability than it should. For example:

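from py3langid.langid import LanguageIdentifier, MODEL_FILE

# norm_probs=True makes rank() return normalized probabilities
identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
identifier.rank("Al furjan")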

outputs: [('zh', 0.24405981600284576), ('fi', 0.16715779900550842), ('mt', 0.1392195224761963), ('et', 0.10675894469022751), ('sl', 0.07787516713142395), ('en', 0.05285739526152611)......]

I understand that the text is quite short and it may return languages other than English, but Chinese?

Nov 26 '24 06:11 debuggio

The original model is error-prone on short texts, as you say. This is clearly a bug, though.

Dec 02 '24 12:12 adbar
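
Until the underlying bug is fixed, one possible workaround is to post-process the ranking: if the input contains only Latin letters, drop languages that are not normally written in Latin script and renormalize what is left. The sketch below is an illustration only, not part of py3langid. The helper names (is_latin_only, filter_ranking) and the NON_LATIN_LANGS set are assumptions made up for this example, and the input is just the six entries quoted in the report (the rest of the list was elided there), so the renormalized values cover those entries only.

```python
import unicodedata

# Assumption: an illustrative, non-exhaustive set of language codes whose
# text is normally written in a non-Latin script.
NON_LATIN_LANGS = {"zh", "ja", "ko", "ar", "he", "ru", "el", "th", "hi"}


def is_latin_only(text):
    """Return True if text has letters and all of them are Latin letters."""
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all(
        "LATIN" in unicodedata.name(ch, "") for ch in letters
    )


def filter_ranking(text, ranking):
    """Hypothetical helper: drop non-Latin-script languages for Latin-only
    input, then renormalize the remaining probabilities to sum to 1."""
    if not is_latin_only(text):
        return ranking
    kept = [(lang, p) for lang, p in ranking if lang not in NON_LATIN_LANGS]
    total = sum(p for _, p in kept)
    if total == 0:
        return ranking
    return [(lang, p / total) for lang, p in kept]


# The six entries quoted in the report above; the remainder was elided.
reported = [
    ("zh", 0.24405981600284576),
    ("fi", 0.16715779900550842),
    ("mt", 0.1392195224761963),
    ("et", 0.10675894469022751),
    ("sl", 0.07787516713142395),
    ("en", 0.05285739526152611),
]

filtered = filter_ranking("Al furjan", reported)
print(filtered[0][0])  # prints "fi": 'zh' is removed, the order is otherwise kept
```

This only hides the symptom: the top candidate for "Al furjan" is still not English, which matches the point that the model is unreliable on very short inputs.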