langid.py

Mixed languages polarized by "en"

Open loretoparisi opened this issue 8 years ago • 1 comment

I have the following text, which is a mix of English and Sesotho:

>>>Ska rebona re phela\nKgale re sokola rona re phelela mmino\nO skang potja ka dilo\nKgale re sokola rona re phelela mmino (×2)\nWe five minutes from freedom\nSomebody tell my mama

With the whole text as it is, I get a wrong identification:

>>> Ska rebona re phela\nKgale re sokola rona re phelela mmino\nO skang potja ka dilo\nKgale re sokola rona re phelela mmino (×2)\nWe five minutes from freedom\nSomebody tell my mama
('en', -233.38300132751465)

After removing all the en sentences, I get the right ISO-639-1 language code, sl:

>>> Ska rebona re phela\nKgale re sokola rona re phelela mmino\nO skang potja ka dilo\nKgale re sokola rona re phelela mmino                                                       
('sl', -154.34662437438965)

Also, when I keep only one en sentence, the right language is recognized:

>>> Ska rebona re phela\nKgale re sokola rona re phelela mmino\nO skang potja ka dilo\nKgale re sokola rona re phelela mmino (×2)\nWe five minutes from freedom                    
('sl', -217.8226833343506)

So, it seems that the detector is being "polarized" by the en sentences in this phrase.
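
One way to check which lines pull the result towards en is to classify each line separately and compare the labels. The following is only a minimal sketch: it assumes langid is installed and that `langid.classify(text)` returns a `(language, score)` pair, as in the outputs above. The helpers `classify_lines`, `majority_language` and `demo` are hypothetical names for illustration, not part of langid.py. Very short lines may be classified less reliably than the full text, so the per-line labels should be read as a hint rather than a verdict.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Any callable that maps a string to a (language, score) pair,
# e.g. langid.classify (assumed here, see the lead-in).
Classifier = Callable[[str], Tuple[str, float]]


def classify_lines(text: str, classify: Classifier) -> List[Tuple[str, str, float]]:
    """Classify each non-empty line on its own; return (line, lang, score) rows."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            lang, score = classify(line)
            rows.append((line, lang, score))
    return rows


def majority_language(rows: List[Tuple[str, str, float]]) -> str:
    """Return the language assigned to the most lines (ties: first seen wins)."""
    counts = Counter(lang for _, lang, _ in rows)
    return counts.most_common(1)[0][0]


def demo() -> None:
    """Print per-line labels for the lyrics from this issue (requires langid)."""
    import langid  # third-party package, assumed installed

    lyrics = "\n".join([
        "Ska rebona re phela",
        "Kgale re sokola rona re phelela mmino",
        "O skang potja ka dilo",
        "Kgale re sokola rona re phelela mmino (×2)",
        "We five minutes from freedom",
        "Somebody tell my mama",
    ])
    rows = classify_lines(lyrics, langid.classify)
    for line, lang, score in rows:
        print(f"{lang}\t{score:.1f}\t{line}")
    print("majority:", majority_language(rows))
```

Calling `demo()` prints one label per line plus the majority label. If the two English lines come back as en and the rest do not, that would be consistent with the whole-text result being dominated by those lines.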

loretoparisi Oct 19 '16 20:10

One would guess that this is due to bias from training.

sixtyfive Sep 14 '21 13:09