langid.py
Mixed languages polarized by "en"
I have the following text, which is a mix of English and Sesotho:
>>>Ska rebona re phela\nKgale re sokola rona re phelela mmino\nO skang potja ka dilo\nKgale re sokola rona re phelela mmino (×2)\nWe five minutes from freedom\nSomebody tell my mama
Classifying the whole text as-is, I get a wrong identification:
>>> Ska rebona re phela\nKgale re sokola rona re phelela mmino\nO skang potja ka dilo\nKgale re sokola rona re phelela mmino (×2)\nWe five minutes from freedom\nSomebody tell my mama
('en', -233.38300132751465)
After removing all of the "en" sentences, I get the ISO 639-1 language code I expected, sl:
>>> Ska rebona re phela\nKgale re sokola rona re phelela mmino\nO skang potja ka dilo\nKgale re sokola rona re phelela mmino
('sl', -154.34662437438965)
Keeping only one "en" sentence, the expected language is still recognized:
>>> Ska rebona re phela\nKgale re sokola rona re phelela mmino\nO skang potja ka dilo\nKgale re sokola rona re phelela mmino (×2)\nWe five minutes from freedom
('sl', -217.8226833343506)
So it seems that the detector is being "polarized" by the "en" sentences in this text.
One would guess that this is due to bias from training.
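One way to check which lines pull the result toward "en" is to classify each line on its own rather than the whole text at once. The following is only a minimal sketch: the helper names classify_per_line and tally_languages are hypothetical, and the classifier is passed in as a function that takes a string and returns a (language, score) pair, which is the shape of the results shown above.

```python
from collections import Counter


def classify_per_line(text, classify):
    """Classify every non-empty line of `text` separately.

    `classify` is any callable mapping a string to a (language, score)
    tuple. It is passed in as a parameter so this helper does not depend
    on a particular detector (an assumption made for illustration).
    """
    results = []
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        lang, score = classify(line)
        results.append((line, lang, score))
    return results


def tally_languages(results):
    """Count how many lines were assigned to each language code."""
    return Counter(lang for _line, lang, _score in results)
```

With langid.py installed, this could be called as classify_per_line(text, langid.classify), where text contains real newlines rather than the escaped \n shown above. If the two English lines are the only ones labelled "en", that would support the idea that they dominate the score for the combined text.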