langid.py
Stand-alone language identification system
Hello, could someone tell me how to normalize the values when working in a notebook? I would like an equivalent of "python langid.py -n ..." for the notebook...
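As a rough illustration of what normalization means here, the sketch below converts unnormalized log-scores into probabilities that sum to 1 (a softmax). The score values and the helper name `normalize_log_scores` are hypothetical and not part of langid.py. The project README describes constructing an identifier with `norm_probs=True` as the programmatic counterpart of `-n`; that call is shown only as a comment because it needs the package installed.

```python
import math

def normalize_log_scores(log_scores):
    """Map unnormalized log-scores to probabilities that sum to 1 (softmax)."""
    peak = max(log_scores.values())  # subtract the max for numerical stability
    exps = {lang: math.exp(s - peak) for lang, s in log_scores.items()}
    total = sum(exps.values())
    return {lang: v / total for lang, v in exps.items()}

# Hypothetical raw scores for illustration, not real langid.py output.
scores = {"en": -54.4, "de": -60.1, "fr": -62.7}
probs = normalize_log_scores(scores)
best = max(probs, key=probs.get)

# With langid installed, the README describes this route for notebooks:
# from langid.langid import LanguageIdentifier, model
# identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
# identifier.classify("This is a test")
```

In a notebook, the commented lines are probably the shortest path; the pure-Python function is only meant to show what the normalized confidence represents.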
Closes #75
There is a small typo in langid/train/DFfeatureselect.py. Should read `overridden` rather than `overriden`.
Is it possible to provide the trained model data in https://raw.githubusercontent.com/saffsd/langid.py/master/langid/langid.py as a pure JSON file, for easy porting to other libraries that do the same thing?
Where did you get the data from? And what languages are covered and by what ratio?
- JRC-Acquis
- ClueWeb 09
- Wikipedia
- Reuters RCV2
- Debian i18n
Addresses the following scenario: during identification, the set of candidate languages can be restricted.
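The idea of restricting the candidate set can be sketched in plain Python: score every language, then pick the best one among an allowed subset only. The function `classify_restricted` and the scores are hypothetical and do not reflect how this change is implemented. For the released package, the README documents `langid.set_languages([...])` for the same purpose.

```python
def classify_restricted(scores, allowed):
    """Return the highest-scoring language among the allowed ones."""
    candidates = {lang: s for lang, s in scores.items() if lang in allowed}
    if not candidates:
        raise ValueError("none of the allowed languages has a score")
    return max(candidates, key=candidates.get)

# Hypothetical per-language scores for one input string.
scores = {"en": -10.0, "es": -12.0, "fr": -15.0}
unrestricted = classify_restricted(scores, set(scores))
restricted = classify_restricted(scores, {"es", "fr"})
```

Restricting the set changes the answer only when the overall best language falls outside it, as with `en` in this example.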
Is there a way to get a list of supported/currently set languages? I'm thinking programmatically, like: import langid; print(langid.get_languages()) >>> ['af', 'am', 'an', 'ar', 'as', 'az', 'be', 'bg', 'bn', 'br',...
As stated in the README, langid.py comes pre-trained on 97 languages. How can I reproduce this result? I tried the UG language, but it was identified as ZH. I...
It's understandable that performance for very short strings is poor. Could we create a mapping with hand-assigned weights for those? I believe strings like 'yeah', 'no', 'si', 'haha', 'hehe' and...
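One way such a mapping could work is a lookup table consulted before the classifier runs. The sketch below is only an illustration: the table contents, the weights, and the name `classify_with_overrides` are hypothetical, and `fallback` stands in for any classifier that returns a `(language, score)` pair, such as `langid.classify`.

```python
# Hypothetical hand-assigned weights; the values are illustrative only.
SHORT_STRING_OVERRIDES = {
    "yeah": {"en": 1.0},
    "si": {"es": 0.6, "it": 0.4},
}

def classify_with_overrides(text, fallback, overrides=SHORT_STRING_OVERRIDES):
    """Use the hand-made table for known short strings, else call fallback."""
    key = text.strip().lower()
    weights = overrides.get(key)
    if weights:
        lang = max(weights, key=weights.get)
        return lang, weights[lang]
    return fallback(text)

# Stub classifier so the example runs without langid installed.
def stub_classifier(text):
    return ("xx", 0.0)

short_result = classify_with_overrides(" Yeah ", stub_classifier)
long_result = classify_with_overrides("a longer sentence", stub_classifier)
```

Keeping the table outside the model means it can be edited without retraining, at the cost of maintaining it by hand.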
For example, in the sentence "Presidencia de la República - Mexico", the word "de" is classified as "en", but if I change it to " de ", as add space...