language-detector
Add more languages
Hey, can you add more Indic languages, or share the pattern or structure of a subset so that I can add new languages as I need them? How do I add a new subset?
Hey,
How to add a new subset?
- clone the repository
- create your subset file in the `src/LanguageDetector/subsets/` folder
- write at least one test in the `tests/LanguageDetector/LanguageDetectionTest.php` file to validate your subset
- then you can push with the commit message `Add new language {the new language}`
Subset structure
A subset file is a JSON-encoded file with the following structure:
{
"freq":{"D":662077, [...], "tha":240340},
"n_words":[260942223,308553243,224934017],
"name":"en"
}
- freq contains a list of key => value pairs, where the key is the ngram and the value is an integer giving the number of occurrences found in the source files. LanguageDetector accepts unigrams, bigrams and trigrams.
- n_words is a series of 3 integers giving the total number of occurrences, ordered by ngram size (1, 2, 3)
- name is the name of the language
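To make the structure concrete, here is a minimal Python sketch that checks a decoded subset against the three rules above. It is not part of the library: the function name `validate_subset` is hypothetical, and it assumes ngram length is counted in Unicode code points, which may not match how the library measures it for scripts with combining marks.

```python
def validate_subset(subset: dict) -> list:
    """Return a list of problems; an empty list means the structure looks fine."""
    problems = []

    # freq: ngram => integer count, with ngrams of size 1, 2 or 3.
    freq = subset.get("freq")
    if not isinstance(freq, dict) or not freq:
        problems.append("freq must be a non-empty object")
    else:
        for ngram, count in freq.items():
            # Assumption: length is measured in code points.
            if not 1 <= len(ngram) <= 3:
                problems.append(f"ngram {ngram!r} is not 1 to 3 characters long")
            if isinstance(count, bool) or not isinstance(count, int) or count < 0:
                problems.append(f"count for {ngram!r} must be a non-negative integer")

    # n_words: exactly 3 integers, one total per ngram size (1, 2, 3).
    n_words = subset.get("n_words")
    if not (
        isinstance(n_words, list)
        and len(n_words) == 3
        and all(isinstance(n, int) and not isinstance(n, bool) for n in n_words)
    ):
        problems.append("n_words must be a list of 3 integers")

    # name: the language name.
    name = subset.get("name")
    if not isinstance(name, str) or not name:
        problems.append("name must be a non-empty string")

    return problems
```

Running it on the result of `json.load()` for a new subset file before writing the PHP test is a cheap way to catch a malformed file early.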
More
As you may guess, a "learning" tool has to be written to generate a subset. It's not yet packaged with the library but might be in the future. A word of advice: to generate a reliable subset file, you have to collect a large number of files in the desired language and, if possible, from several variants of that language.
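Until such a tool exists, a rough Python sketch of what it might do is shown below. This is not the library's own method: the function name `build_subset` is hypothetical, and it assumes ngrams are character ngrams counted inside whitespace-separated words, with no normalization or pruning of rare ngrams. A real tool would probably need both.

```python
import json
from collections import Counter


def build_subset(texts, name, max_n=3):
    """Count character ngrams of size 1..max_n and return a subset-shaped dict."""
    freq = Counter()
    n_words = [0] * max_n  # total occurrences per ngram size

    for text in texts:
        # Assumption: ngrams never span whitespace.
        for word in text.split():
            for n in range(1, max_n + 1):
                for i in range(len(word) - n + 1):
                    freq[word[i:i + n]] += 1
                    n_words[n - 1] += 1

    return {"freq": dict(freq), "n_words": n_words, "name": name}


if __name__ == "__main__":
    subset = build_subset(["the quick brown fox", "the lazy dog"], "en")
    # ensure_ascii=False keeps non-Latin ngrams readable in the output.
    print(json.dumps(subset, ensure_ascii=False))
```

In practice the input would be the large, varied collection of files mentioned above, read as UTF-8, and the printed JSON would be saved as the new subset file.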
Hope this helps