language-detector

Add more languages

Open rashmiranjanrrs opened this issue 5 years ago • 2 comments

Hey, can you add more Indic languages, or can you share the pattern or structure of a subset so that I can add new languages as I need them? How do I add a new subset?

rashmiranjanrrs, Dec 22 '19 06:12

Hey,

How to add a new subset?

  • Clone the repository.
  • Create your subset file in the src/LanguageDetector/subsets/ folder.
  • Write at least one test in the tests/LanguageDetector/LanguageDetectionTest.php file to validate your subset.
  • Push your changes with the commit message "Add new language {the new language}".

Subset structure

A subset file is a JSON encoded file with the following structure:

{
  "freq":{"D":662077, [...], "tha":240340},
  "n_words":[260942223,308553243,224934017],
  "name":"en"
}
  • freq maps each n-gram (the key) to an integer (the value) giving the number of occurrences found in the source files. LanguageDetector accepts unigrams, bigrams and trigrams.
  • n_words is a series of 3 integers giving the total number of occurrences for each n-gram size, in order (1, 2, 3).
  • name is the name of the language ("en" in the example above).
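
As a quick sanity check before adding a file to the subsets folder, the structure described above can be verified with a few lines of Python. This is only a sketch, not part of the library: `validate_subset` is a hypothetical helper, and it assumes that an n-gram's size is its number of Unicode code points, which may not match how the library counts characters for scripts with combining marks.

```python
import json


def _is_count(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0


def validate_subset(raw):
    """Parse a subset JSON string and check the three documented fields.

    Hypothetical helper: it only checks the shape described in this thread.
    """
    subset = json.loads(raw)
    freq, n_words, name = subset["freq"], subset["n_words"], subset["name"]

    if not isinstance(name, str) or not name:
        raise ValueError("name must be a non-empty string")
    if len(n_words) != 3 or not all(_is_count(n) for n in n_words):
        raise ValueError("n_words must hold 3 integers (n-gram sizes 1, 2, 3)")
    for ngram, count in freq.items():
        # Assumption: size is measured in code points (unigram to trigram).
        if not 1 <= len(ngram) <= 3:
            raise ValueError(f"unexpected n-gram size: {ngram!r}")
        if not _is_count(count):
            raise ValueError(f"count for {ngram!r} must be a non-negative integer")
    return subset


# A shortened version of the example above (the elided entries are left out).
sample = """
{
  "freq": {"D": 662077, "tha": 240340},
  "n_words": [260942223, 308553243, 224934017],
  "name": "en"
}
"""
subset = validate_subset(sample)
print(subset["name"], len(subset["freq"]))
```

A check like this does not replace the test in LanguageDetectionTest.php; it only catches malformed files early.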

More

As you may guess, a "learning" tool has to be written to generate a subset. It's not yet packaged with the library but might be in the future. A piece of advice: to generate a reliable subset file, you have to collect a large number of files in the desired language, if possible covering several variants of the language.
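
To give an idea of what such a tool could look like, here is a minimal Python sketch that counts unigrams, bigrams and trigrams and writes them in the subset structure. It is not the library's own tool: `build_subset` is a hypothetical name, and splitting on whitespace, counting n-grams inside each word without padding, and skipping any normalisation or pruning of rare n-grams are all assumptions that may differ from how the bundled subsets were produced.

```python
import json
from collections import Counter


def build_subset(texts, name):
    """Count 1-, 2- and 3-grams in `texts` and return a subset-shaped dict.

    Assumptions (not confirmed by the library): words are split on
    whitespace, n-grams never cross word boundaries, and no padding,
    lowercasing or pruning is applied.
    """
    freq = Counter()
    n_words = [0, 0, 0]  # totals for n-gram sizes 1, 2, 3
    for text in texts:
        for word in text.split():
            for n in (1, 2, 3):
                # Slide a window of n characters over the word.
                for i in range(len(word) - n + 1):
                    freq[word[i:i + n]] += 1
                    n_words[n - 1] += 1
    return {"freq": dict(freq), "n_words": n_words, "name": name}


# Tiny illustration; a real subset needs a large and varied corpus.
demo = build_subset(["the tha"], "xx")
print(json.dumps(demo, ensure_ascii=False))
```

`ensure_ascii=False` keeps non-Latin n-grams readable in the output, which is convenient when inspecting a subset for an Indic language.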

Hope this helps

landrok, Jan 28 '20 14:01