Finish development of new text language detection engine

Open rotemdan opened this issue 2 years ago • 0 comments

The two current engines (tinyld and fasttext) aren't always accurate and sometimes produce odd or nonsensical classifications, like classifying English text as Klingon.

I've developed a custom engine, based on N-grams and naïve Bayes inference, that achieves high accuracy and supports more than 100 languages. However, the work isn't fully ready yet.
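To illustrate the general idea (this is not the actual engine), here is a minimal sketch of character N-gram classification with naïve Bayes. It assumes character trigrams, a uniform prior over languages and add-one smoothing. All names (`NgramModel`, `extractNgrams`, `trainModel`, `detectLanguage`) are hypothetical and chosen only for this example.

```typescript
// Illustrative sketch only: character trigram counts scored with naive Bayes.
// Every name below is hypothetical and not taken from the echogarden codebase.

type NgramModel = {
  counts: Map<string, number> // N-gram -> occurrences in the training text
  total: number               // total number of N-grams seen for this language
}

function extractNgrams(text: string, n: number = 3): string[] {
  // Collapse whitespace and pad with spaces so word boundaries produce N-grams too
  const normalized = ' ' + text.toLowerCase().replace(/\s+/g, ' ').trim() + ' '
  const chars = Array.from(normalized) // split by code point, not UTF-16 unit
  const ngrams: string[] = []
  for (let i = 0; i + n <= chars.length; i++) {
    ngrams.push(chars.slice(i, i + n).join(''))
  }
  return ngrams
}

function trainModel(samples: string[]): NgramModel {
  const counts = new Map<string, number>()
  let total = 0
  for (const sample of samples) {
    for (const ngram of extractNgrams(sample)) {
      counts.set(ngram, (counts.get(ngram) || 0) + 1)
      total++
    }
  }
  return { counts, total }
}

function detectLanguage(text: string, models: Map<string, NgramModel>): string {
  // Size of the shared N-gram vocabulary, used for add-one (Laplace) smoothing
  const vocabulary = new Set<string>()
  models.forEach((model) => {
    model.counts.forEach((_count, ngram) => vocabulary.add(ngram))
  })
  const vocabSize = vocabulary.size

  const ngrams = extractNgrams(text)
  let bestLanguage = ''
  let bestScore = -Infinity

  models.forEach((model, language) => {
    // Uniform prior assumed, so only the summed log-likelihoods are compared
    let score = 0
    for (const ngram of ngrams) {
      const count = model.counts.get(ngram) || 0
      score += Math.log((count + 1) / (model.total + vocabSize))
    }
    if (score > bestScore) {
      bestScore = score
      bestLanguage = language
    }
  })
  return bestLanguage
}
```

A model for each language would be built with `trainModel` from sample text, stored in a `Map` keyed by language code, and passed to `detectLanguage`. Summing log-probabilities instead of multiplying raw probabilities avoids numeric underflow on longer inputs.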

Things left to be done:

  • Compile and optimize model data as a single binary file
  • Only include shorter N-grams for some easy-to-classify languages like Chinese and Japanese
  • Decide whether to include less common languages, or how to produce smaller and larger variants of the model with varying sets of supported languages

rotemdan • Aug 25 '23 06:08