Finish development of new text language detection engine

Open rotemdan opened this issue 2 years ago • 0 comments

The two current engines (tinyld and fasttext) aren't always accurate and sometimes produce odd or nonsensical classifications, like classifying English text as Klingon.

I've developed a custom engine, based on N-grams and naïve Bayes inference, with high accuracy, supporting more than 100 languages. However, the work isn't fully ready yet.
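For readers unfamiliar with the approach, here is a minimal sketch of character N-gram classification with naïve Bayes inference. It is not the engine described in this issue: the function names (`char_ngrams`, `train`, `classify`), the add-one smoothing, the uniform language prior, and the tiny training samples are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n_max=3):
    """Return all character N-grams of length 1..n_max, with space padding."""
    padded = f" {text.lower()} "
    return [
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def train(samples, n_max=3):
    """Count N-grams per language. `samples` maps language code -> list of texts."""
    counts = {lang: Counter() for lang in samples}
    for lang, texts in samples.items():
        for text in texts:
            counts[lang].update(char_ngrams(text, n_max))
    vocab = set().union(*counts.values())
    return {
        "counts": counts,
        "totals": {lang: sum(c.values()) for lang, c in counts.items()},
        "vocab_size": len(vocab),
        "n_max": n_max,
    }


def classify(model, text):
    """Pick the language with the highest summed log-likelihood.

    Assumes a uniform prior over languages and add-one (Laplace) smoothing,
    so unseen N-grams get a small nonzero probability.
    """
    grams = char_ngrams(text, model["n_max"])
    scores = {}
    for lang, counter in model["counts"].items():
        denom = model["totals"][lang] + model["vocab_size"]
        scores[lang] = sum(math.log((counter[g] + 1) / denom) for g in grams)
    return max(scores, key=scores.get)


# Toy training data (illustrative only; a real model needs far more text).
SAMPLES = {
    "en": [
        "the quick brown fox jumps over the lazy dog",
        "this is a short sentence in english",
        "where is the nearest train station",
    ],
    "es": [
        "el rápido zorro marrón salta sobre el perro perezoso",
        "esta es una frase corta en español",
        "dónde está la estación de tren más cercana",
    ],
}

model = train(SAMPLES)
print(classify(model, "the weather is nice"))
print(classify(model, "la casa está cerca"))
```

Summing log-probabilities rather than multiplying raw probabilities avoids floating-point underflow on longer inputs.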

Things left to be done:

Compile and optimize model data as a single binary file
Only include shorter N-grams for some easy-to-classify languages like Chinese and Japanese
Decide whether to include less common languages, or how to produce smaller and larger variants of the model with varying sets of supported languages
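As a rough illustration of the first item above, the sketch below packs a table of per-language N-gram log-probabilities into one binary blob and reads it back. The layout (a `LID1` magic tag, little-endian counts, length-prefixed UTF-8 strings, float32 scores) is entirely hypothetical and not the format planned for this engine; `pack_model` and `unpack_model` are made-up names.

```python
import struct

MAGIC = b"LID1"  # hypothetical file tag, not an existing format


def pack_model(languages, table):
    """Serialize to bytes.

    languages: list of language codes.
    table: dict mapping N-gram -> list of log-probabilities, one per language.
    """
    out = bytearray(MAGIC)
    out += struct.pack("<HI", len(languages), len(table))
    for code in languages:
        raw = code.encode("utf-8")
        out += struct.pack("<B", len(raw)) + raw
    for ngram, logps in sorted(table.items()):
        raw = ngram.encode("utf-8")
        out += struct.pack("<B", len(raw)) + raw
        # float32 halves the size compared to float64, at reduced precision.
        out += struct.pack(f"<{len(languages)}f", *logps)
    return bytes(out)


def unpack_model(data):
    """Inverse of pack_model; returns (languages, table)."""
    if data[:4] != MAGIC:
        raise ValueError("not a model file")
    n_lang, n_entries = struct.unpack_from("<HI", data, 4)
    pos = 4 + struct.calcsize("<HI")

    languages = []
    for _ in range(n_lang):
        (length,) = struct.unpack_from("<B", data, pos)
        pos += 1
        languages.append(data[pos:pos + length].decode("utf-8"))
        pos += length

    row = struct.Struct(f"<{n_lang}f")
    table = {}
    for _ in range(n_entries):
        (length,) = struct.unpack_from("<B", data, pos)
        pos += 1
        ngram = data[pos:pos + length].decode("utf-8")
        pos += length
        table[ngram] = list(row.unpack_from(data, pos))
        pos += row.size
    return languages, table
```

Under this assumed layout, a smaller model variant could be produced by filtering `languages` and the matching score columns before packing, and shorter N-grams for selected languages could be enforced by dropping longer entries from `table` at the same stage.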

Aug 25 '23 06:08 rotemdan