echogarden
Finish development of new text language detection engine
The two current engines (tinyld and fasttext) aren't always accurate and sometimes produce odd or nonsensical classifications, like classifying English text as Klingon.
I've developed a custom engine, based on N-grams and naïve Bayes inference, with high accuracy, supporting more than 100 languages. However, the work isn't fully ready yet.
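For readers unfamiliar with the approach, here is a minimal sketch of character N-gram classification with naïve Bayes scoring. It is not the engine's actual code: the function names (`extractNgrams`, `train`, `classify`), the uniform language prior, the add-one smoothing, and the tiny training samples are all assumptions made for illustration.

```typescript
// Illustrative sketch only. All names and design choices here are hypothetical,
// not taken from the echogarden code base.

interface LanguageModel {
  counts: Map<string, number> // N-gram -> occurrences in the training text
  total: number               // total number of N-grams in the training text
}

// Split text into overlapping character N-grams (by code point, lowercased).
function extractNgrams(text: string, n: number): string[] {
  const chars = Array.from(text.toLowerCase())
  const ngrams: string[] = []
  for (let i = 0; i + n <= chars.length; i++) {
    ngrams.push(chars.slice(i, i + n).join(''))
  }
  return ngrams
}

// Count N-grams for each language from a single training string per language.
function train(samples: Record<string, string>, n: number): Map<string, LanguageModel> {
  const models = new Map<string, LanguageModel>()
  for (const language of Object.keys(samples)) {
    const counts = new Map<string, number>()
    const ngrams = extractNgrams(samples[language], n)
    for (const ngram of ngrams) {
      counts.set(ngram, (counts.get(ngram) ?? 0) + 1)
    }
    models.set(language, { counts, total: ngrams.length })
  }
  return models
}

// Pick the language with the highest sum of log-probabilities.
// Assumes a uniform prior over languages and add-one (Laplace) smoothing.
function classify(text: string, models: Map<string, LanguageModel>, n: number): string {
  const vocabulary = new Set<string>()
  models.forEach((model) => model.counts.forEach((_count, ngram) => vocabulary.add(ngram)))

  const ngrams = extractNgrams(text, n)
  let bestLanguage = ''
  let bestScore = -Infinity

  models.forEach((model, language) => {
    let score = 0
    for (const ngram of ngrams) {
      const count = model.counts.get(ngram) ?? 0
      score += Math.log((count + 1) / (model.total + vocabulary.size))
    }
    if (score > bestScore) {
      bestScore = score
      bestLanguage = language
    }
  })

  return bestLanguage
}

// Toy usage with made-up training samples.
const models = train({
  en: 'the quick brown fox jumps over the lazy dog and the cat sleeps in the house',
  es: 'el rápido zorro marrón salta sobre el perro perezoso y el gato duerme en la casa',
}, 3)

console.log(classify('the dog and the fox', models, 3))
```

Summing log-probabilities rather than multiplying raw probabilities avoids numeric underflow on longer inputs. A real model would be trained on far larger corpora and would likely combine several N-gram lengths.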
Things left to be done:
- Compile and optimize model data as a single binary file
- Only include shorter N-grams for some easy-to-classify languages like Chinese and Japanese
- Decide whether to include less common languages, or how to produce smaller and larger variants of the model that support different sets of languages
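As a rough illustration of the first item, the sketch below packs a table of N-gram log-probabilities into one byte array and reads it back. The layout (a `uint32` entry count, then a `uint16` key length, UTF-8 key bytes and a little-endian `float32` value per entry) and the names `packModel` and `unpackModel` are assumptions for the example, not a decided format.

```typescript
// Illustrative sketch only. The binary layout and function names are hypothetical.
// Layout: [uint32 entryCount] then, per entry:
//   [uint16 keyByteLength][key as UTF-8 bytes][float32 value], all little-endian.

function packModel(entries: Map<string, number>): Uint8Array {
  const encoder = new TextEncoder()
  const encoded: { key: Uint8Array; value: number }[] = []
  let size = 4 // space for the entry count

  entries.forEach((value, key) => {
    const keyBytes = encoder.encode(key)
    encoded.push({ key: keyBytes, value })
    size += 2 + keyBytes.length + 4
  })

  const bytes = new Uint8Array(size)
  const view = new DataView(bytes.buffer)
  let offset = 0

  view.setUint32(offset, encoded.length, true)
  offset += 4

  for (const { key, value } of encoded) {
    view.setUint16(offset, key.length, true)
    offset += 2
    bytes.set(key, offset)
    offset += key.length
    view.setFloat32(offset, value, true)
    offset += 4
  }

  return bytes
}

function unpackModel(bytes: Uint8Array): Map<string, number> {
  const decoder = new TextDecoder()
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  const entries = new Map<string, number>()
  let offset = 0

  const count = view.getUint32(offset, true)
  offset += 4

  for (let i = 0; i < count; i++) {
    const keyLength = view.getUint16(offset, true)
    offset += 2
    const key = decoder.decode(bytes.subarray(offset, offset + keyLength))
    offset += keyLength
    entries.set(key, view.getFloat32(offset, true))
    offset += 4
  }

  return entries
}
```

Storing values as `float32` halves the size compared to JavaScript's native doubles at the cost of some precision. Further optimization (quantization, shared key tables, compression) is left open here.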