ml-classify-text-js
Support for tokenization of languages without spaces
We need a smarter tokenization method that accounts for languages which traditionally do not use spaces between words. At the moment, text in these languages is tokenized into full-sentence tokens, which are unsuitable for the current method of cosine-similarity comparison.
Some of these languages include:
- Chinese
- Japanese
- Thai
- Khmer
- Lao
- Burmese
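One possible approach, sketched below: modern Node.js and browsers ship the ICU-backed `Intl.Segmenter` API, whose `word` granularity can segment scripts that do not use spaces (including Chinese, Japanese, and Thai). The `tokenize` helper here is a hypothetical illustration, not part of the library's current API:

```javascript
// Sketch: word-level tokenization via Intl.Segmenter (ICU-backed).
// Handles both space-delimited and non-space-delimited scripts.
// `tokenize` is a hypothetical helper, not the library's actual API.
function tokenize(text, locale = 'und') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' })
  return Array.from(segmenter.segment(text))
    .filter((s) => s.isWordLike) // drop punctuation and whitespace segments
    .map((s) => s.segment.toLowerCase())
}

// Space-delimited input still tokenizes as expected:
//   tokenize('Hello, world!') → ['hello', 'world']
// Non-space-delimited input is split at dictionary word boundaries
// rather than collapsing into a single full-sentence token.
```

A dedicated segmenter (e.g. a dictionary-based library per language) may give better boundaries for some of the listed languages, but `Intl.Segmenter` requires no extra dependencies and degrades gracefully for scripts it cannot segment.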