
Support for tokenization of languages without spaces

andreekeberg opened this issue 2 years ago • 0 comments

We need to implement a smarter tokenization method that accounts for languages which traditionally do not use spaces between words. Currently these languages produce full-sentence tokens, which are not suitable for the current method of cosine similarity comparisons.

Some of these languages include:

  • Chinese
  • Japanese
  • Thai
  • Khmer
  • Lao
  • Burmese

andreekeberg · Jul 24 '21 01:07