
Support for tokenization of languages without spaces

andreekeberg opened this issue 2 years ago • 0 comments

We need to implement a smarter tokenization method that accounts for languages which traditionally do not use spaces between words. Currently these languages produce full-sentence tokens, which are not suitable for the current method of cosine similarity comparisons.

Some of these languages include:

  • Chinese
  • Japanese
  • Thai
  • Khmer
  • Lao
  • Burmese

andreekeberg · Jul 24 '21 01:07