NeMo-Curator

Japanese support for get_word_splitter

Open oumugai opened this issue 1 year ago • 0 comments

Is your feature request related to a problem? I am frustrated that the get_word_splitter function does not handle Japanese text correctly. Japanese is written without spaces between words, so the current logic cannot determine word boundaries accurately. This can lead to inaccurate results in natural language processing tasks that depend on word-level splitting.
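
To make the problem concrete, here is a minimal sketch of what happens when text is split on whitespace. It assumes the current splitter relies on spaces, which this issue implies but does not confirm; the sample sentences are made up for illustration and this is not NeMo-Curator's actual code.

```python
# Illustration only: plain whitespace splitting, not NeMo-Curator's implementation.
english = "I like cats"
japanese = "私は猫が好きです"  # roughly "I like cats"

english_words = english.split()
japanese_words = japanese.split()

print(english_words)   # ['I', 'like', 'cats']
print(japanese_words)  # ['私は猫が好きです']
```

The whole Japanese sentence comes back as one "word", so any word count or word-level filter built on top of it would be off.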

Describe the solution you'd like We would like get_word_splitter to have an implementation specifically for Japanese. Splitting Japanese text into words requires morphological analysis, so a library such as Kuromoji.js or MeCab should be used to segment the text properly.
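
As a rough idea of what this could look like, below is a sketch using MeCab through the mecab-python3 package. The function name make_japanese_splitter and the whitespace fallback are hypothetical and only for illustration; this is not how get_word_splitter is implemented today. It assumes mecab-python3 and a dictionary (for example unidic-lite) are installed, and falls back to whitespace splitting if they are not.

```python
from typing import Callable, List


def make_japanese_splitter() -> Callable[[str], List[str]]:
    """Return a function that splits Japanese text into words.

    Hypothetical helper for this issue, not part of NeMo-Curator.
    """
    try:
        import MeCab  # provided by the mecab-python3 package

        # "-Owakati" asks MeCab to output words separated by spaces.
        tagger = MeCab.Tagger("-Owakati")
    except Exception:
        # Assumption: if MeCab or its dictionary is missing, fall back to
        # whitespace splitting (which leaves Japanese text unsegmented).
        return lambda text: text.split()

    def split(text: str) -> List[str]:
        # parse() returns one space-separated string; split() drops the
        # trailing whitespace and newline.
        return tagger.parse(text).split()

    return split


splitter = make_japanese_splitter()
tokens = splitter("私は猫が好きです")
print(tokens)
```

With MeCab available, the output is a list of several tokens rather than the whole sentence; the exact segmentation depends on the dictionary in use.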

oumugai • Sep 23 '24 07:09