gkseg

Handling non-Chinese characters

Open mountain opened this issue 12 years ago • 0 comments

Much Chinese text today contains English keywords or keywords from other languages, so we should provide a mechanism for handling non-Chinese characters. A simple solution is described below:

  • Replace each interval of non-Chinese characters with a placeholder.
  • Segment the text containing the placeholders into words; each placeholder is kept unchanged.
  • Replace each placeholder with the non-Chinese interval it stands for.

This solution will keep the non-Chinese intervals unchanged.

mountain, Jun 21 '12 15:06