gkseg
Handling non-Chinese characters
Chinese text today often contains English keywords or terms from other languages, so the segmenter needs a mechanism for handling non-Chinese characters. A simple solution is as follows:
- Replace each interval of non-Chinese characters with a placeholder.
- Segment the text containing the placeholders into words; each placeholder passes through unchanged.
- Replace each placeholder with the non-Chinese interval it stands for.
With this solution the non-Chinese intervals come through segmentation unchanged: the segmenter never splits or alters them.