
Chinese word segmentation splits text into single characters

Open xiaominger opened this issue 2 years ago • 2 comments

Run the following code (`tabooSegmentCustomDicList` contains more than 2000 words):

```go
for _, tabooSegmentCustomDic := range tabooSegmentCustomDicList {
	lowerCaseWord := strings.ToLower(tabooSegmentCustomDic.Word)
	segmentutil.AddWord(lowerCaseWord)
}

// AddWord adds a word to the segmenter dictionary with frequency 100.
func AddWord(word string) bool {
	defer recoverPanic(word)
	err := seg.AddToken(word, 100)
	if err != nil {
		// Two arguments need two format verbs.
		logger.Errorf("Error when AddWord, %s: %v", word, err)
		return false
	}
	return true
}

// TextSegment cuts text into tokens.
func TextSegment(text string) []string {
	defer recoverPanic(text)
	return seg.Cut(text)
}
```

TextSegment("api发送文本loumès 𝘾𝘼𝙍𝙏𝙄𝙀𝙍")

The result is ["api","发","送","文","本","lou","mès"," ","𝘾𝘼𝙍𝙏𝙄𝙀𝙍"], so 发送文本 is split into single characters and loumès is split in two.

xiaominger · Aug 15 '23 09:08

Setting `DefaultAnalyzer` to `cjk.AnalyzerName` will resolve the issue.

zwj186 · Nov 09 '23 10:11

How do I set `DefaultAnalyzer`? I searched all the files in the repo and could not find this keyword or setting.

kms9 · Jun 15 '24 18:06