
Chinese segmentation not correct

Open sivdead opened this issue 2 years ago • 2 comments

I noticed that this program uses jieba.cut to segment Chinese text, but it doesn't always work well. For example, for the Chinese phrase 永永远远是龙的传人, jieba.cut produces 永永远远/是/龙的传人, whereas jieba.cut_for_search produces 永远/远远/永永远远/是/传人/龙的传人, which I think is better for index search.
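To make the recall difference concrete, here is a minimal sketch in Python. The two token lists are copied from the outputs quoted above rather than produced by running jieba, and `is_indexed` is a hypothetical helper that treats search as exact term lookup, which is a simplification of what Meilisearch actually does.

```python
# Token lists as quoted in this report (not generated by calling jieba here).
CUT = ["永永远远", "是", "龙的传人"]
CUT_FOR_SEARCH = ["永远", "远远", "永永远远", "是", "传人", "龙的传人"]


def is_indexed(term, tokens):
    """Hypothetical helper: True if `term` is one of the indexed tokens."""
    return term in set(tokens)


if __name__ == "__main__":
    # Compare which query terms each segmentation would let a user find.
    for term in ("永远", "传人", "龙的传人"):
        print(term, is_indexed(term, CUT), is_indexed(term, CUT_FOR_SEARCH))
```

With the coarse segmentation, a query for 永远 or 传人 has no matching token, while the search-mode segmentation contains both.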

sivdead avatar Jul 05 '23 08:07 sivdead

I can make a PR to solve this if you think this should be fixed.

sivdead avatar Jul 05 '23 08:07 sivdead

Hello @sivdead, you're right, using cut_for_search would increase the recall of Meilisearch by splitting words in different ways. However, Meilisearch relies on word positions for queries, and Jieba.cut_for_search doesn't give any clue about the position of each token. Moreover, charabia does not support shifting tokens. To support this kind of position-shifting behavior, the charabia output would have to be changed to a tree shape. For instance, 永永远远是龙的传人 would be shaped as:

永永远远 ──┬─► 是 ─┬─► 龙的传人
永远 ─────┤       └─► 传人
远远 ─────┘

This is not possible without a huge amount of work, but I have to admit that it would significantly enhance search recall.
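As a rough illustration of what shared positions could look like, the sketch below flattens the tree into (token, position) pairs, where every sub-word gets the position of the coarse word that contains it. This is only an assumption about one possible representation, not how charabia or Meilisearch work today. `assign_positions` and `group_by_position` are hypothetical helpers, the token lists are the ones quoted in this issue, and the code assumes the search-mode output lists sub-words before the full word they belong to, as in that example.

```python
from collections import defaultdict

# Token lists as quoted in this issue (not generated by calling jieba here).
COARSE = ["永永远远", "是", "龙的传人"]
FINE = ["永远", "远远", "永永远远", "是", "传人", "龙的传人"]


def assign_positions(coarse, fine):
    """Give each fine token the index of the coarse word containing it.

    Assumes sub-words appear before their full word in `fine`.
    """
    pairs = []
    i = 0  # index of the coarse word currently being covered
    for token in fine:
        if i >= len(coarse) or token not in coarse[i]:
            raise ValueError(f"cannot align token {token!r}")
        pairs.append((token, i))
        if token == coarse[i]:
            i += 1  # the full word closes this position
    return pairs


def group_by_position(pairs):
    """Collect the alternative tokens that share each position."""
    grouped = defaultdict(list)
    for token, position in pairs:
        grouped[position].append(token)
    return dict(grouped)


if __name__ == "__main__":
    for position, tokens in group_by_position(assign_positions(COARSE, FINE)).items():
        print(position, tokens)
```

Each position then holds the alternatives from one column of the diagram: 永远, 远远 and 永永远远 share position 0, 是 is at position 1, and 传人 and 龙的传人 share position 2.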

Thank you for your report, and sorry for the delay in answering,

see you!

ManyTheFish avatar Jul 13 '23 17:07 ManyTheFish