
[Fork tracking] Chinese Segmenter enhancements

Open ManyTheFish opened this issue 1 year ago • 2 comments

This PR is not meant to be merged. It is here to make it easy to follow the enhancements made in https://github.com/lzw65/charabia

ManyTheFish avatar Jan 09 '24 09:01 ManyTheFish

Hi, I looked through zhiwu's code, and the use of a traditional-to-simplified normalizer is very smart. I was just wondering if there's a timeline for when this work will get merged? Thanks for your work!

Kimeiga avatar Jan 11 '24 07:01 Kimeiga

Hello @Kimeiga, I don't know if the traditional-to-simplified normalizer is relevant, because the kvariant table already maps between the two. Moreover, character_converter has some performance issues and is not maintained by a native Chinese speaker. For me, the real enhancement would be to handle pinyin differently and to be able to generate Chinese-specialized n-grams. 🤔 The issue is that this is not an easy problem to tackle. 😞
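
To make the n-gram idea concrete, here is a rough sketch of overlapping character bigrams over runs of Chinese ideographs. It is only an illustration, not charabia's implementation: the function names `is_cjk_ideograph` and `cjk_bigrams` are hypothetical, only the main CJK Unified Ideographs block is checked, and pinyin is not handled at all.

```rust
/// Returns true for characters in the main CJK Unified Ideographs block
/// (U+4E00..=U+9FFF). Extension blocks are deliberately ignored in this sketch.
fn is_cjk_ideograph(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

/// Hypothetical helper: emits overlapping character bigrams for each run of
/// consecutive ideographs. A run made of a single ideograph is emitted as-is.
/// Non-ideograph characters only act as run separators and are dropped.
fn cjk_bigrams(text: &str) -> Vec<String> {
    let mut grams = Vec::new();
    let mut run: Vec<char> = Vec::new();
    // Append a non-ideograph sentinel so the last run is flushed too.
    for c in text.chars().chain(std::iter::once(' ')) {
        if is_cjk_ideograph(c) {
            run.push(c);
            continue;
        }
        if run.len() == 1 {
            grams.push(run[0].to_string());
        } else {
            // `windows(2)` yields nothing for an empty run.
            for pair in run.windows(2) {
                grams.push(pair.iter().collect());
            }
        }
        run.clear();
    }
    grams
}
```

With this sketch, calling `cjk_bigrams("中文分词")` yields the three bigrams 中文, 文分 and 分词. Bigrams trade index size for recall when dictionary segmentation is uncertain, which is part of why doing this well is hard.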

ManyTheFish avatar Jan 16 '24 10:01 ManyTheFish