charabia
[Fork tracking] Chinese Segmenter enhancements
This PR is not meant to be merged. It exists only to make it easier to follow the enhancements made in https://github.com/lzw65/charabia
Hi, I looked through zhiwu's code, and the use of a traditional-to-simplified normalizer is very smart. I was just wondering if there's a timeline for when this work will get merged? Thanks for your work!
Hello @Kimeiga, I'm not sure the traditional-to-simplified normalizer is relevant, because the kvariant table already relates the two. Moreover, character_converter has some performance issues and is not maintained by a native Chinese speaker. I'm still wondering, but for me the real enhancement would be to handle pinyin differently and to be able to generate Chinese-specific ngrams. 🤔 The issue is that it is not an easy problem to tackle. 😞