sentencepiece
Tokenization for phonetic languages
Hi,
Is there any way to define a set of sub-words that are never split but are still considered for token generation? This is especially needed for phonetically rich languages like Hindi.
Example: मैं दिव्येश राजपुरा हूं (I am Divyesh Rajpura). In this example, sub-words such as मैं (me), दि (di), व्ये (vye), पु (pu), रा (ra), and हूं (hu) should never be split, and each should be treated as a single unit when generating BPE tokens.
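To make the units in the example concrete, here is a minimal Python sketch that groups Devanagari characters into syllable-like clusters. It is only an illustration of the desired segmentation, not a sentencepiece feature: the function name `split_aksharas` is hypothetical, and the rule it uses (attach combining marks to the preceding character, and attach a consonant that follows a virama) is a simplifying assumption that ignores cases such as ZWJ/ZWNJ.

```python
import unicodedata

# Devanagari sign virama (halant); it joins consonants into a conjunct.
VIRAMA = "\u094d"


def split_aksharas(word):
    """Group the characters of one word into syllable-like clusters.

    Simplified rule (an assumption, not a full Unicode algorithm):
    a character extends the current cluster if it is a combining mark
    (vowel sign, anusvara, virama, ...) or if it directly follows a virama.
    """
    clusters = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        after_virama = bool(clusters) and clusters[-1].endswith(VIRAMA)
        if clusters and (is_mark or after_virama):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    sentence = "मैं दिव्येश राजपुरा हूं"
    for word in sentence.split():
        print(word, "->", split_aksharas(word))
```

With this rule, दिव्येश is grouped as दि, व्ये, श, which matches the units listed above. The request is for the tokenizer to treat such clusters as the smallest pieces it may merge, rather than individual code points.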
Thanks & Regards, Divyesh Rajpura