
Tokenization for phonetic languages

Open · divyeshrajpura4114 opened this issue 9 months ago · 1 comment

Hi,

Is there any way we can define a set of sub-words that should not be split but are still considered during token generation? This is especially required for phonetically rich languages like Hindi.

Ex: मैं दिव्येश राजपुरा हूं (I am Divyesh Rajpura). In the above example, sub-words such as मैं (me), दि (di), व्ये (vye), पु (pu), रा (ra), and हूं (hu) should never get split and should each be treated as a single unit when generating BPE tokens.
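
For context, SentencePiece does expose a `user_defined_symbols` training option that keeps the listed strings as single, unsplittable pieces. Whether it fully matches this request is a separate question, since user-defined symbols are always segmented out as standalone pieces wherever they occur and are not merged further with surrounding text. Below is a minimal sketch, assuming a training corpus file `corpus.txt` and model prefix `hi_bpe` (both placeholder names):

```python
import sentencepiece as spm

# Train a BPE model; the strings in user_defined_symbols are always
# kept as single tokens and never split by the BPE segmentation.
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # Hindi training text, one sentence per line (placeholder)
    model_prefix="hi_bpe",     # output: hi_bpe.model / hi_bpe.vocab (placeholder)
    vocab_size=8000,
    model_type="bpe",
    user_defined_symbols=["मैं", "दि", "व्ये", "पु", "रा", "हूं"],
)

# Encode the example sentence and inspect the resulting pieces.
sp = spm.SentencePieceProcessor(model_file="hi_bpe.model")
print(sp.encode("मैं दिव्येश राजपुरा हूं", out_type=str))
```

Each listed symbol should appear intact in the output pieces; note, however, that this also pulls them out of longer words (e.g. दि inside दिव्येश), so they do not participate in further merges.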

Thanks & Regards, Divyesh Rajpura

divyeshrajpura4114 · May 14 '24