firefox-translations-training

Monolingual data has a word splitter that won't work for CJK

Open. gregtatum opened this issue 1 year ago • 1 comment

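Right now the monolingual data handling splits on word boundaries and limits the size of the monolingual data to be less than 100 "words". That doesn't work for CJK: Chinese and Japanese in particular are usually written without spaces between words, so a long sentence can look like a single "word". This needs to be changed to support another segmentation strategy for CJK languages, maybe just a byte limit.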
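To illustrate the difference, here is a minimal sketch and not the project's actual code. It assumes the current check is roughly a whitespace split; the function names `within_word_limit` and `within_byte_limit` and the 1000-byte threshold are hypothetical, picked only for the example.

```python
def within_word_limit(line: str, max_words: int = 100) -> bool:
    """Assumed current behavior: count whitespace-separated tokens."""
    return len(line.split()) < max_words


def within_byte_limit(line: str, max_bytes: int = 1000) -> bool:
    """Possible alternative: cap the UTF-8 encoded size, independent of script.

    The 1000-byte default is a made-up value for illustration.
    """
    return len(line.encode("utf-8")) < max_bytes


# 150 space-separated English words.
english = "word " * 150

# A Chinese sentence repeated 150 times with no spaces at all,
# so a whitespace split sees it as a single token.
chinese = "这是一个没有空格的句子。" * 150

print(within_word_limit(english))  # rejected by the word limit
print(within_word_limit(chinese))  # slips through the word limit
print(within_byte_limit(chinese))  # caught by the byte limit
```

A byte limit is cheap and needs no language-specific tokenizer, but it treats scripts unevenly: most CJK characters take 3 bytes in UTF-8, so the same threshold allows fewer characters than it does for Latin text. A character-count limit or a proper segmenter would be other options to weigh.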

gregtatum, Feb 06 '24 17:02