Investigate word-based filtering for CJK

Open eu9ene opened this issue 4 months ago • 1 comments

Nikolay: Length filtering. Chinese sentences normally come as one continuous string of characters, so traditional word-count length filtering doesn't work. Furthermore, since one word can consist of 1-4 Chinese characters, there is no hard-and-fast rule for converting character counts into word counts. The usual approach is to use a Chinese tokenizer (such as jieba, https://github.com/fxsjy/jieba#jieba-1) to split the Chinese text into words. We can then safely apply the length-ratio filtering here:

firefox-translations-training/pipeline/clean/tools/clean_parallel.py, line 93 in 3b3f33b:

`ratio_len = src_len / float(trg_len)`

Most papers recommend discarding lines where the ratio of English to Chinese words, or of Chinese to English words, is more than 1.3.
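A minimal sketch of what a word-based ratio check could look like, assuming the tokenizer is passed in as a plain function. The names `word_length_ratio`, `keep_pair`, and `load_jieba_tokenizer` are hypothetical and are not part of `clean_parallel.py`; the 1.3 default is the threshold quoted above, applied symmetrically so that either direction of imbalance is rejected.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def whitespace_tokenize(text: str) -> List[str]:
    """Word split for languages that separate words with spaces."""
    return text.split()


def load_jieba_tokenizer() -> Tokenizer:
    """Return a Chinese word tokenizer backed by jieba (third-party).

    Imported lazily so the rest of the sketch runs without jieba installed.
    Whitespace-only tokens are dropped so they do not count as words.
    """
    import jieba

    return lambda text: [tok for tok in jieba.lcut(text) if tok.strip()]


def word_length_ratio(src: str, trg: str,
                      src_tok: Tokenizer, trg_tok: Tokenizer) -> float:
    """Symmetric word-count ratio: longer side divided by shorter side."""
    src_len = len(src_tok(src))
    trg_len = len(trg_tok(trg))
    if src_len == 0 or trg_len == 0:
        # An empty side can never pass a finite threshold.
        return float("inf")
    return max(src_len, trg_len) / min(src_len, trg_len)


def keep_pair(src: str, trg: str,
              src_tok: Tokenizer, trg_tok: Tokenizer,
              max_ratio: float = 1.3) -> bool:
    """Keep the pair only if neither side is more than max_ratio times longer."""
    return word_length_ratio(src, trg, src_tok, trg_tok) <= max_ratio
```

With jieba installed, a zh-en pair would be checked as `keep_pair(zh_line, en_line, load_jieba_tokenizer(), whitespace_tokenize)`. Because the tokenizer is just an argument, a Japanese tokenizer could be swapped in the same way.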

Afterwards, the text should be de-segmented again and prepared for training.
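One way to sidestep an explicit de-segmentation step, sketched here as an assumption rather than as what the pipeline does, is to tokenize only to obtain word counts and to emit the original, unsegmented lines. The function `filter_pairs` and its parameters are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Tokenizer = Callable[[str], List[str]]


def filter_pairs(pairs: Iterable[Tuple[str, str]],
                 src_tok: Tokenizer,
                 trg_tok: Tokenizer,
                 max_ratio: float = 1.3) -> Iterator[Tuple[str, str]]:
    """Yield the original (src, trg) strings whose word-count ratio passes.

    Tokenization is used only for counting, so the yielded text is never
    segmented and needs no de-segmentation afterwards.
    """
    for src, trg in pairs:
        src_len = len(src_tok(src))
        trg_len = len(trg_tok(trg))
        if src_len == 0 or trg_len == 0:
            continue  # drop pairs with an empty side
        if max(src_len, trg_len) / min(src_len, trg_len) <= max_ratio:
            yield src, trg
```

If a later stage did need the segmented form, the de-segmentation step described above would still apply; this variant only covers the filtering decision.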

For Japanese, a Japanese tokenizer should be used in place of jieba.

eu9ene, Oct 23 '24 21:10