firefox-translations-training
Adjust data cleaning for CJK
Use custom OpusCleaner configs with disabled word-based filters.
The filters are copied from https://github.com/hplt-project/HPLT-MT-Models/blob/main/v1.0/data/en-zh_hant/raw/v2/HPLT-v1.1.en-zh_hant.filters.json.
I don't think it's feasible to add the src-trg-ratio filter right now, because it requires tokenization. We would have to move tokenization into a separate step and then adjust the cleaning step to work on the tokenized text instead of the original text. I filed https://github.com/mozilla/firefox-translations-training/issues/899 to track this.
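As a rough illustration of why word-based filters are unreliable for CJK without tokenization, here is a minimal sketch. It assumes a filter that counts whitespace-separated words; the function names and example sentences are hypothetical and are not taken from OpusCleaner or this pipeline.

```python
def whitespace_word_count(text: str) -> int:
    """Count words by splitting on whitespace (hypothetical stand-in for a word-based filter)."""
    return len(text.split())


def word_ratio(src: str, trg: str) -> float:
    """Ratio of the larger word count to the smaller one; infinite if either side is empty."""
    a, b = whitespace_word_count(src), whitespace_word_count(trg)
    if min(a, b) == 0:
        return float("inf")
    return max(a, b) / min(a, b)


en = "The weather is very nice today"
zh = "今天天气很好"  # no spaces, so whitespace splitting sees a single "word"

print(whitespace_word_count(en))  # 6
print(whitespace_word_count(zh))  # 1
print(word_ratio(en, zh))         # 6.0
```

A valid sentence pair like this one gets a ratio of 6.0, so a typical ratio threshold would probably reject it. A meaningful ratio would need segmented Chinese text, which is what the tokenization step would provide.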
closes #742