Adjust data cleaning for CJK

Open: eu9ene opened this issue 4 months ago • 1 comment

Use custom OpusCleaner configs with disabled word-based filters.

The filters are copied from https://github.com/hplt-project/HPLT-MT-Models/blob/main/v1.0/data/en-zh_hant/raw/v2/HPLT-v1.1.en-zh_hant.filters.json.
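As a rough illustration of what "disabled word-based filters" could look like in practice, here is a minimal sketch that removes selected steps from a filter config. It assumes the config is a JSON object with a top-level `filters` list whose entries carry a `filter` name; the set `WORD_BASED_FILTERS` and the helper `drop_word_based_filters` are hypothetical and are not taken from this repository or from the linked HPLT file.

```python
import copy

# Hypothetical set of filter names treated as word-based.
# Adjust it to the filters that actually appear in your config.
WORD_BASED_FILTERS = {"max_word_length", "src_trg_ratio"}


def drop_word_based_filters(config, word_based=WORD_BASED_FILTERS):
    """Return a copy of the config without the listed word-based filter steps.

    Assumes the config has a top-level "filters" list of dicts,
    each with a "filter" key naming the step.
    """
    cleaned = copy.deepcopy(config)
    cleaned["filters"] = [
        step
        for step in cleaned.get("filters", [])
        if step.get("filter") not in word_based
    ]
    return cleaned


# Illustrative config; the parameters are left empty on purpose.
example = {
    "version": 1,
    "files": [],
    "filters": [
        {"filter": "remove_empty_lines", "parameters": {}, "language": None},
        {"filter": "max_word_length", "parameters": {}, "language": None},
        {"filter": "src_trg_ratio", "parameters": {}, "language": None},
    ],
}

cjk_config = drop_word_based_filters(example)
print([step["filter"] for step in cjk_config["filters"]])
```

The original config is left unmodified, so the same base file could be reused for language pairs where word-based filters still make sense.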

I don't think it's feasible to apply the src-trg-ratio filter right now, because it requires tokenization. We would have to move tokenization into a separate step and somehow adjust the cleaning step to work on the tokenized text instead of the original text. I filed https://github.com/mozilla/firefox-translations-training/issues/899 for that.
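To show why a word-count ratio needs proper tokenization for CJK, here is a small sketch that computes a length ratio from whitespace-separated tokens. The function `whitespace_length_ratio` is a hypothetical stand-in written for this illustration, not OpusCleaner's actual filter; the point is only that unsegmented Chinese text collapses to a single "word" when split on whitespace.

```python
def whitespace_length_ratio(src, trg):
    """Ratio of the shorter to the longer side, counted in whitespace tokens.

    Hypothetical illustration only; real filters may count differently.
    """
    src_len = len(src.split())
    trg_len = len(trg.split())
    if src_len == 0 or trg_len == 0:
        return 0.0
    return min(src_len, trg_len) / max(src_len, trg_len)


en = "I like to read books in the evening."
zh = "我喜欢在晚上看书。"

# The English side has 8 whitespace tokens, the Chinese side only 1,
# so a valid translation pair gets a very low ratio.
print(len(en.split()), len(zh.split()))
print(whitespace_length_ratio(en, zh))
```

With a typical threshold on this ratio, a correct pair like the one above would be discarded, which is why the filter is disabled until the cleaning step can run on tokenized text.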

closes #742

eu9ene, Oct 23 '24 21:10