
[Experiment] Chinese Traditional

Open evgenyrp opened this issue 5 months ago • 10 comments

Training dashboard

evgenyrp avatar Nov 18 '25 21:11 evgenyrp

@ZJaume FYI these are the hacks I'm adding to train zh-hant. They are not meant to be landed as is to main.

evgenyrp avatar Nov 19 '25 00:11 evgenyrp

The mono HPLT data was filtered down to just 29M lines (52M on import, and everything else removed by cleaning). I wonder if it's too aggressive. I tried the fastText model manually and it makes mistakes quite often. import, stats, cleaning

evgenyrp avatar Nov 20 '25 18:11 evgenyrp

Only 24M parallel corpus examples are left (stats, group)

When I trained zh-en in Simplified, the corpus size was 86M (stats, group)

@ZJaume do you have ideas why there's such a big difference? Could it be the new NLLB fastText model? We currently identify only the zh code with it, without the script. We do the conversion from Simplified to Traditional before that, on the data import step. I wonder if the model didn't like the converted texts.

NLLB was filtered from 71M down to 24M; 61M of those 71M are examples that were converted from Simplified to Traditional.
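The script-less matching mentioned above can be sketched as stripping the script suffix from NLLB-style labels before comparing them with the target code. This is a hypothetical helper, not the actual pipeline code; it only assumes fastText's `__label__` prefix convention and NLLB's `lang_Script` label format:

```python
# Hypothetical sketch: compare NLLB-style LID labels (e.g. "zho_Hant")
# against a bare language code by dropping the script suffix.
def strip_script(label: str) -> str:
    """Turn '__label__zho_Hant' or 'yue_Hant' into 'zho' / 'yue'."""
    label = label.removeprefix("__label__")
    return label.split("_", 1)[0]

def lid_matches(predicted: str, wanted: str = "zho") -> bool:
    # Both zho_Hans and zho_Hant pass this check, which is why
    # pre-converted Simplified text is not distinguished here.
    return strip_script(predicted) == wanted

print(lid_matches("zho_Hant"))  # True
print(lid_matches("yue_Hant"))  # False
```

Note that under this scheme a sentence the model labels `zho_Hans` still passes for `zho`, so the check cannot by itself catch conversions the model "didn't like".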

evgenyrp avatar Nov 20 '25 19:11 evgenyrp

So, regarding mono hplt this seems to be the cleaning summary:

22735836 HPLT_LID_SEGMENT
29798622 DUPLICATE
2028464 CLEAN_LID
3920256 RATIO_ALPHA
3683762 RATIO_CHARS
11074337 TOO_LONG
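For a quick sense of scale, the discard counts above can be totalled and expressed as shares (plain arithmetic over the numbers in the summary, nothing pipeline-specific):

```python
# Discard counts from the HPLT mono cleaning summary above.
discards = {
    "HPLT_LID_SEGMENT": 22_735_836,
    "DUPLICATE": 29_798_622,
    "CLEAN_LID": 2_028_464,
    "RATIO_ALPHA": 3_920_256,
    "RATIO_CHARS": 3_683_762,
    "TOO_LONG": 11_074_337,
}
total = sum(discards.values())
print(f"total discarded: {total:,}")  # total discarded: 73,241,277
for rule, n in sorted(discards.items(), key=lambda kv: -kv[1]):
    print(f"{rule:>18}: {n / total:6.1%}")
```

Duplicates and segment-level LID together account for over 70% of the discarded lines, which is why the bullets below focus on those rules first.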

things that I think may be addressed:

  • Relax TOO_LONG, which is limited to 150 chars in the CJK case; the debug.txt lines discarded by this rule don't look long enough to deserve discarding, I would say.
  • RATIO_ALPHA discards a significant number of sentences that are actually only (or almost only) "alpha" characters. Either the opuscleaner.filters.clean_common regex for Hant is missing those characters, or they are Simplified characters that LID is not discarding, or the sentences mix Traditional and Simplified scripts, or the characters belong to both scripts and the regex only includes the Traditional ones (?). I don't really know what to do with this.
  • The duplicates, I believe, are something we can't do anything about. Maybe Chinese has more boilerplate than other languages in HPLT because the text extractor doesn't handle it well.
  • The HPLT segment-level LID condition could be relaxed to also accept Cantonese. I know that Cantonese and Mandarin in Traditional script are usually difficult for langid models to distinguish. We could trust only the document-level LID and treat the segment-level one as unreliable. However, I don't know whether that's a safe decision. If you do this, take into account that HPLT v3 has a mistake: yue_Hant appears as zho_Hant in the seg_langs field.
  • I've now realised we didn't change the LID cleaning for monolingual data to use the NLLB model. But it seems that this model is pretty bad at distinguishing Cantonese from Mandarin Traditional, so maybe for this specific case you could use OpenLID-v2, or be permissive and allow every sentence classified as yue.
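The segment-LID relaxation suggested above (accept yue alongside zho_Hant, keeping the HPLT v3 seg_langs mislabel in mind) could look roughly like this. The set contents and helper name are hypothetical, and `seg_lang` is assumed to hold NLLB-style labels:

```python
# Hypothetical sketch of a more permissive segment-level LID check for
# zh-hant mono data: accept Mandarin Traditional and Cantonese labels.
# Per the bullet above, HPLT v3 mislabels yue_Hant as zho_Hant in the
# seg_langs field, so accepting zho_Hant already covers that case too.
ACCEPTED_SEG_LANGS = {"zho_Hant", "yue_Hant", "yue"}

def keep_segment(seg_lang: str) -> bool:
    return seg_lang in ACCEPTED_SEG_LANGS

print(keep_segment("zho_Hant"))  # True
print(keep_segment("yue_Hant"))  # True
print(keep_segment("zho_Hans"))  # False
```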

Maybe all these things are useful to you for parallel data as well, but I haven't had time to look at that yet. I will on Monday.

ZJaume avatar Nov 21 '25 17:11 ZJaume

eflomal fails with OOM on the 200M back-translated corpus https://firefox-ci-tc.services.mozilla.com/tasks/UJbEJNKJQrKGe7UFiqDrQQ/runs/3/logs/public/logs/live.log I wonder if it's because the vocab is more diverse for zh-hant, or maybe the back-translation is of low quality.

evgenyrp avatar Nov 25 '25 23:11 evgenyrp

Ok, so I took a look at the parallel data, and it's a bit difficult to tell what's happening with the cleaning without a debug log for the filters. It was probably in part because of the lid218e model; let's see what comes out of it after the change to OpenLID.

I think it may be a good option to take a step back and start from parallel data that was originally written in Hant, then add the converted data and see if it improves. As far as I can tell, we are only using HPLT as a "native" corpus and the rest is converted. There are only a few more such corpora in OPUS, but training only on those would take the conversion out of the equation as a possible cause of the quality drop. Also, a small (but maybe not that small) thing: it seems we are converting everything to Traditional here

https://github.com/mozilla/translations/blob/a3406bc03c8bbcaedaef1ff467667805d942314a/pipeline/data/cjk.py#L65-L71

and there is at least one detection case where we shouldn't convert: hanzidentifier.BOTH. Then there's the hanzidentifier.MIXED case, but converting there might be the right choice.
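The suggested fix to the conversion logic could be sketched as follows. To keep the snippet self-contained, hanzidentifier's detection results are mimicked with an enum and no real detector is called; the actual code in pipeline/data/cjk.py uses the hanzidentifier library itself, so treat this as an illustration of the decision rule only:

```python
from enum import Enum, auto

# Stand-ins for hanzidentifier's detection results (the real library
# exposes similar constants via hanzidentifier.identify()).
class Script(Enum):
    SIMPLIFIED = auto()   # unambiguously Simplified
    TRADITIONAL = auto()  # unambiguously Traditional
    BOTH = auto()         # valid as either script -- conversion is a no-op at best
    MIXED = auto()        # contains characters from both scripts
    UNKNOWN = auto()      # no CJK characters detected

def should_convert_to_hant(detected: Script) -> bool:
    """Convert only genuinely Simplified (or mixed) text; leave BOTH
    alone, as suggested in the comment above."""
    return detected in (Script.SIMPLIFIED, Script.MIXED)

print(should_convert_to_hant(Script.BOTH))        # False
print(should_convert_to_hant(Script.SIMPLIFIED))  # True
```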

ZJaume avatar Nov 27 '25 11:11 ZJaume

It's 71M sentences of parallel corpus now vs 24M before, after switching to OpenLID-v2 and relaxing the length limit a little. I think the main issue was the lang id.

And 86M of HPLT mono after implementing your suggestions, so we're probably good.

evgenyrp avatar Nov 27 '25 18:11 evgenyrp

> and there is at least one detection case that we shouldn't convert, which is hanzidentifier.BOTH

Maybe, but hypothetically it shouldn't hurt.

evgenyrp avatar Nov 27 '25 18:11 evgenyrp

The corpus size is 84M now, after adding OPUS zh_TW, but the backward model's chrF validation curve is slightly lower: https://wandb.ai/moz-translations/zh-en?nw=nwuserepavlov

evgenyrp avatar Dec 04 '25 19:12 evgenyrp

Teacher COMET on flores: 86.72 vs Google's 87.15

evgenyrp avatar Dec 18 '25 00:12 evgenyrp