[Experiment] Chinese Traditional
@ZJaume FYI these are the hacks I'm adding to train zh-hant. They are not meant to be landed as is to main.
Mono HPLT was filtered down to just 29M lines (52 on import and everything else with cleaning). I wonder if that's too aggressive. I tried the fastText model manually and it makes mistakes quite often. import, stats, cleaning
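For context, this is roughly the kind of manual spot check I mean, using the fastText Python bindings (just a sketch; the model path and sample file are placeholders, not the actual pipeline paths):

```python
# Rough sketch of a manual LID spot check (paths below are placeholders).
import fasttext

# NLLB/OpenLID-style fastText model; labels look like "__label__zho_Hant".
model = fasttext.load_model("lid218e.bin")

with open("hplt_zh_hant_sample.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        labels, probs = model.predict(line, k=3)
        print(f"{labels[0]}\t{probs[0]:.2f}\t{line[:60]}")
```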
Only 24M parallel corpus examples are left (stats, group)
When I trained zh-en in Simplified, the corpus size was 86M (stats, group)
@ZJaume do you have any ideas why there is such a big difference? Could it be the new NLLB fastText model? We currently identify only the zh code with it, without the script. We do the conversion from Simplified to Traditional before that, on the data import step. I wonder if the model doesn't like the converted texts.
NLLB was filtered from 71M to 24M. 61M of the 71M are examples that were converted from Simplified to Traditional.
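One way to check the suspicion about converted text would be to run LID on a Simplified sentence and on its Traditional conversion and compare the scores. A minimal sketch, assuming the opencc and fasttext Python packages (the model path is a placeholder, and depending on the opencc package the config may need to be "s2t.json"):

```python
# Sketch: does the LID model score converted (Simplified -> Traditional)
# text lower than the original? Paths and config names are assumptions.
import fasttext
import opencc

converter = opencc.OpenCC("s2t")           # Simplified -> Traditional
model = fasttext.load_model("lid218e.bin")

simplified = "这是一个用来测试语言识别的句子。"
traditional = converter.convert(simplified)

for text in (simplified, traditional):
    labels, probs = model.predict(text, k=3)
    top = list(zip(labels, (round(float(p), 3) for p in probs)))
    print(text, top)
```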
So, regarding mono HPLT, this seems to be the cleaning summary:
22735836 HPLT_LID_SEGMENT
29798622 DUPLICATE
2028464 CLEAN_LID
3920256 RATIO_ALPHA
3683762 RATIO_CHARS
11074337 TOO_LONG
things that I think may be addressed:
- Relax the too-long filter, which is limited to 150 chars in the case of CJK; the `debug.txt` lines discarded by this rule are not that long to deserve being discarded, I would say (see the sketch after this list).
- Ratio alpha discards a significant number of sentences that are actually only or almost only "alpha", so either `opuscleaner.filters.clean_commonregex` for Hant is missing those characters, or they are simplified characters that LID is not discarding, or the sentences mix traditional and simplified scripts, or they are characters shared by both scripts and the regex only includes traditional (?). I don't really know what to do with this.
- The duplicates: I believe we cannot do anything about them. Maybe for Chinese there's more boilerplate than for other languages in HPLT because the text extractor is not good at it.
- The HPLT LID-by-segment condition could be relaxed to accept Cantonese. I know that Cantonese and Mandarin in Traditional script are usually difficult for langid models to distinguish. We could trust only document-level LID and drop segment-level LID as unreliable. However, I don't know if that's a safe decision or not. If you do this, take into account that HPLT v3 has a mistake and `yue_Hant` appears as `zho_Hant` in the `seg_langs` field.
- I've now realised we didn't change the LID cleaning for monolingual to use the NLLB model. But it seems that this model is pretty bad at distinguishing Cantonese from Mandarin Traditional. So, maybe for this specific case you could use OpenLID-v2, or be permissive and also allow every sentence classified as `yue`.
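To make the length and LID points above concrete, a relaxed per-segment check could look roughly like this (just a sketch; the thresholds and label names are assumptions, not the pipeline's actual values):

```python
# Sketch of a relaxed per-segment filter for zh-hant mono data:
#  - raise the too-long limit for CJK text,
#  - count Han characters as "alpha" for the ratio check,
#  - accept Cantonese (yue) labels in addition to Chinese (zho).
# Thresholds and label names here are assumptions, not the pipeline's values.
import regex

MAX_CJK_CHARS = 300          # relaxed from 150
MIN_HAN_RATIO = 0.5          # fraction of Han characters required
ACCEPTED_LANGS = {"zho_Hant", "yue_Hant", "yue"}

HAN_RE = regex.compile(r"\p{Han}")

def keep_segment(text: str, seg_lang: str) -> bool:
    if not text or len(text) > MAX_CJK_CHARS:
        return False
    han_chars = len(HAN_RE.findall(text))
    if han_chars / len(text) < MIN_HAN_RATIO:
        return False
    # HPLT v3 note: yue_Hant may show up as zho_Hant in seg_langs anyway.
    return seg_lang in ACCEPTED_LANGS
```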
Maybe all these things are useful to you for parallel data, but I haven't had time to look at it. Will do on Monday.
eflomal fails with OOM on the 200M back-translated corpus: https://firefox-ci-tc.services.mozilla.com/tasks/UJbEJNKJQrKGe7UFiqDrQQ/runs/3/logs/public/logs/live.log I wonder if it's because the vocab is more diverse for zh-hant, or maybe the back-translations are of low quality.
OK, so I took a look at parallel and it's a bit difficult to tell what's happening with the cleaning without a debug log for the filters. It was probably in part because of the lid218e model; let's see what comes out of it after the change to OpenLID.
I think it might be a good option to take a step back and start from parallel data that's originally written in Hant, then add the converted data and see if it improves. As far as I can tell, we are only using HPLT as a "native" corpus and the rest is converted. There are only a few more in OPUS, but training only on that would take the conversion out of the equation of what's hurting quality. Also, a small but maybe not-that-small thing: it seems we are converting everything to Traditional here
https://github.com/mozilla/translations/blob/a3406bc03c8bbcaedaef1ff467667805d942314a/pipeline/data/cjk.py#L65-L71
and there is at least one detection case where we shouldn't convert, which is `hanzidentifier.BOTH`. Then there's the `hanzidentifier.MIXED` case, but converting in that case might be the right choice.
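Roughly what I have in mind for the conversion guard (a sketch, not the actual cjk.py code; `convert_to_traditional` is a placeholder for whatever converter the pipeline uses, e.g. OpenCC s2t):

```python
# Sketch of the conversion guard: only convert text that hanzidentifier
# says is SIMPLIFIED or MIXED; leave TRADITIONAL and BOTH untouched.
# `convert_to_traditional` is a placeholder, not the pipeline's function.
import hanzidentifier

def maybe_convert(text: str, convert_to_traditional) -> str:
    kind = hanzidentifier.identify(text)
    if kind in (hanzidentifier.SIMPLIFIED, hanzidentifier.MIXED):
        return convert_to_traditional(text)
    # TRADITIONAL, BOTH and UNKNOWN are left as-is.
    return text
```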
It's 71M sentences of parallel corpus vs 24M before, after switching to OpenLID-v2 and relaxing the length limit a little. I think the main issue was in the lang ID.
And 86M of HPLT mono after implementing your suggestions, so we're probably good.
> and there is at least one detection case where we shouldn't convert, which is `hanzidentifier.BOTH`

Maybe, but hypothetically it shouldn't hurt.
The corpus size is 84M now, after adding OPUS zh_TW, but the backward model's chrF validation curve during training is slightly lower: https://wandb.ai/moz-translations/zh-en?nw=nwuserepavlov
Teacher COMET on flores: 86.72 vs Google 87.15