firefox-translations-training issues

Switch bestbleu to chrF

chrF is now considered more reliable than BLEU and should work better for CJK. Based on advice from #748. Also unifies sacrebleu and mtdata versions everywhere. Closes #748.
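
To illustrate why a character-level metric suits unsegmented CJK text, here is a minimal sketch of a chrF-style score. It is a simplified stand-in written for this note, not sacrebleu's implementation: the function names, the whitespace stripping, and the defaults (`max_n=6`, `beta=2.0`) are assumptions chosen for the example.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF-style F-beta score over character n-grams.

    Precision and recall are averaged over the n-gram orders that both
    strings are long enough to produce, then combined with F-beta.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    # No word segmentation is needed: the score is computed on characters.
    print(simple_chrf("我喜欢猫", "我喜欢狗"))
```

Because the score never relies on whitespace tokenization, a near-miss Chinese hypothesis still earns partial credit, whereas a whitespace-split word-level metric would see a single non-matching token.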

eu9ene

Check evaluation procedure for CJK

2

Does it require any adjustment? Do our metrics (chrF, COMET, BLEU) work correctly for these languages?

eu9ene

language-coverage

Adjust data cleaning for CJK

Use custom OpusCleaner configs with disabled word-based filters. The filters are copied from https://github.com/hplt-project/HPLT-MT-Models/blob/main/v1.0/data/en-zh_hant/raw/v2/HPLT-v1.1.en-zh_hant.filters.json. I don't think it's feasible to do the src-trg-ratio that requires tokenization now. We would have...

eu9ene

Configure vocab for CJK

- character coverage
- size

Closes #745
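
As a rough illustration of what character coverage means for a CJK corpus, the sketch below counts how many distinct characters are needed to cover a given fraction of all character occurrences. It is a standalone helper written for this note (the name `chars_needed_for_coverage` is hypothetical), not part of the pipeline or of SentencePiece; SentencePiece exposes the related `--character_coverage` and `--vocab_size` training options.

```python
from collections import Counter


def chars_needed_for_coverage(lines, coverage):
    """Return the number of most-frequent distinct characters needed so
    that they account for at least `coverage` of all non-space characters."""
    counts = Counter(ch for line in lines for ch in line if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    for kept, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= coverage:
            return kept
    return len(counts)


if __name__ == "__main__":
    corpus = ["aaab", "aaac"]
    # 'a' alone covers 6 of 8 characters; full coverage needs all three.
    print(chars_needed_for_coverage(corpus, 0.75))
    print(chars_needed_for_coverage(corpus, 1.0))
```

Running this on a sample of the training data at a few coverage levels gives a feel for how many rare characters a setting below 1.0 would leave out of the vocabulary.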

eu9ene

Investigate issues with SentencePiece vocabulary for CJK

5

See comments from Jaume:
https://github.com/mozilla/firefox-translations-training/issues/45#issuecomment-1036191497
https://github.com/mozilla/firefox-translations-training/issues/45#issuecomment-1036198055

eu9ene

language-coverage

Use GCP standard instances for alignment tasks

1

[skip ci]

eu9ene

Check decoding for CJK

1

Does decoding, extract-best and other procedures for translation work the same way for CJK?

eu9ene

language-coverage

Check shortlist for CJK

Does it require any modifications?

eu9ene

language-coverage

Fix shortlist pruning for CJK

I don't have a good understanding of why some lines are suddenly empty as a result of running "extract_lex". There are just a few of them and the model trained...

eu9ene

Rework wasm build scripts for gecko

2

@gregtatum The goal of this patch is to move much of the functionality from the [build-bergamot.py](https://searchfox.org/mozilla-central/rev/dca2603d55b5b39d3b8ab8e93c08b42563f5aad8/toolkit/components/translations/bergamot-translator/build-bergamot.py) script in Mozilla Central upstream into this repository to better streamline how WASM artifacts...

nordzilla

inference
