[Experiment] Data cleaning Apr 2024
Experiment insights
OpusCleaner
- legacy cleaning slightly outperforms all OpusCleaner configs (likely due to the num_mismatch filter in OpusCleaner)
- the large FastText model significantly reduces false positives compared to the small one (see the language-ID sketch after this list)
- FastText can remove a lot of useful data on cleaner datasets, especially short phrases
- the alpha ratio filter can remove useful data on cleaner datasets
- custom OpusCleaner configs slightly outperform the default one
- custom OpusCleaner configs + bicleaner significantly outperform the default one + bicleaner (+5M useful sentences from removing some cleaning rules)
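To make the FastText point concrete, here is a minimal sketch of language-ID filtering with the fastText Python bindings (not the pipeline's actual filter); the model path and the 0.5 threshold are illustrative assumptions:

```python
import fasttext

# Large model (lid.176.bin); swap in the small compressed lid.176.ftz
# to see more false positives, especially on short phrases.
model = fasttext.load_model("lid.176.bin")

def is_language(text: str, lang: str, threshold: float = 0.5) -> bool:
    """Keep a sentence only if FastText assigns it to `lang` confidently."""
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0] == f"__label__{lang}" and probs[0] >= threshold

pairs = [("Hello world", "Привет, мир"), ("OK", "ОК")]
kept = [(src, trg) for src, trg in pairs
        if is_language(src, "en") and is_language(trg, "ru")]
# Short phrases like "OK" often get low-confidence predictions, which is
# one way aggressive language ID throws away useful data on clean corpora.
```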
OpusFilter
- an OpusFilter config similar to the OpusCleaner one, with auto-tuning, performs a lot worse than the OpusCleaner one (likely due to the difference in filters); see the config sketch after this list
- OpusFilter with LASER and auto-tuning performs better than without them, but still worse than OpusCleaner (the Helsinki folks pointed out that there's a bug in sampling with LASER)
- auto-tuning with only basic OpusCleaner-like filters (no bicleaner or LASER) performs better than the OpusCleaner-like defaults and better than auto-tuning with feature selection disabled, mostly because it trained longer and had more data
- auto-tuning with LASER and BicleanerAI enabled filters out way too much data and underperforms
- auto-tuned and defaults-based OpusCleaner-like rules do not outperform the OpusCleaner-defaults baseline (likely a difference in the FastText implementation)
- (TODO) tune LASER and bicleaner separately
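For reference, a rough sketch of what an "OpusCleaner-like" OpusFilter config could look like, built as a Python dict and dumped to YAML. The filter class names follow OpusFilter's documentation; the thresholds, file names, and model path are illustrative assumptions, not tuned values:

```python
import yaml

config = {
    "steps": [
        {
            "type": "filter",
            "parameters": {
                "inputs": ["corpus.en.gz", "corpus.ru.gz"],
                "outputs": ["filtered.en.gz", "filtered.ru.gz"],
                "filters": [
                    # rough analogues of basic OpusCleaner rules
                    {"LengthFilter": {"unit": "word", "min_length": 1, "max_length": 150}},
                    {"LengthRatioFilter": {"unit": "word", "threshold": 3}},
                    {"LanguageIDFilter": {
                        "id_method": "fasttext",
                        "fasttext_model_path": "lid.176.bin",
                        "languages": ["en", "ru"],
                        "thresholds": [0.5, 0.5],
                    }},
                ],
            },
        }
    ]
}

with open("opusfilter-config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
# then run: opusfilter opusfilter-config.yaml
```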
Bicleaner AI
- I deployed OpusCleaner on a GPU with Bicleaner AI support; it's a little slow but works
- it's very hard to tune bicleaner thresholds in OpusCleaner
- manual analysis of score distributions and examples in Jupyter shows that even at a 0.9 threshold there are plenty of incorrect translations
- experimented with 0.5 vs 0.8 vs 0.9 for all datasets: 0.8 slightly outperforms 0.5; 0.9 filters out too much but is still competitive (see the filtering sketch after this list)
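A minimal sketch of what the threshold filtering amounts to, assuming a typical tab-separated src/trg/score layout for Bicleaner AI output (the file names and column order are assumptions, not the pipeline's exact format):

```python
THRESHOLD = 0.8  # 0.8 slightly outperformed 0.5 in this experiment

with open("corpus.scored.tsv", encoding="utf-8") as fin, \
     open("corpus.filtered.tsv", "w", encoding="utf-8") as fout:
    for line in fin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines
        src, trg, score = fields[0], fields[1], float(fields[2])
        if score >= THRESHOLD:
            fout.write(f"{src}\t{trg}\n")
```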
LASER
- also hard to tune in OpusCleaner (a similarity-filter sketch follows this list)
- LASER 2/3 is slower than LASER 1 and requires a GPU
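A sketch of LASER-based similarity filtering using the laserembeddings package (LASER 1); the 0.7 threshold is an arbitrary illustrative value, and picking it well is exactly what proved hard in OpusCleaner:

```python
import numpy as np
from laserembeddings import Laser  # pip install laserembeddings
# models need a one-time: python -m laserembeddings download-models

laser = Laser()
src_sents = ["Hello world", "The weather is nice today"]
trg_sents = ["Привет, мир", "Это совсем другое предложение"]

src_emb = laser.embed_sentences(src_sents, lang="en")
trg_emb = laser.embed_sentences(trg_sents, lang="ru")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The second pair should score low: the sentences are unrelated.
kept = [(s, t) for s, t, es, et in zip(src_sents, trg_sents, src_emb, trg_emb)
        if cosine(es, et) >= 0.7]
```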
More questions to explore:
LASER embedding similarity filter:
- What's the impact of the LASER filter?
- Can LASER be useful together with Bicleaner-AI? (a combined-score sketch follows these questions)
- Does LASER 2/3 significantly outperform LASER 1?
Bicleaner-AI:
- Will customizing the thresholds for large datasets boost performance?
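One hypothetical way to combine the two signals, just to pin down what the LASER + Bicleaner-AI question means; the AND-combination and both thresholds are assumptions to experiment with, not anything implemented:

```python
def keep_pair(laser_sim: float, bicleaner_score: float,
              laser_thr: float = 0.7, bicleaner_thr: float = 0.8) -> bool:
    """A pair survives only if both filters clear their thresholds."""
    return laser_sim >= laser_thr and bicleaner_score >= bicleaner_thr
```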
Setup
en-ru pair, all data except CCMatrix/NLLB, training the backward model (ru-en)
Example config:
```yaml
datasets:
  # all except ccmatrix and nllb to test filtering
  train:
    - opus_Books/v1
    - opus_CCAligned/v1
    - opus_ELRC-3075-wikipedia_health/v1
    - opus_ELRC-3855-SWPS_University_Soci/v1
    - opus_ELRC-5067-SciPar/v1
    - opus_ELRC-5183-SciPar_Ukraine/v1
    - opus_ELRC-wikipedia_health/v1
    - opus_ELRC_2922/v1
    - opus_EUbookshop/v2
    - opus_GNOME/v1
    - opus_GlobalVoices/v2018q4
    - opus_KDE4/v2
    - opus_LinguaTools-WikiTitles/v2014
    - opus_NeuLab-TedTalks/v1
    - opus_News-Commentary/v16
    - opus_OpenSubtitles/v2018
    - opus_PHP/v1
    - opus_ParaCrawl/v9
    - opus_QED/v2.0a
    - opus_TED2013/v1.1
    - opus_TED2020/v1
    - opus_Tanzil/v1
    - opus_Tatoeba/v2023-04-12
    - opus_TildeMODEL/v2018
    - opus_UNPC/v1.0
    - opus_Ubuntu/v14.10
    - opus_WikiMatrix/v1
    - opus_WikiTitles/v3
    - opus_Wikipedia/v1.0
    - opus_XLEnt/v1.2
    - opus_ada83/v1
    - opus_bible-uedin/v1
    - opus_infopankki/v1
    - opus_tico-19/v2020-10-28
    - opus_tldr-pages/v2023-08-29
    - opus_wikimedia/v20230407
    - mtdata_Statmt-commoncrawl_wmt13-1-rus-eng
    - mtdata_Statmt-news_commentary_wmt18-13-rus-eng
    - mtdata_Tilde-airbaltic-1-eng-rus
    - mtdata_Tilde-czechtourism-1-eng-rus
    - mtdata_Tilde-worldbank-1-eng-rus
    - mtdata_UN-un_dev-1-eng-rus
    - mtdata_UN-un_test-1-eng-rus
  # datasets to merge for validation while training
  devtest:
    - flores_dev
    - sacrebleu_aug-mix_wmt19
    - sacrebleu_aug-mix_wmt17
    - sacrebleu_aug-mix_wmt15
    - sacrebleu_aug-mix_wmt14
  # datasets for evaluation
  test:
    - flores_devtest
    - sacrebleu_wmt20
    - sacrebleu_wmt18
    - sacrebleu_wmt16
    - sacrebleu_wmt13
  # monolingual datasets (ex. paracrawl-mono_paracrawl8, commoncrawl_wmt16, news-crawl_news.2020)
  # to be translated by the teacher model
  mono-src:
    - news-crawl_news.2008
  # to be translated by the backward model to augment teacher corpus with back-translations
  # leave empty to skip augmentation step (high resource languages)
  mono-trg:
    - news-crawl_news.2008
experiment:
  src: en
  trg: ru
  name: opuscleaner_custom_laser_bicleaner
  vocab: NOT-YET-SUPPORTED
  bicleaner:
    default-threshold: 0.5
    dataset-thresholds: {}
  best-model: chrf
  split-length: 2000000
  backward-model: NOT-YET-SUPPORTED
  spm-sample-size: 10000000
  spm-vocab-size: 32000
  teacher-ensemble: 1
  mono-max-sentences-src: 500000000
  mono-max-sentences-trg: 500000000
  use-opuscleaner: 'true'
marian-args:
  decoding-teacher:
    precision: float16
    mini-batch-words: '4000'
  training-student:
    early-stopping: '20'
  decoding-backward:
    beam-size: '8'
    mini-batch-words: '2000'
  training-backward:
    after: 10e
  training-teacher:
    early-stopping: '20'
  training-student-finetuned:
    early-stopping: '20'
taskcluster:
  split-chunks: 10
target-stage: train-backwards
```