firefox-translations-training
The old cleaning script breaks on small datasets
| Dataset | Code | Sentences | Size | URL |
| --- | --- | --- | --- | --- |
| ELRC-Museus_2007 | opus_ELRC-Museus_2007/v1 | 125 | 7.2 kB | https://opus.nlpl.eu/ELRC-Museus_2007-v1.php |
| ELRC-Localidades_2007 | opus_ELRC-Localidades_2007/v1 | 101 | 8.2 kB | https://opus.nlpl.eu/ELRC-Localidades_2007-v1.php |
| ELRC-2638-monumentos_2007 | opus_ELRC-2638-monumentos_2007/v1 | 17 | 8.2 kB | https://opus.nlpl.eu/ELRC-2638-monumentos_2007-v1.php |
| ELRC-2614-Localidades_2007 | opus_ELRC-2614-Localidades_2007/v1 | 10 | 8.2 kB | https://opus.nlpl.eu/ELRC-2614-Localidades_2007-v1.php |
| ELRC-2612-Artigos_visitportuga | opus_ELRC-2612-Artigos_visitportuga/v1 | 9 | 7.2 kB | https://opus.nlpl.eu/ELRC-2612-Artigos_visitportuga-v1.php |
| ELRC-2616-Museus_2007 | opus_ELRC-2616-Museus_2007/v1 | 8 | 7.2 kB | https://opus.nlpl.eu/ELRC-2616-Museus_2007-v1.php |
| ELRC-Artigos_visitportuga | opus_ELRC-Artigos_visitportuga/v1 | 6 | 7.2 kB | https://opus.nlpl.eu/ELRC-Artigos_visitportuga-v1.php |
| ELRC-2480-Estatuto_dos_Deputad | opus_ELRC-2480-Estatuto_dos_Deputad/v1 | 2 | 1.0 kB | https://opus.nlpl.eu/ELRC-2480-Estatuto_dos_Deputad-v1.php |
| ELRC-2481-Constituio_da_Repbli | opus_ELRC-2481-Constituio_da_Repbli/v1 | 2 | 1.0 kB | https://opus.nlpl.eu/ELRC-2481-Constituio_da_Repbli-v1.php |
https://firefox-ci-tc.services.mozilla.com/tasks/groups/eUqbJW9rQOeiP8o1ctHpPw
All of these failed with no error message on:

```shell
parallel --no-notice --pipe -k -j 8 --block 50M 'python3 -Wi tools/langid_fasttext.py -f 1 | python3 -Wi tools/langid_fasttext.py -f 1'
```
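For context, `tools/langid_fasttext.py` is a line-by-line language-id filter run twice (once per side of the pair). A rough sketch of the shape such a filter takes is below; the function name, the stub classifier, and the empty-input handling are mine for illustration, not the repo's actual code (the real tool wraps a fastText model):

```python
def filter_by_lang(lines, lang, detect):
    """Keep only TSV lines whose selected field is identified as `lang`.

    `detect` stands in for the fastText language-id call in the real tool;
    it is passed in here so the sketch stays self-contained and testable.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Tolerate blank lines and empty blocks instead of failing,
        # which is the hypothesized weak spot on tiny datasets.
        if not fields or not fields[0]:
            continue
        if detect(fields[0]) == lang:
            kept.append(line)
    return kept
```

Whatever the exact failure mode, a filter like this receives almost no input when `parallel --pipe --block 50M` is fed a few-kB dataset, so tiny or empty blocks are the case worth guarding.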
I assume this is because the datasets are so small. I included all available data for a Catalan experiment, and it broke the pipeline. I can remove these datasets to work around it.
I think we should stop using those legacy scripts and rely only on OpusCleaner. It would make sense to enable it by default at this point.
This will go away with #569, and isn't active in production anymore.