
The old cleaning script breaks on small datasets

Open gregtatum opened this issue 10 months ago • 2 comments

Dataset                        Code                                   Sentences Size     URL
────────────────────────────── ────────────────────────────────────── ───────── ──────── ──────────────────────────────────────────────────────────
ELRC-Museus_2007               opus_ELRC-Museus_2007/v1               125       7.2 kB   https://opus.nlpl.eu/ELRC-Museus_2007-v1.php
ELRC-Localidades_2007          opus_ELRC-Localidades_2007/v1          101       8.2 kB   https://opus.nlpl.eu/ELRC-Localidades_2007-v1.php
ELRC-2638-monumentos_2007      opus_ELRC-2638-monumentos_2007/v1      17        8.2 kB   https://opus.nlpl.eu/ELRC-2638-monumentos_2007-v1.php
ELRC-2614-Localidades_2007     opus_ELRC-2614-Localidades_2007/v1     10        8.2 kB   https://opus.nlpl.eu/ELRC-2614-Localidades_2007-v1.php
ELRC-2612-Artigos_visitportuga opus_ELRC-2612-Artigos_visitportuga/v1 9         7.2 kB   https://opus.nlpl.eu/ELRC-2612-Artigos_visitportuga-v1.php
ELRC-2616-Museus_2007          opus_ELRC-2616-Museus_2007/v1          8         7.2 kB   https://opus.nlpl.eu/ELRC-2616-Museus_2007-v1.php
ELRC-Artigos_visitportuga      opus_ELRC-Artigos_visitportuga/v1      6         7.2 kB   https://opus.nlpl.eu/ELRC-Artigos_visitportuga-v1.php
ELRC-2480-Estatuto_dos_Deputad opus_ELRC-2480-Estatuto_dos_Deputad/v1 2         1.0 kB   https://opus.nlpl.eu/ELRC-2480-Estatuto_dos_Deputad-v1.php
ELRC-2481-Constituio_da_Repbli opus_ELRC-2481-Constituio_da_Repbli/v1 2         1.0 kB   https://opus.nlpl.eu/ELRC-2481-Constituio_da_Repbli-v1.php

https://firefox-ci-tc.services.mozilla.com/tasks/groups/eUqbJW9rQOeiP8o1ctHpPw

All of these failed with no error message on:

parallel --no-notice --pipe -k -j 8 --block 50M 'python3 -Wi tools/langid_fasttext.py -f 1 | python3 -Wi tools/langid_fasttext.py -f 1'

I'm assuming this is because the datasets were so small. I included all available data for a Catalan experiment, and it broke the pipeline. I can just remove these datasets to work around it.
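The workaround of dropping the tiny datasets could be sketched roughly as below. This is only an illustration: the `Dataset` type, the `split_by_size` helper, and the `MIN_SENTENCES` cutoff are hypothetical names that do not exist in the repository, and the issue does not establish what size is actually "too small". The sentence counts are copied from the table above.

```python
from typing import NamedTuple


class Dataset(NamedTuple):
    code: str       # dataset code as listed in the table above
    sentences: int  # sentence count as listed in the table above


# Codes and sentence counts copied from the table in this issue.
SMALL_DATASETS = [
    Dataset("opus_ELRC-Museus_2007/v1", 125),
    Dataset("opus_ELRC-Localidades_2007/v1", 101),
    Dataset("opus_ELRC-2638-monumentos_2007/v1", 17),
    Dataset("opus_ELRC-2614-Localidades_2007/v1", 10),
    Dataset("opus_ELRC-2612-Artigos_visitportuga/v1", 9),
    Dataset("opus_ELRC-2616-Museus_2007/v1", 8),
    Dataset("opus_ELRC-Artigos_visitportuga/v1", 6),
    Dataset("opus_ELRC-2480-Estatuto_dos_Deputad/v1", 2),
    Dataset("opus_ELRC-2481-Constituio_da_Repbli/v1", 2),
]

# Hypothetical cutoff: an assumption for illustration, not a value from the issue.
MIN_SENTENCES = 1000


def split_by_size(datasets, min_sentences=MIN_SENTENCES):
    """Return (kept, dropped) lists, preserving the input order."""
    kept = [d for d in datasets if d.sentences >= min_sentences]
    dropped = [d for d in datasets if d.sentences < min_sentences]
    return kept, dropped


if __name__ == "__main__":
    kept, dropped = split_by_size(SMALL_DATASETS)
    print(f"kept {len(kept)}, dropped {len(dropped)}")
    for d in dropped:
        print(f"  dropping {d.code} ({d.sentences} sentences)")
```

With the assumed cutoff of 1000 sentences, every dataset listed in this issue ends up in the dropped list, which matches the manual workaround of removing them from the experiment.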

gregtatum, Apr 02 '24 20:04

I think we should stop using those legacy scripts and use only OpusCleaner. It would make sense to enable it by default at this point.

eu9ene, Apr 02 '24 21:04

This will go away with #569, and isn't active in production anymore.

gregtatum, May 06 '24 19:05