firefox-translations-training
Add BCP 47 code support to the mtdata importer.
Some mtdata sources include datasets identified by BCP 47 tags in the format xxx_Yyyy_ZZ, where the script (Yyyy) and region (ZZ) parts are optional. Compressed downloads of these datasets include the full tag in the file extension. For example, downloading
- mtdata_Statmt-ccaligned-1-eng-zho_CN
Results in:
Statmt-ccaligned-1-eng-zho_CN.eng.gz
and Statmt-ccaligned-1-eng-zho_CN.zho_CN.gz
- Note the extension .zho_CN.gz, which carries the region as well as the language code.
The current mtdata importer assumes every dataset uses a plain ISO 639-3 code and does not account for a script or region in the output file name, which results in the following error:
mv .../Statmt-ccaligned-1-eng-zho_CN.zho.gz .../mtdata_Statmt-ccaligned-1-eng-zho_CN.zh.gz
mv: cannot stat '.../train-parts/Statmt-ccaligned-1-eng-zho_CN.zho.gz': No such file or directory
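One possible way to handle this, sketched in Python for illustration only (this is not the importer's actual code): take the language tags from the end of the dataset name and build the expected file names from the full tags instead of from the ISO 639-3 code alone. The helper names `split_dataset_langs` and `downloaded_file_names` and the regular expression are hypothetical, and the sketch assumes the dataset name always ends in `-<src>-<trg>` with underscores inside the tags, as in the example above.

```python
import re

# Assumed tag shape from this report: xxx[_Yyyy][_ZZ]
# (three-letter language, optional script, optional two-letter region).
TAG_RE = re.compile(
    r"^(?P<lang>[a-z]{3})"
    r"(?:_(?P<script>[A-Z][a-z]{3}))?"
    r"(?:_(?P<region>[A-Z]{2}))?$"
)


def split_dataset_langs(dataset_id):
    """Return the (source, target) tags from the end of a dataset name."""
    # Tags use underscores internally, so the last two hyphen-separated
    # fields are assumed to be the language tags.
    _, src, trg = dataset_id.rsplit("-", 2)
    return src, trg


def downloaded_file_names(dataset_id):
    """Build the expected .gz names, keeping script and region in the extension."""
    src, trg = split_dataset_langs(dataset_id)
    for tag in (src, trg):
        if TAG_RE.match(tag) is None:
            raise ValueError(f"unexpected language tag: {tag!r}")
    return f"{dataset_id}.{src}.gz", f"{dataset_id}.{trg}.gz"


if __name__ == "__main__":
    for name in downloaded_file_names("Statmt-ccaligned-1-eng-zho_CN"):
        print(name)
```

With names built this way, the `mv` step would look for `.zho_CN.gz` rather than `.zho.gz`. How the tag should then map to the short code used in the destination file name (`zh` in the log above) is a separate decision that this sketch does not cover.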
I was just about to open the same bug report. +1
I think this is still valid. I'm guessing our task will fail in Taskcluster if and when it comes up. We only need to fix it when a dataset triggers it though.