firefox-translations-training

Add BCP 47 code support in mtdata importer.

Open khoisan25 opened this issue 2 years ago • 2 comments

mtdata sources include datasets identified by BCP 47 tags of the form xxx_Yyyy_ZZ, where the script (Yyyy) and region (ZZ) are optional. Compressed downloads of these datasets include the full tag in the file extension. For example, downloading

- mtdata_Statmt-ccaligned-1-eng-zho_CN

results in Statmt-ccaligned-1-eng-zho_CN.eng.gz and Statmt-ccaligned-1-eng-zho_CN.zho_CN.gz

  • Note the extension .zho_CN.gz
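To make the tag format concrete, here is a minimal Python sketch that splits a tag of the xxx_Yyyy_ZZ shape into its parts. The regular expression and the `split_tag` helper are assumptions for illustration only: they encode just the shape described above (three lowercase letters, an optional four-letter script, an optional two-letter region), not the full BCP 47 grammar and not mtdata's own parser.

```python
import re

# Assumed tag shape, taken from the description above: xxx_Yyyy_ZZ,
# where the script (Yyyy) and region (ZZ) parts are optional.
TAG_RE = re.compile(
    r"^(?P<lang>[a-z]{3})"               # three-letter language code
    r"(?:_(?P<script>[A-Z][a-z]{3}))?"   # optional script, e.g. Hans
    r"(?:_(?P<region>[A-Z]{2}))?$"       # optional region, e.g. CN
)


def split_tag(tag):
    """Hypothetical helper: return (language, script, region) for a tag.

    Missing parts are returned as None.
    """
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag: {tag!r}")
    return match.group("lang"), match.group("script"), match.group("region")
```

With this sketch, `split_tag("zho_CN")` yields the language `zho` and the region `CN` with no script, while a bare code such as `eng` yields only the language.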

The current mtdata importer assumes every dataset uses a bare ISO 639-3 code and does not account for a script or region in the output file name, which results in the following error:

mv .../Statmt-ccaligned-1-eng-zho_CN.zho.gz .../mtdata_Statmt-ccaligned-1-eng-zho_CN.zh.gz
mv: cannot stat '.../train-parts/Statmt-ccaligned-1-eng-zho_CN.zho.gz': No such file or directory
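One possible direction for a fix, sketched in Python for illustration: derive the downloaded file name from the full tag embedded in the dataset name rather than from the bare ISO 639-3 code. The `downloaded_file` helper is hypothetical, and the sketch assumes that the last two hyphen-separated fields of the dataset name are the source and target tags, as in the example above. The importer itself may be structured differently.

```python
from pathlib import Path


def downloaded_file(parts_dir, dataset, side):
    """Hypothetical helper: path of the file mtdata wrote for one side.

    Assumes the dataset name ends in "-<source tag>-<target tag>",
    e.g. "Statmt-ccaligned-1-eng-zho_CN", and that the download is
    named "<dataset>.<tag>.gz" with the full tag as the extension.
    """
    if side not in ("src", "trg"):
        raise ValueError(f"side must be 'src' or 'trg', got {side!r}")
    # Split from the right so hyphens earlier in the name are left alone.
    _, src_tag, trg_tag = dataset.rsplit("-", 2)
    tag = src_tag if side == "src" else trg_tag
    return Path(parts_dir) / f"{dataset}.{tag}.gz"
```

For the dataset above, the target side resolves to a file ending in .zho_CN.gz, which is the file that actually exists, instead of the .zho.gz path the failing mv looks for.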

khoisan25, Feb 28 '22 22:02

I was just about to open the same bug report. +1

XapaJIaMnu, Feb 28 '22 23:02

I think this is still valid. I'm guessing our task will fail in Taskcluster if and when this comes up. We only need to fix it once a dataset actually triggers it, though.

gregtatum, Apr 09 '24 21:04