WMT21 & WMT22
Adding a Dataset
- Name: WMT21 & WMT22
- Description: We are going to have three tracks: two small tasks and a large task. The small tracks evaluate translation between fairly related languages and English (all pairs). The large track uses 101 languages.
- Paper: /
- Data: https://statmt.org/wmt21/large-scale-multilingual-translation-task.html, https://statmt.org/wmt22/large-scale-multilingual-translation-task.html
- Motivation: Covers many more languages than previous WMT editions; could be very high impact
Instructions to add a new dataset can be found here.
I could also tackle this. I saw the existing logic for WMT models is a bit complex (datasets are stored on the wmt account & retrieved in separate wmt datasets afaict). How long do you think it would take me? @lhoestq
Hi ! That would be awesome to have them indeed, thanks for opening this issue
I just added you to the WMT org on the HF Hub if you're interested in adding those datasets.
Feel free to create a dataset repository for each dataset and upload the data files there :) Preferably use ZIP archives instead of TAR archives (the current WMT scripts don't support streaming TAR archives, so TARs would break the dataset preview). We've also had issues with the statmt.org host (data unavailable, slow download speeds), which is why I think it's better if we re-host the files on the Hub.
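Since the Hub copies should be ZIPs rather than TARs, the repackaging can be done locally before uploading. Here is a minimal sketch (the `tar_to_zip` helper name and the flat file layout are my own; it only copies regular files and ignores directories and links):

```python
import tarfile
import zipfile


def tar_to_zip(tar_path: str, zip_path: str) -> None:
    """Copy every regular file from a TAR archive into a new ZIP archive."""
    with tarfile.open(tar_path) as tar, zipfile.ZipFile(
        zip_path, "w", compression=zipfile.ZIP_DEFLATED
    ) as zf:
        for member in tar.getmembers():
            if member.isfile():
                # Preserve the member's path inside the archive.
                zf.writestr(member.name, tar.extractfile(member).read())
```

The resulting ZIP can then be uploaded to the dataset repository (e.g. through the Hub web UI or `huggingface_hub`).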
`wmt21` (and `wmt22`) can be added in this GitHub repository I think, or on the HF Hub under the WMT org (we'll move the previous ones to this org soon as well).
To add it, you can copy-paste the code of a previous one (e.g. wmt19) and add the new data:
- in `wmt_utils.py`, add the new data subsets. You need to provide the download URLs, as well as the source and target languages
- in `wmt21.py` (renamed from `wmt19.py`), specify the subsets that WMT21 uses (i.e. the ones you just added)
- in `wmt_utils.py`, define the Python function used to parse the subsets you added. To do so, go into `_generate_examples` and choose the proper `sub_generator` based on the subset name. For example, the `paracrawl_v3` subset uses the `_parse_tmx` function:
https://github.com/huggingface/datasets/blob/ede72d3f9796339701ec59899c7c31d2427046fb/datasets/wmt19/wmt_utils.py#L834-L835
Hopefully the data is in a format that is already supported and there's no need to write a new `_parse_*`
function for the new subsets. Let me know if you have questions or if I can help :)
@Muennighoff , @lhoestq let me know if you want me to look into this. Happy to help bring WMT21 & WMT22 datasets into 🤗 !
Hi @srhrshr :) Sure, feel free to create a dataset repository on the Hub and start from the implementation of WMT19 if you want. Then we can move the dataset under the WMT org (we'll move the other ones there as well).
Let me know if you have questions or if I can help
#self-assign
Hello @lhoestq ,
Would it be possible for me to be granted membership in the WMT organization (on HF, of course) to facilitate dataset uploads? I've already initiated the joining process at this link: https://huggingface.co/wmt
I appreciate your help with this. Thank you!
Hi ! Cool I just added you