
WMT21 & WMT22

Open Muennighoff opened this issue 2 years ago • 1 comments

Adding a Dataset

  • Name: WMT21 & WMT22
  • Description: We are going to have three tracks: two small tasks and a large task. The small tracks evaluate translation between fairly related languages and English (all pairs). The large track uses 101 languages.
  • Paper: /
  • Data: https://statmt.org/wmt21/large-scale-multilingual-translation-task.html https://statmt.org/wmt22/large-scale-multilingual-translation-task.html
  • Motivation: Many more languages than previous WMT versions - Could be very high impact

Instructions to add a new dataset can be found here.

I could also tackle this. I saw the existing logic for WMT models is a bit complex (datasets are stored on the wmt account & retrieved in separate wmt datasets afaict). How long do you think it would take me? @lhoestq

Muennighoff avatar Jul 18 '22 21:07 Muennighoff

Hi ! That would be awesome to have them indeed, thanks for opening this issue

I just added you to the WMT org on the HF Hub if you're interested in adding those datasets.

Feel free to create a dataset repository for each dataset and upload the data files there :) preferably in ZIP archives instead of TAR archives (the current WMT scripts don't support streaming TAR archives, so it would break the dataset preview). We've also had issues with the statmt.org host (data unavailable, slow download speed), that's why I think it's better if we re-host the files on the Hub.
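If the original data only ships as TAR archives, repacking into ZIP before uploading could look roughly like the sketch below. It only uses the Python standard library; the function name `tar_to_zip` and the file paths are made up for illustration and are not part of the WMT scripts.

```python
import shutil
import tarfile
import zipfile


def tar_to_zip(tar_path, zip_path):
    """Copy every regular file from a TAR archive into a new ZIP archive.

    Hypothetical helper, not part of the WMT scripts.
    """
    # "r:*" lets tarfile detect gzip/bz2/xz compression automatically.
    with tarfile.open(tar_path, "r:*") as tar, zipfile.ZipFile(
        zip_path, "w", compression=zipfile.ZIP_DEFLATED
    ) as zf:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            # Stream the member into the ZIP without extracting it to disk.
            with tar.extractfile(member) as src, zf.open(member.name, "w") as dst:
                shutil.copyfileobj(src, dst)
    return zip_path
```

Usage would be something like `tar_to_zip("some_subset.tar.gz", "some_subset.zip")`, after which the ZIP file can be uploaded to the dataset repository.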

wmt21 (and wmt22) can be added on the HF Hub under the WMT org (we'll move the previous ones to this org soon as well). To add it, you can copy-paste the code of the previous one (e.g. wmt19), and add the new data:

  • in wmt_utils.py, add the new data subsets. You need to provide the download URLs, as well as the target and source languages
  • in wmt21.py (renamed from wmt19.py), you can specify the subsets that WMT21 uses (i.e. the one you just added)
  • in wmt_utils.py, define the python function that must be used to parse the subsets you added. To do so, you must go into _generate_examples and choose the proper sub_generator based on the subset name. For example, the paracrawl_v3 subset uses the _parse_tmx function:

https://github.com/huggingface/datasets/blob/ede72d3f9796339701ec59899c7c31d2427046fb/datasets/wmt19/wmt_utils.py#L834-L835

Hopefully the data is in a format that is already supported and there's no need to write a new _parse_* function for the new subsets. Let me know if you have questions or if I can help :)
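To illustrate the idea of picking a parser from the subset name, here is a minimal standalone sketch. It is not the actual code of wmt_utils.py: the parser `_parse_tsv`, the mapping `SUB_GENERATORS`, the helper `choose_sub_generator` and the subset name `my_wmt21_subset` are all hypothetical, and it assumes the data is a two-column TSV file, which may not be the case for WMT21/WMT22.

```python
import csv


def _parse_tsv(path, src_lang, tgt_lang):
    """Hypothetical parser: yield (key, example) pairs from a two-column TSV file."""
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for idx, row in enumerate(reader):
            if len(row) != 2:
                continue  # skip malformed lines
            yield idx, {src_lang: row[0], tgt_lang: row[1]}


# Hypothetical mapping from a subset-name prefix to the function that parses it.
SUB_GENERATORS = {"my_wmt21_subset": _parse_tsv}


def choose_sub_generator(subset_name):
    """Return the parser registered for this subset name, or fail loudly."""
    for prefix, parser in SUB_GENERATORS.items():
        if subset_name.startswith(prefix):
            return parser
    raise ValueError(f"Unsupported subset: {subset_name}")
```

A new subset in an already supported format would then only need a new entry in the mapping, while an unsupported format would need its own parser function.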

lhoestq avatar Jul 19 '22 10:07 lhoestq

@Muennighoff , @lhoestq let me know if you want me to look into this. Happy to help bring WMT21 & WMT22 datasets into 🤗 !

srhrshr avatar Oct 05 '22 12:10 srhrshr

Hi @srhrshr :) Sure, feel free to create a dataset repository on the Hub and start from the implementation of WMT19 if you want. Then we can move the dataset under the WMT org (we'll move the other ones there as well).

Let me know if you have questions or if I can help

lhoestq avatar Oct 05 '22 13:10 lhoestq

#self-assign

Etelis avatar Jun 19 '23 21:06 Etelis

Hello @lhoestq ,

Would it be possible for me to be granted access to the WMT organization (on hf ofc) in order to facilitate dataset uploads? I've already initiated the joining process at this link: https://huggingface.co/wmt

I appreciate your help with this. Thank you!

Etelis avatar Jun 20 '23 06:06 Etelis

Hi ! Cool I just added you

lhoestq avatar Jun 20 '23 09:06 lhoestq