WMT21 & WMT22
Adding a Dataset
- Name: WMT21 & WMT22
- Description: We are going to have three tracks: two small tasks and a large task. The small tracks evaluate translation between fairly related languages and English (all pairs). The large track uses 101 languages.
- Paper: /
- Data: https://statmt.org/wmt21/large-scale-multilingual-translation-task.html, https://statmt.org/wmt22/large-scale-multilingual-translation-task.html
- Motivation: Covers many more languages than previous WMT editions; could be very high impact
Instructions to add a new dataset can be found here.
I could also tackle this. I saw the existing logic for WMT models is a bit complex (datasets are stored on the wmt account & retrieved in separate wmt datasets afaict). How long do you think it would take me? @lhoestq
Hi ! That would be awesome to have them indeed, thanks for opening this issue
I just added you to the WMT org on the HF Hub if you're interested in adding those datasets.
Feel free to create a dataset repository for each dataset and upload the data files there :) Preferably use ZIP archives instead of TAR archives (the current WMT scripts don't support streaming TAR archives, so TARs would break the dataset preview). We've also had issues with the statmt.org host (data unavailable, slow download speeds), which is why I think it's better if we re-host the files on the Hub.
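Since the Hub copies should be ZIPs rather than TARs, the repackaging can be done locally before uploading. Here is a minimal sketch (the `tar_to_zip` helper name and the flat file layout are my own; it only copies regular files and ignores directories and links):

```python
import tarfile
import zipfile


def tar_to_zip(tar_path: str, zip_path: str) -> None:
    """Copy every regular file from a TAR archive into a new ZIP archive."""
    with tarfile.open(tar_path) as tar, zipfile.ZipFile(
        zip_path, "w", compression=zipfile.ZIP_DEFLATED
    ) as zf:
        for member in tar.getmembers():
            if member.isfile():
                # Preserve the member's path inside the archive.
                zf.writestr(member.name, tar.extractfile(member).read())
```

The resulting ZIP can then be uploaded to the dataset repository (e.g. through the Hub web UI or `huggingface_hub`).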
`wmt21` (and `wmt22`) can be added in this GitHub repository I think, or on the HF Hub under the WMT org (we'll move the previous ones to this org soon as well).
To add it, you can copy-paste the code of a previous one (e.g. wmt19) and add the new data:
- in `wmt_utils.py`, add the new data subsets. You need to provide the download URLs, as well as the source and target languages
- in `wmt21.py` (renamed from `wmt19.py`), specify the subsets that WMT21 uses (i.e. the ones you just added)
- in `wmt_utils.py`, define the Python function used to parse the subsets you added. To do so, go into `_generate_examples` and choose the proper `sub_generator` based on the subset name. For example, the `paracrawl_v3` subset uses the `_parse_tmx` function:
https://github.com/huggingface/datasets/blob/ede72d3f9796339701ec59899c7c31d2427046fb/datasets/wmt19/wmt_utils.py#L834-L835
Hopefully the data is in a format that is already supported and there's no need to write a new `_parse_*`
function for the new subsets. Let me know if you have questions or if I can help :)
@Muennighoff , @lhoestq let me know if you want me to look into this. Happy to help bring WMT21 & WMT22 datasets into 🤗 !
Hi @srhrshr :) Sure, feel free to create a dataset repository on the Hub and start from the implementation of WMT19 if you want. Then we can move the dataset under the WMT org (we'll move the other ones there as well).
Let me know if you have questions or if I can help
#self-assign
Hello @lhoestq ,
Would it be possible for me to be granted membership in the WMT organization (on HF, of course) to facilitate dataset uploads? I've already initiated the joining process at this link: https://huggingface.co/wmt
I appreciate your help with this. Thank you!
Hi ! Cool I just added you