Dataset infos in yaml
To simplify the addition of new datasets, we'd like to have the dataset infos in the YAML and deprecate the dataset_infos.json file. YAML is readable and easy to edit, and the YAML metadata of the README already contains dataset metadata, so we would have everything in one place.
To be more specific, I moved these fields from DatasetInfo to the YAML:
- config_name (if there are several configs)
- download_size
- dataset_size
- features
- splits
Here is what I ended up with for squad:
dataset_infos:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: train
    num_bytes: 79346360
    num_examples: 87599
  - name: validation
    num_bytes: 10473040
    num_examples: 10570
  download_size: 35142551
  dataset_size: 89819400
It can also be a list if there are several configs.
I have already made the change for conll2000 and crime_and_punish as examples.
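Since the field can hold either a single mapping or a list of mappings, a reader has to handle both shapes. Here is a minimal sketch of one way to do that on the already-parsed YAML value. The helper name `infos_by_config` and the `"default"` fallback key are assumptions for illustration, not part of this PR.

```python
def infos_by_config(dataset_infos):
    """Return a {config_name: info} mapping from either YAML shape.

    `dataset_infos` is the parsed YAML value: one mapping when there is a
    single config, or a list of mappings when there are several.
    NOTE: hypothetical helper; the "default" fallback name is an assumption.
    """
    if isinstance(dataset_infos, dict):
        # Single config: config_name may be omitted from the YAML.
        name = dataset_infos.get("config_name", "default")
        return {name: dataset_infos}
    # Several configs: each entry is expected to carry its own config_name.
    return {info["config_name"]: info for info in dataset_infos}


single = {"download_size": 35142551, "dataset_size": 89819400}
several = [
    {"config_name": "a", "download_size": 1},
    {"config_name": "b", "download_size": 2},
]
print(list(infos_by_config(single)))
print(list(infos_by_config(several)))
```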
Implementation details
Load/Read
This is done via DatasetInfoDict.write_to_directory/from_directory.
I had to implement custom YAML export logic for SplitDict, Version and Features.
The first two are trivial, but the logic for Features is more complicated because I added a simplification step (otherwise the YAML would be too long and less readable): it is just a formatting step that removes unnecessary nesting of YAML data.
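To give an idea of what such a simplification step does, here is a rough sketch that flattens a nested feature description into the list form used in the squad example. It is not the code from features.py: the helper names are hypothetical, the input shape (`_type` and `feature` keys) is assumed to follow the JSON form of the features, and only Value and Sequence-of-dict features are handled.

```python
def simplify_feature(feature):
    """Flatten one nested feature description (hypothetical helper)."""
    kind = feature.get("_type")
    if kind == "Value":
        # {"dtype": "string", "_type": "Value"} -> {"dtype": "string"}
        return {"dtype": feature["dtype"]}
    if kind == "Sequence":
        # Recurse into the inner features and drop the "feature" wrapper.
        return {"sequence": features_to_yaml_list(feature["feature"])}
    raise ValueError(f"feature type not covered by this sketch: {kind!r}")


def features_to_yaml_list(features):
    """Turn {name: feature} into a list of {"name": ..., ...} entries."""
    return [
        {"name": name, **simplify_feature(feature)}
        for name, feature in features.items()
    ]


# Assumed nested input, modeled on the JSON representation.
nested = {
    "id": {"dtype": "string", "_type": "Value"},
    "answers": {
        "feature": {
            "text": {"dtype": "string", "_type": "Value"},
            "answer_start": {"dtype": "int32", "_type": "Value"},
        },
        "_type": "Sequence",
    },
}
simplified = features_to_yaml_list(nested)
print(simplified)
```

Dumping `simplified` with a YAML library would give the same shape as the features entries of the squad example.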
Other changes
I had to update the DatasetModule factories to also download the README.md alongside the dataset scripts/data files, and not just the dataset_infos.json.
YAML validation
I removed the old validation code that was in metadata.py; now we can just use the Hub YAML validation.
Datasets-cli
The datasets-cli test --save_infos command now creates a README.md file with the dataset_infos in it, instead of a dataset_infos.json file.
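In a README.md, the YAML metadata sits in the front matter, the block between two `---` lines at the top of the file. Here is a minimal sketch of how that block could be separated from the rest of the card using only the standard library. The function name `split_front_matter` is hypothetical, and the sketch does not parse the YAML itself.

```python
def split_front_matter(readme_text):
    """Return (yaml_block, body) for a README string.

    yaml_block is None when the text does not start with a `---` line or
    the closing `---` is missing. Hypothetical helper for illustration.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None, readme_text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            # Everything between the two markers is the YAML metadata.
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return None, readme_text


readme = "---\ndataset_infos:\n  download_size: 35142551\n---\n# Dataset Card\n"
yaml_block, body = split_front_matter(readme)
print(yaml_block)
```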
Backward compatibility
dataset_infos.json files are still supported and loaded if they exist, for full backward compatibility.
However, I removed the unnecessary keys when the value is the default (like all the id: null from the Value feature types) to make them easier to read.
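As an illustration of that cleanup, here is a small sketch that recursively drops keys whose value is `None`. It only covers the `id: null` case mentioned above. Treating `None` as the only default is a simplifying assumption, and `drop_none` is a hypothetical name, not a function from this PR.

```python
def drop_none(obj):
    """Recursively remove dict keys whose value is None (sketch only)."""
    if isinstance(obj, dict):
        return {key: drop_none(value) for key, value in obj.items() if value is not None}
    if isinstance(obj, list):
        return [drop_none(item) for item in obj]
    return obj


# Assumed JSON-like feature entry with a default `id`.
before = {"question": {"dtype": "string", "id": None, "_type": "Value"}}
after = drop_none(before)
print(after)
```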
TODO
- [x] add comments
- [x] tests
- [ ] document the new YAML fields (to be done in the Hub docs)
- [x] try to reload the new dataset_infos.json file content with an old version of datasets
EDITS
- removed "config_name" when there's only one config
- removed "version" for now (?), because it's not useful in general
Fix https://github.com/huggingface/datasets/issues/4876
Alright, this is ready for review :) I would mostly like your opinion on the YAML structure and what we can do in the docs (IMO we can add the docs about those fields in the Hub docs). Other than that, let me know if the changes in info.py and features.py look good to you.
LGTM and looking forward to having this merged!! ❤️
We plan to do a release today, we'll merge this after the release :)
EDIT: actually tomorrow
Created https://github.com/huggingface/datasets/pull/5018 where I added the YAML dataset_info of every single dataset in this repo
see other dataset cards: imagenet-1k, glue, flores, gem
Took your comments into account and updated push_to_hub to push the dataset_info to the README.md instead of JSON :) Let me know if it sounds good to you now!