
Dataset infos in yaml

Open lhoestq opened this issue 2 years ago • 4 comments

To simplify the addition of new datasets, we'd like to have the dataset infos in YAML and deprecate the dataset_infos.json file. YAML is readable and easy to edit, and the YAML metadata of the README already contains dataset metadata, so we would have everything in one place.

To be more specific, I moved these fields from DatasetInfo to the YAML:

  • config_name (if there are several configs)
  • download_size
  • dataset_size
  • features
  • splits

Here is what I ended up with for squad:

dataset_infos:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: train
    num_bytes: 79346360
    num_examples: 87599
  - name: validation
    num_bytes: 10473040
    num_examples: 10570
  download_size: 35142551
  dataset_size: 89819400

It can also be a list if there are several configs.
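
For illustration only (the config names, features and sizes below are made up, not from an actual dataset), the multi-config form would look roughly like this:

```yaml
dataset_infos:
- config_name: plain_text
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 123456
    num_examples: 1000
  download_size: 65432
  dataset_size: 123456
- config_name: with_labels
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int32
  splits:
  - name: train
    num_bytes: 234567
    num_examples: 1000
  download_size: 87654
  dataset_size: 234567
```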

I already did the change for conll2000 and crime_and_punish as an example.

Implementation details

Load/Read

This is done via DatasetInfoDict.write_to_directory/from_directory

I had to implement custom YAML export logic for SplitDict, Version and Features. The first two are trivial, but the logic for Features is more complicated, because I added a simplification step (or the YAML would be too long and less readable): it's just a formatting step that removes unnecessary nesting of the YAML data.
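
To give an idea of what that simplification step does (this is a hypothetical sketch, not the actual library code — function names and the exact dict layout are assumptions), it flattens the verbose `_type` nesting of the serialized features into the compact `name`/`dtype` entries shown in the squad example above:

```python
# Hypothetical sketch of the formatting step: the raw serialized features
# nest every leaf under a {"_type": ...} dict; for YAML we flatten a plain
# Value into a one-line `dtype` entry and unwrap Sequence wrappers.
def simplify_feature(feature: dict) -> dict:
    if feature.get("_type") == "Value":
        return {"dtype": feature["dtype"]}
    if feature.get("_type") == "Sequence":
        return {"sequence": simplify_feature(feature["feature"])}
    return feature

def features_to_yaml_list(features: dict) -> list:
    # Emit a named list, matching the `- name: .../dtype: ...` layout above.
    return [{"name": name, **simplify_feature(spec)} for name, spec in features.items()]
```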

Other changes

I had to update the DatasetModule factories to also download the README.md alongside the dataset scripts/data files, and not just the dataset_infos.json

YAML validation

I removed the old validation code that was in metadata.py; now we can just use the Hub YAML validation.

Datasets-cli

The datasets-cli test --save_infos command now creates a README.md file with the dataset_infos in it, instead of a dataset_infos.json file

Backward compatibility

dataset_infos.json files are still supported and loaded if they exist, for full backward compatibility. Though I removed the unnecessary keys whose value is the default (like all the id: null entries from the Value feature types) to make them easier to read.
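
The default-stripping idea can be sketched like this (a minimal illustration, assuming `None` defaults such as the `id: null` case; the real code works on the library's own serialized dicts):

```python
# Sketch: recursively drop keys whose value equals a known default,
# so the loaded legacy dataset_infos.json stays valid but shorter.
DEFAULTS = {"id": None}  # assumed default values, e.g. `id: null` on Value types

def strip_default_keys(obj):
    if isinstance(obj, dict):
        return {
            key: strip_default_keys(value)
            for key, value in obj.items()
            if not (key in DEFAULTS and value == DEFAULTS[key])
        }
    if isinstance(obj, list):
        return [strip_default_keys(item) for item in obj]
    return obj
```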

TODO

  • [x] add comments
  • [x] tests
  • [ ] document the new YAML fields (to be done in the Hub docs)
  • [x] try to reload the new dataset_infos.json file content with an old version of datasets

EDITS

  • removed "config_name" when there's only one config
  • removed "version" for now (?), because it's not useful in general

Fix https://github.com/huggingface/datasets/issues/4876

lhoestq avatar Sep 02 '22 16:09 lhoestq

The documentation is not available anymore as the PR was closed or merged.

Alright, this is ready for review :) I'd mostly like your opinion on the YAML structure and on what we can do in the docs (IMO we can add the docs about those fields in the Hub docs). Other than that, let me know if the changes in info.py and features.py look good to you.

lhoestq avatar Sep 12 '22 17:09 lhoestq

LGTM and looking forward to having this merged!! ❤️

julien-c avatar Sep 20 '22 09:09 julien-c

We plan to do a release today, we'll merge this after the release :)

EDIT: actually tomorrow

lhoestq avatar Sep 20 '22 09:09 lhoestq

Created https://github.com/huggingface/datasets/pull/5018 where I added the YAML dataset_info of every single dataset in this repo

see other dataset cards: imagenet-1k, glue, flores, gem

lhoestq avatar Sep 23 '22 18:09 lhoestq

Took your comments into account and updated push_to_hub to push the dataset_info to the README.md instead of JSON :) Let me know if it sounds good to you now!
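
For readers unfamiliar with dataset cards: the metadata lives in a YAML front-matter block between `---` markers at the top of README.md. A rough sketch of embedding it (plain string handling with a hypothetical helper name; the actual implementation uses the library's metadata utilities):

```python
# Sketch: write a YAML block as README front matter, replacing any
# existing `---`-delimited front-matter block at the top of the file.
def embed_yaml_in_readme(readme_text: str, yaml_block: str) -> str:
    front = f"---\n{yaml_block.rstrip()}\n---\n"
    if readme_text.startswith("---\n"):
        # Find the closing `---` of the existing front matter and keep the body.
        end = readme_text.find("\n---\n", 4)
        body = readme_text[end + len("\n---\n"):] if end != -1 else readme_text
        return front + body
    return front + readme_text
```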

lhoestq avatar Sep 30 '22 14:09 lhoestq