Move DatasetInfo from `dataset_infos.json` to the YAML tags in `README.md`
Currently there are two places to find metadata for datasets:
- `dataset_infos.json`, which contains per dataset config:
  - description
  - citation
  - license
  - splits and sizes
  - checksums of the data files
  - feature types
  - and more
- YAML tags, which contain:
  - license
  - language
  - train-eval-index
  - and more
It would be nice to have a single place instead. We can rely on the YAML tags rather than the JSON file, for consistency with models. And it would all be indexed by our back-end directly, which is a plus.
One way would be to move everything to the YAML tags except the checksums (there can be tens of thousands of them). The description/citation are already in the dataset card, so we probably don't need them in the YAML as well; it would be redundant.
Here is an example for SQuAD:
```yaml
download_size: 35142551
dataset_size: 89789763
version: 1.0.0
splits:
- name: train
  num_examples: 87599
  num_bytes: 79317110
- name: validation
  num_examples: 10570
  num_bytes: 10472653
features:
- name: id
  dtype: string
- name: title
  dtype: string
- name: context
  dtype: string
- name: question
  dtype: string
- name: answers
  struct:
  - name: text
    list:
      dtype: string
  - name: answer_start
    list:
      dtype: int32
```
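For reference, here is a minimal sketch of how these feature types line up with the existing `datasets` feature classes (the `answers` struct of lists maps to a dict of `Sequence`s; this is just an illustration of the correspondence, not a spec for the YAML↔`Features` conversion):

```python
from datasets import Features, Sequence, Value

# SQuAD feature types from the YAML above, expressed with the existing classes:
# "answers" is a struct whose fields are lists of string / int32 values.
features = Features({
    "id": Value("string"),
    "title": Value("string"),
    "context": Value("string"),
    "question": Value("string"),
    "answers": {
        "text": Sequence(Value("string")),
        "answer_start": Sequence(Value("int32")),
    },
})
```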
Since there is only one configuration for SQuAD, this structure is ok. We can look at datasets with several configs in a second step, but IMO it would be ok to have these fields per config using another syntax:
```yaml
configs:
- config: unlabeled
  splits:
  - name: train
    num_examples: 10000
  features:
  - name: text
    dtype: string
- config: labeled
  splits:
  - name: train
    num_examples: 100
  features:
  - name: text
    dtype: string
  - name: label
    dtype: ClassLabel
    names:
    - negative
    - positive
```
So in the end you could specify a YAML tag either at the top level (for all configs) or per config in the `configs` field.
Alternatively we could keep config-specific stuff in the `dataset_infos.json` as it is today.
Not sure yet what's the best approach here but cc @julien-c @mariosasko @albertvillanova @polinaeterna for feedback :)
also @osanseviero @Pierrci @SBrandeis potentially
Love this in principle 🚀
Let's keep in mind users might rely on `dataset_infos.json` already.
I'm not convinced by the two-syntax solution; wouldn't it be simpler to have only one syntax, with a default config for datasets that have only one config? i.e., always having the `configs` field. This makes parsing the metadata easier IMO.
Might also be good to wrap the tags under a `dataset_infos` tag as follows:
```yaml
description: ...
citation: ...
dataset_infos:
  download_size: 35142551
  dataset_size: 89789763
  version: 1.0.0
  configs:
  - ...
[...]
```
Let's also keep in mind that extracting YAML metadata from a markdown readme is a bit more tedious for users than just parsing a JSON file.
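To make this concrete, here is a rough sketch of what each approach looks like for a consumer (assuming PyYAML and the usual `---`-delimited README header; not an official API):

```python
import json
import re

import yaml  # pyyaml


def load_readme_metadata(readme_path: str) -> dict:
    """Parse the YAML block between the leading '---' markers of a README."""
    with open(readme_path, encoding="utf-8") as f:
        text = f.read()
    match = re.match(r"^---\s*\n(.*?)\n---\s*\n", text, flags=re.DOTALL)
    return yaml.safe_load(match.group(1)) if match else {}


def load_json_metadata(path: str) -> dict:
    """The current equivalent: plain JSON."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```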
> Let's keep in mind users might rely on `dataset_infos.json` already.

Yea we'll keep full backward compatibility
> Let's also keep in mind that extracting YAML metadata from a markdown readme is a bit more tedious for users than just parsing a JSON file.

The main things that may use or ingest these data IMO are:
- users in the UI or IDE
- `datasets` to populate the `DatasetInfo` python object
- moon-landing, which is already parsing YAML

Am I missing something ? If not I think it's ok to use YAML
> Might also be good to wrap the tags under a `dataset_infos` tag as follows:

Maybe one single syntax like this then ?
```yaml
dataset_infos:
- config: unlabeled
  download_size: 35142551
  dataset_size: 89789763
  version: 1.0.0
  splits:
  - name: train
    num_examples: 10000
  features:
  - name: text
    dtype: string
- config: labeled
  download_size: 35142551
  dataset_size: 89789763
  version: 1.0.0
  splits:
  - name: train
    num_examples: 100
  features:
  - name: text
    dtype: string
  - name: label
    dtype: ClassLabel
    names:
    - negative
    - positive
```
and when you have only one config:
```yaml
dataset_infos:
- config: default
  splits:
  - name: train
    num_examples: 10000
  features:
  - name: text
    dtype: string
```
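A consumer could then normalize this list into a mapping keyed by config name, treating a missing `config` entry as the default. A minimal sketch, using the field names from the examples above (nothing official):

```python
def index_by_config(dataset_infos: list) -> dict:
    """Turn the YAML `dataset_infos` list into {config_name: info}, falling back to 'default'."""
    return {info.get("config", "default"): info for info in dataset_infos}


# With the single-config example above:
infos = index_by_config([
    {
        "config": "default",
        "splits": [{"name": "train", "num_examples": 10000}],
        "features": [{"name": "text", "dtype": "string"}],
    }
])
assert infos["default"]["splits"][0]["num_examples"] == 10000
```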
love the idea, and the trend in general to move more things (like tasks) to a single place (YAML).
also, if you browse files on a dataset's page (in "Files and versions"), raw `README.md` files look nice and readable, while `.json` files are just one long line that users need to scroll.
> Let's also keep in mind that extracting YAML metadata from a markdown readme is a bit more tedious for users than just parsing a JSON file.

do users often parse the `dataset_infos.json` file themselves?
> do users often parse the `dataset_infos.json` file themselves?

Not AFAIK, but I'm sure there are a few users. Users that access this info via `DatasetInfo` from `datasets` won't see the change though, e.g.
```python
>>> from datasets import get_dataset_infos
>>> get_dataset_infos("squad")
{'plain_text': DatasetInfo(description='Stanford Question Answering Dataset...
```
> Maybe one single syntax like this then ?

LGTM!
> The main things that may use or ingest these data IMO are:
> - users in the UI or IDE
> - `datasets` to populate the `DatasetInfo` python object
> - moon-landing, which is already parsing YAML

Fair point!
Having dataset info in the README's YAML is great for API / `huggingface_hub` consumers as well, as it will be inserted in the `cardData` field out of the box 🔥
Very supportive of this!
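As a sketch of what that enables (assuming `huggingface_hub`'s `HfApi`; the exact attribute name for the card data may differ between versions):

```python
from huggingface_hub import HfApi

info = HfApi().dataset_info("squad")
# The README YAML header is returned as the card data of the repo, so a
# `dataset_infos` section would show up here without any extra parsing.
card_data = getattr(info, "card_data", None) or getattr(info, "cardData", None)
print(card_data)
```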
Nesting an array of configs inside `dataset_infos:` sounds good to me. One small tweak is that `config: default` can be optional for the default config (which can be the first one by convention).

We'll be able to implement metadata validation on the Hub side so we ensure that those metadata are always in the right format (maybe for @coyotte508 ? cc @Pierrci). From a quick glance the `features` might be the harder part to validate here, any doc will be welcome.
Other high-level points:
- as we move from mostly academic datasets to all datasets (which include the data inside the repos), my intuition is that more and more datasets (Hub-stored) are going to be single-config
- similarly, fewer and fewer datasets will have a loading script, just the data + some metadata
- to lower the barrier to entry to contribution, in the long term users shouldn't need to compute/update this data via a command line. It could be filled automatically on the Hub through a "bot" inside Discussions & Pull requests for instance.
re: `config: default`

Note also that the default config is not named `default`, afaiu, but created from the repo name, e.g. for https://huggingface.co/datasets/nbtpj/bionlp2021SAS the default config is `nbtpj--bionlp2021SAS` (which is awful)
> Note also that the default config is not named `default`, afaiu, but created from the repo name, e.g. for https://huggingface.co/datasets/nbtpj/bionlp2021SAS the default config is `nbtpj--bionlp2021SAS` (which is awful)

We can change this to `default` I think, or something else
> From a quick glance the `features` might be the harder part to validate here, any doc will be welcome.
I dug into features validation, see:
- the OpenAPI spec: https://github.com/huggingface/datasets-server/blob/main/chart/static-files/openapi.json#L460-L697
- the node.js code: https://github.com/huggingface/moon-landing/blob/upgrade-datasets-server-client/server/lib/datasets/FeatureType.ts
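For illustration only (the real validation lives in the Joi schemas / OpenAPI spec linked above), a rough Python sketch of the kind of recursive check the `features` list needs:

```python
ALLOWED_DTYPES = {"string", "int32", "int64", "float32", "bool", "ClassLabel"}  # illustrative subset


def validate_feature(feature: dict) -> None:
    """Recursively check one entry of the YAML `features` list (sketch, not the real schema)."""
    if not isinstance(feature.get("name"), str):
        raise ValueError(f"feature needs a string `name`: {feature!r}")
    if "dtype" in feature:
        if feature["dtype"] not in ALLOWED_DTYPES:
            raise ValueError(f"unknown dtype: {feature['dtype']!r}")
    elif "struct" in feature:
        for child in feature["struct"]:
            validate_feature(child)
    elif "list" in feature:
        # a list wraps an unnamed inner type, e.g. {"dtype": "string"}
        if feature["list"].get("dtype") not in ALLOWED_DTYPES:
            raise ValueError(f"unsupported list type: {feature['list']!r}")
    else:
        raise ValueError(f"feature needs one of dtype/struct/list: {feature!r}")
```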
> We can change this to `default` I think, or something else

I created https://github.com/huggingface/datasets/issues/4902 to discuss that
> Note also that the default config is not named `default`, afaiu, but created from the repo name

in case of single-config you can even hide the config name from the UI IMO
> I dug into features validation, see: the OpenAPI spec

in moon-landing we use Joi to validate metadata, so we would need to generate the Joi code from the OpenAPI spec (or from somewhere else), but I guess that's doable – or just rewrite it manually, as it won't change often
I remember there was an ongoing discussion on this topic:
- #3507

I recall some of the concerns raised in that discussion:
- @lhoestq: Tensorflow Datasets catalog includes a community catalog where you can find and use HF datasets. They are using the exported dataset_infos.json files from github to get the metadata: #3507 (comment)
- @severo: #3507 (comment)
  - the metadata header might be very long, before reaching the start of the README/dataset card.
  - It also somewhat prevents including large strings like the checksums
  - two concepts are mixed in the same file (metadata and documentation). This means that if you're interested only in one of them, you still have to know how to parse the whole file.
- @severo: the future "datasets server" could be in charge of generating the dataset-info.json file: #3507 (comment)
Thanks for bringing these points up !
> @lhoestq: Tensorflow Datasets catalog includes a community catalog where you can find and use HF datasets. They are using the exported dataset_infos.json files from github to get the metadata: https://github.com/huggingface/datasets/issues/3507#issuecomment-1056997627

The TFDS implementation is not super advanced, so it's ok IMO as long as we don't break all the dataset scripts. Note that users can still use `to_tf_dataset`.
We had a chance to discuss the two next points with @julien-c as well:
> @severo: https://github.com/huggingface/datasets/issues/3507#issuecomment-1042779776 the metadata header might be very long, before reaching the start of the README/dataset card.
If we don't add the checksums we should be fine. We can also set a maximum number of supported configs in the README to keep it readable.
> @severo: the future "datasets server" could be in charge of generating the dataset-info.json file: https://github.com/huggingface/datasets/issues/3507#issuecomment-1033752157
I guess the "HF Hub actions" could open PRs to do the same in the YAML directly
Thanks for linking that similar discussion for context, @albertvillanova!