Move DatasetInfo from `dataset_infos.json` to the YAML tags in `README.md`
Currently there are two places to find metadata for datasets:
- `dataset_infos.json`, which contains per dataset config:
  - description
  - citation
  - license
  - splits and sizes
  - checksums of the data files
  - feature types
  - and more
- YAML tags, which contain:
  - license
  - language
  - train-eval-index
  - and more
It would be nice to have a single place instead. We can rely on the YAML tags rather than the JSON file, for consistency with models. And it would all be indexed by our back-end directly, which is a plus.
One way would be to move everything to the YAML tags except the checksums (there can be tens of thousands of them). The description/citation are already in the dataset card, so we probably don't need them in the YAML as well; it would be redundant.
Here is an example for SQuAD:
```yaml
download_size: 35142551
dataset_size: 89789763
version: 1.0.0
splits:
- name: train
  num_examples: 87599
  num_bytes: 79317110
- name: validation
  num_examples: 10570
  num_bytes: 10472653
features:
- name: id
  dtype: string
- name: title
  dtype: string
- name: context
  dtype: string
- name: question
  dtype: string
- name: answers
  struct:
  - name: text
    list:
      dtype: string
  - name: answer_start
    list:
      dtype: int32
```
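For reference, here is a minimal sketch of how these feature types line up with the existing `datasets` feature classes (the `answers` struct of lists maps to a dict of `Sequence`s; this is just an illustration of the correspondence, not a spec for the YAML↔`Features` conversion):

```python
from datasets import Features, Sequence, Value

# SQuAD feature types from the YAML above, expressed with the existing classes:
# "answers" is a struct whose fields are lists of string / int32 values.
features = Features({
    "id": Value("string"),
    "title": Value("string"),
    "context": Value("string"),
    "question": Value("string"),
    "answers": {
        "text": Sequence(Value("string")),
        "answer_start": Sequence(Value("int32")),
    },
})
```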
Since there is only one configuration for SQuAD, this structure is ok. We can look at datasets with several configs in a second step, but IMO it would be ok to have these fields per config using another syntax:
```yaml
configs:
- config: unlabeled
  splits:
  - name: train
    num_examples: 10000
  features:
  - name: text
    dtype: string
- config: labeled
  splits:
  - name: train
    num_examples: 100
  features:
  - name: text
    dtype: string
  - name: label
    dtype: ClassLabel
    names:
    - negative
    - positive
```
So in the end you could specify a YAML tag either at the top level (for all configs) or per config in the `configs` field.
Alternatively we could keep config-specific stuff in the `dataset_infos.json` as it is today.
Not sure yet what's the best approach here but cc @julien-c @mariosasko @albertvillanova @polinaeterna for feedback :)
also @osanseviero @Pierrci @SBrandeis potentially
Love this in principle 🚀
Let's keep in mind users might rely on `dataset_infos.json` already.
I'm not convinced by the two-syntax solution; wouldn't it be simpler to have only one syntax, with a default config for datasets that have only one config? i.e., always having the `configs` field. This makes parsing the metadata easier IMO.
Might also be good to wrap the tags under a `dataset_infos` tag as follows:
```yaml
description: ...
citation: ...
dataset_infos:
  download_size: 35142551
  dataset_size: 89789763
  version: 1.0.0
  configs:
  - ...
[...]
```
Let's also keep in mind that extracting YAML metadata from a markdown readme is a bit more tedious for users than just parsing a JSON file.
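To make this concrete, here is a rough sketch of what each approach looks like for a consumer (assuming PyYAML and the usual `---`-delimited README header; not an official API):

```python
import json
import re

import yaml  # pyyaml


def load_readme_metadata(readme_path: str) -> dict:
    """Parse the YAML block between the leading '---' markers of a README."""
    with open(readme_path, encoding="utf-8") as f:
        text = f.read()
    match = re.match(r"^---\s*\n(.*?)\n---\s*\n", text, flags=re.DOTALL)
    return yaml.safe_load(match.group(1)) if match else {}


def load_json_metadata(path: str) -> dict:
    """The current equivalent: plain JSON."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```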
> Let's keep in mind users might rely on `dataset_infos.json` already.

Yea we'll keep full backward compatibility
> Let's also keep in mind that extracting YAML metadata from a markdown readme is a bit more tedious for users than just parsing a JSON file.

The main things that may use or ingest these data IMO are:
- users in the UI or IDE
- `datasets` to populate the `DatasetInfo` python object
- moon-landing, which is already parsing YAML

Am I missing something ? If not I think it's ok to use YAML
> Might also be good to wrap the tags under a `dataset_infos` tag as follows:

Maybe one single syntax like this then ?
```yaml
dataset_infos:
- config: unlabeled
  download_size: 35142551
  dataset_size: 89789763
  version: 1.0.0
  splits:
  - name: train
    num_examples: 10000
  features:
  - name: text
    dtype: string
- config: labeled
  download_size: 35142551
  dataset_size: 89789763
  version: 1.0.0
  splits:
  - name: train
    num_examples: 100
  features:
  - name: text
    dtype: string
  - name: label
    dtype: ClassLabel
    names:
    - negative
    - positive
```
and when you have only one config:
```yaml
dataset_infos:
- config: default
  splits:
  - name: train
    num_examples: 10000
  features:
  - name: text
    dtype: string
```
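A consumer could then normalize this list into a mapping keyed by config name, treating a missing `config` entry as the default. A minimal sketch, using the field names from the examples above (nothing official):

```python
def index_by_config(dataset_infos: list) -> dict:
    """Turn the YAML `dataset_infos` list into {config_name: info}, falling back to 'default'."""
    return {info.get("config", "default"): info for info in dataset_infos}


# With the single-config example above:
infos = index_by_config([
    {
        "config": "default",
        "splits": [{"name": "train", "num_examples": 10000}],
        "features": [{"name": "text", "dtype": "string"}],
    }
])
assert infos["default"]["splits"][0]["num_examples"] == 10000
```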
love the idea, and the trend in general to move more things (like tasks) to a single place (YAML).
also, if you browse files on a dataset's page (in "Files and versions"), raw `README.md` files look nice and readable, while `.json` files are just one long line that users need to scroll.
> Let's also keep in mind that extracting YAML metadata from a markdown readme is a bit more tedious for users than just parsing a JSON file.

do users often parse the `dataset_infos.json` file themselves?
> do users often parse the `dataset_infos.json` file themselves?

Not AFAIK, but I'm sure there are a few users. Users that access this info via `DatasetInfo` from `datasets` won't see the change though, e.g.
```python
>>> from datasets import get_dataset_infos
>>> get_dataset_infos("squad")
{'plain_text': DatasetInfo(description='Stanford Question Answering Dataset...
```
> Maybe one single syntax like this then ?

LGTM!
> The main things that may use or ingest these data IMO are:
> - users in the UI or IDE
> - `datasets` to populate the `DatasetInfo` python object
> - moon-landing, which is already parsing YAML

Fair point!
Having dataset info in the README's YAML is great for API / `huggingface_hub` consumers as well, as it will be inserted in the `cardData` field out of the box 🔥
Very supportive of this!
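As a sketch of what that enables (assuming `huggingface_hub`'s `HfApi`; the exact attribute name for the card data may differ between versions):

```python
from huggingface_hub import HfApi

info = HfApi().dataset_info("squad")
# The README YAML header is returned as the card data of the repo, so a
# `dataset_infos` section would show up here without any extra parsing.
card_data = getattr(info, "card_data", None) or getattr(info, "cardData", None)
print(card_data)
```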
Nesting an array of configs inside `dataset_infos:` sounds good to me. One small tweak is that `config: default` can be optional for the default config (which can be the first one by convention).

We'll be able to implement metadata validation on the Hub side so we ensure that those metadata are always in the right format (maybe for @coyotte508 ? cc @Pierrci). From a quick glance the `features` might be the harder part to validate here, any doc will be welcome.
Other high-level points:
- as we move from mostly academic datasets to all datasets (which include the data inside the repos), my intuition is that more and more datasets (Hub-stored) are going to be single-config
- similarly, fewer and fewer datasets will have a loading script, just the data + some metadata
- to lower the barrier to entry to contribution, in the long term users shouldn't need to compute/update this data via a command line. It could be filled automatically on the Hub through a "bot" inside Discussions & Pull requests for instance.
re: `config: default`

Note also that the default config is not named `default`, afaiu, but created from the repo name, e.g. for https://huggingface.co/datasets/nbtpj/bionlp2021SAS the default config is `nbtpj--bionlp2021SAS` (which is awful)
> Note also that the default config is not named `default`, afaiu, but created from the repo name, e.g. for https://huggingface.co/datasets/nbtpj/bionlp2021SAS the default config is `nbtpj--bionlp2021SAS` (which is awful)

We can change this to `default` I think, or something else
> From a quick glance the `features` might be the harder part to validate here, any doc will be welcome.
I dug into features validation, see:
- the OpenAPI spec: https://github.com/huggingface/datasets-server/blob/main/chart/static-files/openapi.json#L460-L697
- the node.js code: https://github.com/huggingface/moon-landing/blob/upgrade-datasets-server-client/server/lib/datasets/FeatureType.ts
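For illustration only (the real validation lives in the Joi schemas / OpenAPI spec linked above), a rough Python sketch of the kind of recursive check the `features` list needs:

```python
ALLOWED_DTYPES = {"string", "int32", "int64", "float32", "bool", "ClassLabel"}  # illustrative subset


def validate_feature(feature: dict) -> None:
    """Recursively check one entry of the YAML `features` list (sketch, not the real schema)."""
    if not isinstance(feature.get("name"), str):
        raise ValueError(f"feature needs a string `name`: {feature!r}")
    if "dtype" in feature:
        if feature["dtype"] not in ALLOWED_DTYPES:
            raise ValueError(f"unknown dtype: {feature['dtype']!r}")
    elif "struct" in feature:
        for child in feature["struct"]:
            validate_feature(child)
    elif "list" in feature:
        # a list wraps an unnamed inner type, e.g. {"dtype": "string"}
        if feature["list"].get("dtype") not in ALLOWED_DTYPES:
            raise ValueError(f"unsupported list type: {feature['list']!r}")
    else:
        raise ValueError(f"feature needs one of dtype/struct/list: {feature!r}")
```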
> We can change this to `default` I think, or something else

I created https://github.com/huggingface/datasets/issues/4902 to discuss that
> Note also that the default config is not named `default`, afaiu, but created from the repo name

in case of single-config you can even hide the config name from the UI IMO
> I dug into features validation, see: the OpenAPI spec

in moon-landing we use Joi to validate metadata, so we would need to generate the Joi code from the OpenAPI spec (or from somewhere else), but I guess that's doable – or just rewrite it manually, as it won't change often
I remember there was an ongoing discussion on this topic:
- #3507

I recall some of the concerns raised in that discussion:
- @lhoestq: Tensorflow Datasets catalog includes a community catalog where you can find and use HF datasets. They are using the exported dataset_infos.json files from github to get the metadata: #3507 (comment)
- @severo: #3507 (comment)
  - the metadata header might be very long, before reaching the start of the README/dataset card.
  - It also somewhat prevents including large strings like the checksums
  - two concepts are mixed in the same file (metadata and documentation). This means that if you're interested only in one of them, you still have to know how to parse the whole file.
- @severo: the future "datasets server" could be in charge of generating the dataset-info.json file: #3507 (comment)
Thanks for bringing these points up !
> @lhoestq: Tensorflow Datasets catalog includes a community catalog where you can find and use HF datasets. They are using the exported dataset_infos.json files from github to get the metadata: https://github.com/huggingface/datasets/issues/3507#issuecomment-1056997627

The TFDS implementation is not super advanced, so it's ok IMO as long as we don't break all the dataset scripts. Note that users can still use `to_tf_dataset`.
We had a chance to discuss the two next points with @julien-c as well:
> @severo: https://github.com/huggingface/datasets/issues/3507#issuecomment-1042779776 the metadata header might be very long, before reaching the start of the README/dataset card.
If we don't add the checksums we should be fine. We can also set a maximum number of supported configs in the README to keep it readable.
> @severo: the future "datasets server" could be in charge of generating the dataset-info.json file: https://github.com/huggingface/datasets/issues/3507#issuecomment-1033752157
I guess the "HF Hub actions" could open PRs to do the same in the YAML directly
Thanks for linking that similar discussion for context, @albertvillanova!