datasets load_dataset() should load all subsets, if no specific subset is specified

Feature request

Currently, load_dataset() forces users to specify a subset when a dataset defines several and has no default. Example:

from datasets import load_dataset
dataset = load_dataset("m-a-p/COIG-CQIA")

ValueError                                Traceback (most recent call last)
<ipython-input-10-c0cb49385da6> in <cell line: 2>()
      1 from datasets import load_dataset
----> 2 dataset = load_dataset("m-a-p/COIG-CQIA")

3 frames
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _create_builder_config(self, config_name, custom_features, **config_kwargs)
    582                     if not config_kwargs:
    583                         example_of_usage = f"load_dataset('{self.dataset_name}', '{self.BUILDER_CONFIGS[0].name}')"
--> 584                         raise ValueError(
    585                             "Config name is missing."
    586                             f"\nPlease pick one among the available configs: {list(self.builder_configs.keys())}"

ValueError: Config name is missing.
Please pick one among the available configs: ['chinese_traditional', 'coig_pc', 'exam', 'finance', 'douban', 'human_value', 'logi_qa', 'ruozhiba', 'segmentfault', 'wiki', 'wikihow', 'xhs', 'zhihu']
Example of usage:
	`load_dataset('coig-cqia', 'chinese_traditional')`

This means a dataset cannot contain all the subsets at the same time. I guess one workaround is to manually specify the subset files like in here, which is clumsy.

Motivation

Ideally, if no subset is specified, the API should just try to load all subsets. This would make it much easier to handle datasets w/ subsets.

Your contribution

Not sure since I'm not familiar w/ the lib src.

Jun 04 '24 11:06 windmaple

@xianbaoqian

Jun 04 '24 11:06 windmaple

Feel free to open a PR in m-a-p/COIG-CQIA to define a default subset. Currently there is no default.

You can find some documentation at https://huggingface.co/docs/hub/datasets-manual-configuration#multiple-configurations

Jun 13 '24 16:06 lhoestq

@lhoestq

While a default subset (e.g. all) provided by the dataset author is the ideal solution, it is not always the reality.

Without the ability to fork the dataset, this can be problematic.

As far as I know, it is not possible at all to specify multiple subsets in a generalized programmatic way without hard coding subset names for a specific dataset.

Even the ability to fetch subset names and loop over them would be sufficient.

Jun 24 '24 23:06 brthor

Please note that each subset can have different feature columns, which makes it impossible to load them all into a single Dataset instance.

That is why subsets were created: to allow different but related datasets to coexist in a single dataset repository.

If you would like to programmatically get the list of subset names, you can use datasets.get_dataset_config_names: https://huggingface.co/docs/datasets/v2.20.0/en/load_hub#configurations
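
For illustration, the "fetch subset names and loop over them" approach could be sketched roughly as below. This is only a sketch: load_all_subsets is a hypothetical helper name, not part of the datasets library, and its list_configs / load parameters are made up here so the loop can be tried with stand-in functions. By default it uses datasets.get_dataset_config_names and load_dataset, and it keeps one object per subset rather than merging them, since the columns may differ.

```python
from typing import Any, Callable, Dict, List, Optional


def load_all_subsets(
    path: str,
    list_configs: Optional[Callable[[str], List[str]]] = None,
    load: Optional[Callable[[str, str], Any]] = None,
) -> Dict[str, Any]:
    """Load every subset (config) of `path` into a dict keyed by subset name.

    Hypothetical helper for illustration; not an API of the `datasets` library.
    """
    if list_configs is None or load is None:
        # Imported lazily so the helper can also be run with stand-in callables.
        from datasets import get_dataset_config_names, load_dataset

        list_configs = list_configs or get_dataset_config_names
        load = load or load_dataset

    # One entry per subset: schemas may differ, so nothing is concatenated.
    return {name: load(path, name) for name in list_configs(path)}
```

With the defaults, load_all_subsets("m-a-p/COIG-CQIA") would download every subset listed in the error message above, and a single one could then be picked with e.g. subsets["ruozhiba"].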

Jun 25 '24 05:06 albertvillanova

I found a better method at another link; it returns not only the subsets but also the corresponding splits: https://huggingface.co/docs/dataset-viewer/splits
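
A rough sketch of how that response could be turned into a subset-to-splits mapping is shown below. It assumes the response is JSON with a "splits" list whose entries carry "config" and "split" keys, as described on the linked page. The SPLITS_URL constant and the group_splits_by_subset name are illustrative, not part of any library.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Assumed dataset-viewer endpoint; it takes a `dataset` query parameter.
SPLITS_URL = "https://datasets-server.huggingface.co/splits"


def group_splits_by_subset(payload: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group split names by subset (config) from a /splits-style JSON payload."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entry in payload.get("splits", []):
        # Each entry is assumed to look like
        # {"dataset": ..., "config": ..., "split": ...}.
        grouped[entry["config"]].append(entry["split"])
    return dict(grouped)
```

The payload itself could be fetched with any HTTP client, for example requests.get(SPLITS_URL, params={"dataset": "m-a-p/COIG-CQIA"}).json().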

Nov 26 '24 08:11 zhuwenxing
