
load_dataset() should load all subsets, if no specific subset is specified

Open windmaple opened this issue 1 year ago • 4 comments

Feature request

Currently load_dataset() forces users to specify a subset. Example:

from datasets import load_dataset
dataset = load_dataset("m-a-p/COIG-CQIA")

ValueError                                Traceback (most recent call last)
<ipython-input-10-c0cb49385da6> in <cell line: 2>()
      1 from datasets import load_dataset
----> 2 dataset = load_dataset("m-a-p/COIG-CQIA")

3 frames
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _create_builder_config(self, config_name, custom_features, **config_kwargs)
    582                     if not config_kwargs:
    583                         example_of_usage = f"load_dataset('{self.dataset_name}', '{self.BUILDER_CONFIGS[0].name}')"
--> 584                         raise ValueError(
    585                             "Config name is missing."
    586                             f"\nPlease pick one among the available configs: {list(self.builder_configs.keys())}"

ValueError: Config name is missing.
Please pick one among the available configs: ['chinese_traditional', 'coig_pc', 'exam', 'finance', 'douban', 'human_value', 'logi_qa', 'ruozhiba', 'segmentfault', 'wiki', 'wikihow', 'xhs', 'zhihu']
Example of usage:
	`load_dataset('coig-cqia', 'chinese_traditional')`

This means a dataset cannot be loaded with all of its subsets at once. I guess one workaround is to manually specify the subset files like here, which is clumsy.

Motivation

Ideally, if no subset is specified, the API should just try to load all subsets. This would make it much easier to handle datasets w/ subsets.

Your contribution

Not sure since I'm not familiar w/ the lib src.

windmaple, Jun 04 '24 11:06

@xianbaoqian

windmaple, Jun 04 '24 11:06

Feel free to open a PR in m-a-p/COIG-CQIA to define a default subset. Currently there is no default.

You can find some documentation at https://huggingface.co/docs/hub/datasets-manual-configuration#multiple-configurations

lhoestq, Jun 13 '24 16:06

@lhoestq

While a default subset (e.g. all) provided by the dataset author is the ideal solution, it is not always the reality.

Without the ability to fork the dataset, this can be problematic.

As far as I know, it is not possible at all to specify multiple subsets in a generalized programmatic way without hard coding subset names for a specific dataset.

Even the ability to fetch subset names and loop over them would be sufficient.

brthor, Jun 24 '24 23:06

Please note that each subset can have different feature columns, which makes it impossible to load them all into a single Dataset instance.

That is why subsets were created: to allow different but related datasets to coexist in a single dataset repository.

If you would like to programmatically get the list of subset names, you can use datasets.get_dataset_config_names: https://huggingface.co/docs/datasets/v2.20.0/en/load_hub#configurations
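A minimal sketch of that approach, looping over the subset names and loading each one separately. The helper name load_all_subsets and its two optional parameters are hypothetical, added only so the logic can be exercised without network access; by default it calls datasets.get_dataset_config_names and datasets.load_dataset.

```python
def load_all_subsets(repo_id, list_configs=None, loader=None):
    """Return a dict mapping each subset (config) name to its loaded dataset.

    `load_all_subsets`, `list_configs` and `loader` are hypothetical names for
    this sketch; they are not part of the `datasets` library.
    """
    if list_configs is None or loader is None:
        # Imported lazily so the helper can be tested with stand-in callables.
        import datasets  # Hugging Face `datasets` library

        list_configs = list_configs or datasets.get_dataset_config_names
        loader = loader or datasets.load_dataset

    # Each subset is kept as its own object because subsets may have
    # different feature columns and cannot always be merged.
    return {name: loader(repo_id, name) for name in list_configs(repo_id)}
```

Calling load_all_subsets("m-a-p/COIG-CQIA") would then give one entry per subset. The subsets are deliberately not concatenated, since their columns may differ.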

albertvillanova, Jun 25 '24 05:06

I found a better method in the dataset viewer docs that returns not only the subsets but also their corresponding splits: https://huggingface.co/docs/dataset-viewer/splits
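A rough sketch of how that could look, using only the standard library. The endpoint URL and the response shape (a "splits" list whose entries carry "config" and "split" keys) are assumptions based on my reading of that docs page, so please check them there before relying on this. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the dataset viewer API; verify against the linked docs.
SPLITS_ENDPOINT = "https://datasets-server.huggingface.co/splits"


def group_splits_by_subset(payload):
    """Group split names by subset (config) name from a /splits response."""
    grouped = {}
    for entry in payload.get("splits", []):
        grouped.setdefault(entry["config"], []).append(entry["split"])
    return grouped


def fetch_splits(repo_id, timeout=30):
    """Query the dataset viewer for a repository's subsets and splits."""
    url = f"{SPLITS_ENDPOINT}?{urlencode({'dataset': repo_id})}"
    with urlopen(url, timeout=timeout) as response:
        return group_splits_by_subset(json.load(response))
```

fetch_splits("m-a-p/COIG-CQIA") needs network access and returns a dict from subset name to the list of its split names.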

zhuwenxing, Nov 26 '24 08:11