datasets
load_dataset() should load all subsets, if no specific subset is specified
Feature request
Currently, load_dataset() forces users to specify a subset. Example:
from datasets import load_dataset
dataset = load_dataset("m-a-p/COIG-CQIA")
ValueError Traceback (most recent call last)
<ipython-input-10-c0cb49385da6> in <cell line: 2>()
1 from datasets import load_dataset
----> 2 dataset = load_dataset("m-a-p/COIG-CQIA")
3 frames
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _create_builder_config(self, config_name, custom_features, **config_kwargs)
582 if not config_kwargs:
583 example_of_usage = f"load_dataset('{self.dataset_name}', '{self.BUILDER_CONFIGS[0].name}')"
--> 584 raise ValueError(
585 "Config name is missing."
586 f"\nPlease pick one among the available configs: {list(self.builder_configs.keys())}"
ValueError: Config name is missing.
Please pick one among the available configs: ['chinese_traditional', 'coig_pc', 'exam', 'finance', 'douban', 'human_value', 'logi_qa', 'ruozhiba', 'segmentfault', 'wiki', 'wikihow', 'xhs', 'zhihu']
Example of usage:
`load_dataset('coig-cqia', 'chinese_traditional')`
This means a loaded dataset cannot contain all the subsets at the same time. I guess one workaround is to manually specify the subset files, like in here, which is clumsy.
Motivation
Ideally, if no subset is specified, the API should just try to load all subsets. This would make it much easier to handle datasets with subsets.
Your contribution
Not sure, since I'm not familiar with the library source.
@xianbaoqian
Feel free to open a PR in m-a-p/COIG-CQIA to define a default subset. Currently there is no default.
You can find some documentation at https://huggingface.co/docs/hub/datasets-manual-configuration#multiple-configurations
@lhoestq
While a default subset (e.g. all) made available by the dataset author is the ideal solution, it is not always the reality.
Without the ability to fork the dataset, this can be problematic.
As far as I know, it is not possible at all to specify multiple subsets in a generalized programmatic way without hard coding subset names for a specific dataset.
Even the ability to fetch subset names and loop over them would be sufficient.
Please note that each subset can have different feature columns, which makes it impossible to load them all into a single Dataset instance.
That is why subsets were created: to let different but related datasets coexist in a single dataset repository.
If you would like to programmatically get the list of subset names, you can use datasets.get_dataset_config_names: https://huggingface.co/docs/datasets/v2.20.0/en/load_hub#configurations
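As a rough illustration of the suggestion above, here is a minimal sketch that fetches the subset names and loads each one into a dict keyed by subset name. It keeps the subsets separate rather than merging them, since their columns may differ. The helper name `load_all_subsets` and its `list_configs` / `load_one` parameters are hypothetical, not part of the library; they default to `datasets.get_dataset_config_names` and `datasets.load_dataset`.

```python
from typing import Callable, Dict, Iterable, Optional


def load_all_subsets(
    path: str,
    list_configs: Optional[Callable[[str], Iterable[str]]] = None,
    load_one: Optional[Callable[[str, str], object]] = None,
) -> Dict[str, object]:
    """Load every subset (config) of a dataset into a dict keyed by subset name.

    Hypothetical helper: `list_configs` and `load_one` exist only so the
    function can be exercised with stand-ins. By default they fall back to
    the real library calls.
    """
    if list_configs is None or load_one is None:
        # Imported lazily so the helper itself does not require the library
        # when both callables are supplied by the caller.
        from datasets import get_dataset_config_names, load_dataset

        list_configs = list_configs or get_dataset_config_names
        load_one = load_one or load_dataset

    # One load per subset; nothing is concatenated because each subset
    # may have its own feature columns.
    return {name: load_one(path, name) for name in list_configs(path)}
```

With the library installed, `load_all_subsets("m-a-p/COIG-CQIA")` should return one entry per subset listed in the error message above, although this downloads every subset and can be slow for large repositories.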
I found a better method in another link that returns not only the subsets but also their corresponding splits: https://huggingface.co/docs/dataset-viewer/splits
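For anyone who wants to try the dataset viewer route, here is a small sketch using only the standard library. It assumes the `/splits` endpoint is served at `https://datasets-server.huggingface.co/splits` and that the JSON response has a top-level `splits` list whose entries carry `config` and `split` keys; please check both against the linked page. The function names are made up for this example.

```python
import json
from collections import defaultdict
from typing import Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint URL; verify against the dataset viewer documentation.
SPLITS_ENDPOINT = "https://datasets-server.huggingface.co/splits"


def group_splits_by_subset(payload: dict) -> Dict[str, List[str]]:
    """Turn a /splits response into {subset_name: [split_name, ...]}.

    Assumes each entry in payload["splits"] has "config" and "split" keys.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entry in payload.get("splits", []):
        grouped[entry["config"]].append(entry["split"])
    return dict(grouped)


def fetch_subsets_and_splits(dataset: str, timeout: float = 30.0) -> Dict[str, List[str]]:
    """Query the dataset viewer API and group the returned splits by subset."""
    url = f"{SPLITS_ENDPOINT}?{urlencode({'dataset': dataset})}"
    with urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return group_splits_by_subset(payload)
```

Calling `fetch_subsets_and_splits("m-a-p/COIG-CQIA")` needs network access and only works for datasets the viewer has processed. Unlike loading each subset, it does not download any data.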