streaming dataset with concatenating splits raises an error
Describe the bug
streaming dataset with concatenating splits raises an error
Steps to reproduce the bug
from datasets import load_dataset
# no error
repo = "nateraw/ade20k-tiny"
dataset = load_dataset(repo, split="train+validation")
from datasets import load_dataset
# error
repo = "nateraw/ade20k-tiny"
dataset = load_dataset(repo, split="train+validation", streaming=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
[<ipython-input-4-a6ae02d63899>](https://localhost:8080/#) in <module>()
3 # error
4 repo = "nateraw/ade20k-tiny"
----> 5 dataset = load_dataset(repo, split="train+validation", streaming=True)
1 frames
[/usr/local/lib/python3.7/dist-packages/datasets/builder.py](https://localhost:8080/#) in as_streaming_dataset(self, split, base_path)
1030 splits_generator = splits_generators[split]
1031 else:
-> 1032 raise ValueError(f"Bad split: {split}. Available splits: {list(splits_generators)}")
1033
1034 # Create a dataset for each of the given splits
ValueError: Bad split: train+validation. Available splits: ['validation', 'train']
Expected results
Either load successfully, or throw an error saying that split concatenation is not supported in streaming mode.
Actual results
The ValueError traceback shown above.
Environment info
- datasets version: 2.4.0
- Platform: Windows-10-10.0.22000-SP0 (Windows 11 x64)
- Python version: 3.9.13
- PyArrow version: 8.0.0
- Pandas version: 1.4.3
Hi! Only the name of a particular split ("train", "test", ...) is supported as a split pattern if streaming=True. We plan to address this limitation soon.
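Until that limitation is lifted, a possible workaround (a sketch, not an official API) is to load each split separately with streaming=True and chain the resulting iterables yourself. The generator functions below are stand-ins for the streaming splits so the sketch runs without network access; with a real repo each one would be the IterableDataset returned by load_dataset(repo, split="train", streaming=True) and load_dataset(repo, split="validation", streaming=True).

```python
from itertools import chain

# Stand-ins for two streaming splits. In practice these would be
# IterableDataset objects, e.g.:
#   train = load_dataset(repo, split="train", streaming=True)
#   val   = load_dataset(repo, split="validation", streaming=True)
def train_split():
    for i in range(3):
        yield {"split": "train", "idx": i}

def validation_split():
    for i in range(2):
        yield {"split": "validation", "idx": i}

# Chain the splits to emulate split="train+validation" in streaming mode:
# rows from "train" are exhausted first, then rows from "validation".
combined = chain(train_split(), validation_split())
rows = list(combined)
print(len(rows))  # 5 rows: 3 from train followed by 2 from validation
```

Recent datasets versions also expose concatenate_datasets / interleave_datasets helpers that accept iterable datasets, which may be preferable to a raw itertools.chain if you need to keep the result as an IterableDataset.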
Hi, have you addressed this yet?
Yes, the same error still occurs:
from datasets import load_dataset
# error
repo = "nateraw/ade20k-tiny"
dataset = load_dataset(repo, split="train+validation", streaming=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
[<ipython-input-3-a6ae02d63899>](https://localhost:8080/#) in <cell line: 5>()
3 # error
4 repo = "nateraw/ade20k-tiny"
----> 5 dataset = load_dataset(repo, split="train+validation", streaming=True)
1 frames
[/usr/local/lib/python3.10/dist-packages/datasets/builder.py](https://localhost:8080/#) in as_streaming_dataset(self, split, base_path)
1265 splits_generator = splits_generators[split]
1266 else:
-> 1267 raise ValueError(f"Bad split: {split}. Available splits: {list(splits_generators)}")
1268
1269 # Create a dataset for each of the given splits
ValueError: Bad split: train+validation. Available splits: ['train', 'validation']
Google Colab, datasets==2.12.0
- huggingface_hub version: 0.14.1
- Platform: Linux-5.10.147+-x86_64-with-glibc2.31
- Python version: 3.10.11
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /root/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers:
- FastAI: 2.7.12
- Tensorflow: 2.12.0
- Torch: 2.0.0+cu118
- Jinja2: 3.1.2
- Graphviz: 0.20.1
- Pydot: 1.4.2
- Pillow: 8.4.0
- hf_transfer: N/A
- gradio: N/A
- ENDPOINT: https://huggingface.co/
- HUGGINGFACE_HUB_CACHE: /root/.cache/huggingface/hub
- HUGGINGFACE_ASSETS_CACHE: /root/.cache/huggingface/assets
- HF_TOKEN_PATH: /root/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
Hi! This is still not fixed. It is an important feature for us: we want to stream the entire dataset so training can start quickly. Split slicing such as "train[:18%]" should also be enabled for streaming.
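For context on the slicing request: split="train[:18%]" is valid slice syntax for a non-streaming load_dataset call, but it is not accepted in streaming mode. A rough equivalent (a sketch, assuming the total row count of the split is known in advance) is to take the first ceil(fraction * total) rows from the stream, similar in spirit to IterableDataset.take(n). The generator below stands in for a streaming split so the sketch runs without network access.

```python
import math
from itertools import islice

def take_fraction(rows, total, fraction):
    """Yield the first ceil(fraction * total) rows from an iterator,
    emulating a 'train[:18%]' slice over a stream."""
    n = math.ceil(fraction * total)
    return islice(rows, n)

# Stand-in stream of 50 rows; a real streaming split would come from
# load_dataset(repo, split="train", streaming=True).
stream = ({"idx": i} for i in range(50))
first_18_percent = list(take_fraction(stream, total=50, fraction=0.18))
print(len(first_18_percent))  # 9 rows, i.e. ceil(0.18 * 50)
```

Note that this only approximates percent slicing: it requires knowing the split size up front (e.g. from the dataset's metadata), which a pure stream does not otherwise provide.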