
streaming dataset with concatenating splits raises an error

Open Bing-su opened this issue 3 years ago • 1 comments

Describe the bug

streaming dataset with concatenating splits raises an error

Steps to reproduce the bug

from datasets import load_dataset

# no error
repo = "nateraw/ade20k-tiny"
dataset = load_dataset(repo, split="train+validation")
from datasets import load_dataset

# error
repo = "nateraw/ade20k-tiny"
dataset = load_dataset(repo, split="train+validation", streaming=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-4-a6ae02d63899>](https://localhost:8080/#) in <module>()
      3 # error
      4 repo = "nateraw/ade20k-tiny"
----> 5 dataset = load_dataset(repo, split="train+validation", streaming=True)

1 frames
[/usr/local/lib/python3.7/dist-packages/datasets/builder.py](https://localhost:8080/#) in as_streaming_dataset(self, split, base_path)
   1030             splits_generator = splits_generators[split]
   1031         else:
-> 1032             raise ValueError(f"Bad split: {split}. Available splits: {list(splits_generators)}")
   1033 
   1034         # Create a dataset for each of the given splits

ValueError: Bad split: train+validation. Available splits: ['validation', 'train']

Colab

Expected results

It should either load successfully or raise an error stating that concatenated splits are not supported in streaming mode.

Actual results

See the traceback above.

Environment info

  • datasets version: 2.4.0
  • Platform: Windows-10-10.0.22000-SP0 (windows11 x64)
  • Python version: 3.9.13
  • PyArrow version: 8.0.0
  • Pandas version: 1.4.3

Bing-su avatar Aug 09 '22 02:08 Bing-su

Hi! Only the name of a particular split ("train", "test", ...) is supported as a split pattern if streaming=True. We plan to address this limitation soon.

mariosasko avatar Aug 17 '22 12:08 mariosasko

Hi, have you addressed this yet?

surya-narayanan avatar May 11 '23 00:05 surya-narayanan

Yes, the same error still occurs:

from datasets import load_dataset

# error
repo = "nateraw/ade20k-tiny"
dataset = load_dataset(repo, split="train+validation", streaming=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-3-a6ae02d63899>](https://localhost:8080/#) in <cell line: 5>()
      3 # error
      4 repo = "nateraw/ade20k-tiny"
----> 5 dataset = load_dataset(repo, split="train+validation", streaming=True)

1 frames
[/usr/local/lib/python3.10/dist-packages/datasets/builder.py](https://localhost:8080/#) in as_streaming_dataset(self, split, base_path)
   1265             splits_generator = splits_generators[split]
   1266         else:
-> 1267             raise ValueError(f"Bad split: {split}. Available splits: {list(splits_generators)}")
   1268 
   1269         # Create a dataset for each of the given splits

ValueError: Bad split: train+validation. Available splits: ['train', 'validation']

Google Colab, datasets==2.12.0

- huggingface_hub version: 0.14.1
- Platform: Linux-5.10.147+-x86_64-with-glibc2.31
- Python version: 3.10.11
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /root/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers: 
- FastAI: 2.7.12
- Tensorflow: 2.12.0
- Torch: 2.0.0+cu118
- Jinja2: 3.1.2
- Graphviz: 0.20.1
- Pydot: 1.4.2
- Pillow: 8.4.0
- hf_transfer: N/A
- gradio: N/A
- ENDPOINT: https://huggingface.co/
- HUGGINGFACE_HUB_CACHE: /root/.cache/huggingface/hub
- HUGGINGFACE_ASSETS_CACHE: /root/.cache/huggingface/assets
- HF_TOKEN_PATH: /root/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False

Bing-su avatar May 11 '23 01:05 Bing-su

Hi! This still isn't fixed. It's an important feature for us: we want to stream the entire dataset so that training stays fast. Percentage slicing such as "train[:18%]" should also be supported in streaming mode.

fahiers avatar Nov 25 '23 14:11 fahiers