Using load_dataset with data_files and split arguments yields an error

Open devon-research opened this issue 1 year ago • 1 comment

Describe the bug

It seems that the list of valid splits recorded by the package is incorrectly overwritten when the data_files argument is used.

If I run

from datasets import load_dataset
load_dataset("allenai/super", split="all_examples", data_files="tasks/expert.jsonl")

then I get the error

ValueError: Unknown split "all_examples". Should be one of ['train'].

However, if I run

from datasets import load_dataset
load_dataset("allenai/super", split="train", name="Expert")

then I get

ValueError: Unknown split "train". Should be one of ['all_examples'].
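
A possible explanation, not confirmed in this thread: when data_files is passed as a plain string, datasets assigns those files to a default "train" split, so the split name defined in the repository configuration is not available. The sketch below shows one workaround that may help, which is to pass data_files as a dict keyed by the desired split name. The helper names build_load_kwargs and load_named_split are hypothetical and exist only for illustration, and the call has not been tested against allenai/super.

```python
from typing import Any


def build_load_kwargs(split: str, data_file: str) -> dict[str, Any]:
    """Build load_dataset keyword arguments that name the split explicitly."""
    # A dict-valued data_files maps split names to files, so the requested
    # split exists under the same name instead of the default "train".
    return {"data_files": {split: data_file}, "split": split}


def load_named_split(repo_id: str, split: str, data_file: str):
    """Load one file from a Hub dataset repository under the given split name."""
    # Imported lazily so build_load_kwargs can be used without datasets installed.
    from datasets import load_dataset

    return load_dataset(repo_id, **build_load_kwargs(split, data_file))


# Usage (requires network access):
# ds = load_named_split("allenai/super", "all_examples", "tasks/expert.jsonl")
```

If this works, both the split and the file are chosen by the caller, which avoids relying on either the default split name or the split names declared in the repository configuration.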

Steps to reproduce the bug

Run

from datasets import load_dataset
load_dataset("allenai/super", split="all_examples", data_files="tasks/expert.jsonl")

Expected behavior

No error.

Environment info

Python = 3.12
datasets = 3.2.0

Feb 12 '25 04:02 devon-research

Hi,
I would like to work on this issue, which involves adding a verification test for the HuggingFaceM4/InterleavedWebDocuments dataset.
This will be my first contribution to datasets.
I plan to add a simple loading and basic structure verification test, like other recent dataset tests.
You can expect a PR in the next few hours or days.
Thanks for the good first issue!

Nov 21 '25 14:11 venkatsai2004