datasets
Using load_dataset with data_files and split arguments yields an error
Describe the bug
It seems that the list of valid splits recorded by the package is incorrectly overwritten when the data_files argument is used.
If I run
from datasets import load_dataset
load_dataset("allenai/super", split="all_examples", data_files="tasks/expert.jsonl")
then I get the error
ValueError: Unknown split "all_examples". Should be one of ['train'].
However, if I run
from datasets import load_dataset
load_dataset("allenai/super", split="train", name="Expert")
then I get
ValueError: Unknown split "train". Should be one of ['all_examples'].
Steps to reproduce the bug
Run
from datasets import load_dataset
load_dataset("allenai/super", split="all_examples", data_files="tasks/expert.jsonl")
Expected behavior
No error.
Environment info
Python = 3.12
datasets = 3.2.0
Hi,
I would like to work on this issue, which involves adding a verification test for the HuggingFaceM4/InterleavedWebDocuments dataset.
This will be my first contribution to datasets.
I plan to add a simple loading and basic structure verification test, like other recent dataset tests.
You can expect a PR in the next few hours or days.
Thanks for the good first issue!