NonMatchingSplitsSizesError and ExpectedMoreSplitsError
Describe the bug
When loading a dataset, the info specified by data_files does not override the dataset's original info.
Steps to reproduce the bug
from datasets import load_dataset
traindata = load_dataset(
    "allenai/c4",
    "en",
    data_files={
        "train": "en/c4-train.00000-of-01024.json.gz",
        "validation": "en/c4-validation.00000-of-00008.json.gz",
    },
)
NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=828589180707, num_examples=364868892, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=809262831, num_examples=356317, shard_lengths=[223006, 133311], dataset_name='c4')}, {'expected': SplitInfo(name='validation', num_bytes=825767266, num_examples=364608, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='validation', num_bytes=102199431, num_examples=45576, shard_lengths=None, dataset_name='c4')}]
from datasets import load_dataset
traindata = load_dataset(
    "allenai/c4",
    "en",
    data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
    split="train",
)
ExpectedMoreSplitsError: {'validation'}
Expected behavior
No error
Environment info
datasets 4.0.0
To load just one shard without errors, pass the file through data_files with split set to "train", but don’t specify "allenai/c4", since that points to the full dataset with all shards.
Instead, do this:
from datasets import load_dataset
# Load only one shard of C4
traindata = load_dataset(
    "json",  # <-- use "json" since you’re directly passing JSON files
    data_files={"train": "https://huggingface.co/datasets/allenai/c4/resolve/main/en/c4-train.00000-of-01024.json.gz"},
    split="train",
)
print(traindata)
If you want both train and validation but only a subset of shards, do:
traindata = load_dataset(
    "json",
    data_files={
        "train": "https://huggingface.co/datasets/allenai/c4/resolve/main/en/c4-train.00000-of-01024.json.gz",
        "validation": "https://huggingface.co/datasets/allenai/c4/resolve/main/en/c4-validation.00000-of-00008.json.gz",
    },
)
print(traindata)
I just want to load a few files from allenai/c4. If I do not specify allenai/c4, where will the files be loaded from?
My apologies, I’ve modified my previous answer. You just need to specify the full URL of each file, for example:
https://huggingface.co/datasets/allenai/c4/resolve/main/en/c4-train.00000-of-01024.json.gz
I hope this updated answer is helpful.
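If you need more than a couple of shards, writing each URL by hand gets tedious. Here is a minimal sketch of a helper that builds the URLs from the file naming pattern seen in the examples above. The names BASE_URL, c4_shard_url, and load_c4_subset are hypothetical helpers of my own, not part of the datasets library, and the naming pattern is only assumed from the two file names in this thread.

```python
# Hypothetical helpers; the URL pattern is inferred from the file names in this thread.
BASE_URL = "https://huggingface.co/datasets/allenai/c4/resolve/main"


def c4_shard_url(split: str, index: int, total: int, config: str = "en") -> str:
    """Build the full URL of one C4 shard, e.g. c4-train.00000-of-01024.json.gz."""
    return f"{BASE_URL}/{config}/c4-{split}.{index:05d}-of-{total:05d}.json.gz"


def load_c4_subset(num_train_shards: int = 2):
    """Load the first few train shards and the first validation shard."""
    # Imported here so the URL helper above works without datasets installed.
    from datasets import load_dataset

    data_files = {
        # A list of URLs is accepted for a split, as well as a single string.
        "train": [c4_shard_url("train", i, 1024) for i in range(num_train_shards)],
        "validation": c4_shard_url("validation", 0, 8),
    }
    # Use the generic "json" builder, as in the answer above.
    return load_dataset("json", data_files=data_files)
```

Calling load_c4_subset(2) should then fetch only the first two train shards and one validation shard, rather than the full dataset.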