datasets NonMatchingSplitsSizesError and ExpectedMoreSplitsError

Describe the bug

When loading a dataset, the info derived from data_files does not overwrite the original dataset info.

Steps to reproduce the bug

from datasets import load_dataset

traindata = load_dataset(
    "allenai/c4",
    "en",
    data_files={
        "train": "en/c4-train.00000-of-01024.json.gz",
        "validation": "en/c4-validation.00000-of-00008.json.gz",
    },
)

NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=828589180707, num_examples=364868892, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=809262831, num_examples=356317, shard_lengths=[223006, 133311], dataset_name='c4')}, {'expected': SplitInfo(name='validation', num_bytes=825767266, num_examples=364608, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='validation', num_bytes=102199431, num_examples=45576, shard_lengths=None, dataset_name='c4')}]

from datasets import load_dataset

traindata = load_dataset(
    "allenai/c4",
    "en",
    data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
    split="train",
)

ExpectedMoreSplitsError: {'validation'}

Expected behavior

No error

Environment info

datasets 4.0.0

Aug 07 '25 04:08 efsotr

To load just one shard without errors, use the "json" builder and pass the shard's full URL in data_files, with split set to "train". Don’t pass "allenai/c4" as the dataset name, since that points to the full dataset with all shards.

Instead, do this:

from datasets import load_dataset

# Load only one shard of C4
traindata = load_dataset(
    "json",   # <-- use "json" since you’re directly passing JSON files
    data_files={"train": "https://huggingface.co/datasets/allenai/c4/resolve/main/en/c4-train.00000-of-01024.json.gz"},
    split="train"
)

print(traindata)

If you want both train and validation but only a subset of shards, do:

traindata = load_dataset(
    "json",
    data_files={
        "train": "https://huggingface.co/datasets/allenai/c4/resolve/main/en/c4-train.00000-of-01024.json.gz",
        "validation": "https://huggingface.co/datasets/allenai/c4/resolve/main/en/c4-validation.00000-of-00008.json.gz"
    }
)

print(traindata)
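
If you would rather keep the "allenai/c4" repo id and the relative paths from the original question, a commonly used workaround is to skip split verification through the verification_mode argument of load_dataset. This is only a sketch and was not run against datasets 4.0.0 in this thread. The helper name load_c4_subset is hypothetical, and it assumes that "no_checks" disables the checks that raise the two errors above.

```python
def load_c4_subset(data_files, split=None):
    """Load selected allenai/c4 files while skipping split verification.

    Hypothetical helper: assumes verification_mode="no_checks" bypasses the
    checks behind NonMatchingSplitsSizesError and ExpectedMoreSplitsError.
    """
    # Imported lazily so that defining the helper does not need the library.
    from datasets import load_dataset

    return load_dataset(
        "allenai/c4",
        "en",
        data_files=data_files,
        split=split,
        # Skip the size/split checks against the full dataset's recorded info.
        verification_mode="no_checks",
    )


# Example call (downloads the shard, so it is left commented out):
# traindata = load_c4_subset(
#     {"train": "en/c4-train.00000-of-01024.json.gz"}, split="train"
# )
```

With verification skipped, nothing checks that the loaded splits match the dataset card, so it is worth printing the result to confirm the row counts look right.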

Oct 05 '25 10:10 hBouanane

I just want to load a few files from allenai/c4. If I do not specify allenai/c4, where will the files be loaded from?

Oct 06 '25 16:10 efsotr

My apologies, I’ve modified my previous answer. You just need to specify the full path, for example:

https://huggingface.co/datasets/allenai/c4/resolve/main/en/c4-train.00000-of-01024.json.gz
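
For more than one shard, the URLs can be generated rather than typed out. The following is a sketch only: the helper names are hypothetical, and the file-name pattern (five-digit, zero-padded shard index and total) is inferred from the two file names quoted in this thread, with the totals of 1024 train shards and 8 validation shards taken from those names.

```python
BASE_URL = "https://huggingface.co/datasets/allenai/c4/resolve/main"


def c4_shard_url(split: str, index: int, total: int, config: str = "en") -> str:
    """Build the full URL of one C4 shard (naming pattern is an assumption)."""
    return f"{BASE_URL}/{config}/c4-{split}.{index:05d}-of-{total:05d}.json.gz"


def c4_data_files(n_train: int = 1, n_validation: int = 1) -> dict:
    """Return a data_files mapping with the first few shards of each split."""
    return {
        "train": [c4_shard_url("train", i, 1024) for i in range(n_train)],
        "validation": [c4_shard_url("validation", i, 8) for i in range(n_validation)],
    }


data_files = c4_data_files(n_train=2, n_validation=1)
print(data_files["train"][0])
# prints https://huggingface.co/datasets/allenai/c4/resolve/main/en/c4-train.00000-of-01024.json.gz
```

The resulting dict can be passed as data_files to load_dataset("json", ...), in the same way as the single-URL calls in the earlier answer.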

I hope this updated answer is helpful.

Oct 06 '25 21:10 hBouanane