ExpectedMoreSplits error when loading C4 dataset
Describe the bug
I encounter a bug when running the example command line:
python main.py \
--model decapoda-research/llama-7b-hf \
--prune_method wanda \
--sparsity_ratio 0.5 \
--sparsity_type unstructured \
--save out/llama_7b/unstructured/wanda/
The bug occurs at these lines of code (when loading the C4 dataset):
traindata = load_dataset('allenai/c4', 'allenai--c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
valdata = load_dataset('allenai/c4', 'allenai--c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')
The error message states:
raise ExpectedMoreSplits(str(set(expected_splits) - set(recorded_splits)))
datasets.utils.info_utils.ExpectedMoreSplits: {'validation'}
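The raise line in the traceback computes a set difference between the splits that were expected and the splits that were actually recorded. A minimal plain-Python sketch of that check follows. The exception class here is a local stand-in, not the library's class, and the split lists passed in are illustrative assumptions rather than values read from the datasets internals.

```python
class ExpectedMoreSplits(Exception):
    """Local stand-in for the exception named in the traceback."""


def missing_splits(expected_splits, recorded_splits):
    # Same expression as in the traceback: splits expected but not recorded.
    return set(expected_splits) - set(recorded_splits)


def check_splits(expected_splits, recorded_splits):
    # Raise when at least one expected split was not recorded.
    missing = missing_splits(expected_splits, recorded_splits)
    if missing:
        raise ExpectedMoreSplits(str(missing))
```

With these assumed inputs, expecting both 'train' and 'validation' while only 'train' is recorded leaves {'validation'} as the difference, which matches the shape of the message above.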
Steps to reproduce the bug
- Run the example command line above.
Expected behavior
The error message states:
raise ExpectedMoreSplits(str(set(expected_splits) - set(recorded_splits)))
datasets.utils.info_utils.ExpectedMoreSplits: {'validation'}
Environment info
I'm using CUDA 12.4, so I installed PyTorch with pip instead of the conda command provided in install.md.
I also tried another environment set up with the exact commands in install.md, but the same bug occurred.
Hi! We updated the allenai/c4 repository to allow people to specify which language to load easily (see the c4 dataset page).
To fix this issue you can update datasets and remove the mention of the legacy configuration name "allenai--c4":
traindata = load_dataset('allenai/c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
valdata = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')
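For anyone applying this in their own script, here is a small sketch of the same two calls wrapped in helpers, so the legacy configuration name cannot slip back in. The helper names c4_shard_kwargs and load_c4_shard are hypothetical and not part of datasets or Wanda. The shard paths are the ones from this thread. The datasets import is deferred so that the argument builder can be checked without the library or network access.

```python
# Shard files taken from the calls in this thread.
C4_SHARDS = {
    "train": "en/c4-train.00000-of-01024.json.gz",
    "validation": "en/c4-validation.00000-of-00008.json.gz",
}


def c4_shard_kwargs(split):
    """Build keyword arguments for load_dataset with no config name."""
    return {
        "path": "allenai/c4",
        "data_files": {split: C4_SHARDS[split]},
        "split": split,
    }


def load_c4_shard(split):
    """Load one C4 shard; downloads data, so it needs network access."""
    from datasets import load_dataset  # deferred: only needed when loading

    return load_dataset(**c4_shard_kwargs(split))
```

Calling load_c4_shard("train") and load_c4_shard("validation") is then meant to be equivalent to the two lines above.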
Did you solve this problem? I have the same bug. Deleting "allenai--c4" does not help.
Did you solve it? I ran into this problem too.
Even after I remove allenai--c4, it still fails.
For me it works this way. I'm using datasets version 2.17.0
First, pip install --upgrade datasets.
Second, update the following two lines of code in data.py (in lib):
traindata = load_dataset('allenai/c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
valdata = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')
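Since removing the config name did not help some people until they upgraded, it may be worth checking which datasets version is actually installed in the environment that runs main.py. Below is a rough sketch using only the standard library. The 2.17.0 reference is simply the version reported as working in this comment, not a documented minimum, and the helper names are made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Naive parse: keep the leading numeric components, e.g. '2.17.0' -> (2, 17, 0)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(installed, reference="2.17.0"):
    """Compare two version strings after padding them to the same length."""
    a, b = parse_version(installed), parse_version(reference)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(package="datasets"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None
```

If installed_version() returns None or a value for which is_at_least(...) is False, running the upgrade command first and then retrying is a reasonable next step.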
The error is in the Wanda repository: https://github.com/locuslab/wanda
- https://github.com/locuslab/wanda/issues/57
Concretely, in these code lines: https://github.com/locuslab/wanda/blob/8e8fc87b4a2f9955baa7e76e64d5fce7fa8724a6/lib/data.py#L43-L44
Please report there and/or make the fix in their code.
traindata = load_dataset('allenai/c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
valdata = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')
Solved for me ! Thanks!