
ExpectedMoreSplits error when loading C4 dataset

billwang485 opened this issue 1 year ago • 5 comments

Describe the bug

I encountered a bug when running the example command line:

    python main.py \
    --model decapoda-research/llama-7b-hf \
    --prune_method wanda \
    --sparsity_ratio 0.5 \
    --sparsity_type unstructured \
    --save out/llama_7b/unstructured/wanda/ 

The error is raised at these lines of code (when loading the C4 dataset):

    traindata = load_dataset('allenai/c4', 'allenai--c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
    valdata = load_dataset('allenai/c4', 'allenai--c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')

The error message states:

    raise ExpectedMoreSplits(str(set(expected_splits) - set(recorded_splits)))
    datasets.utils.info_utils.ExpectedMoreSplits: {'validation'}

Steps to reproduce the bug

  1. Run the example command line above; the error is raised while loading the C4 dataset.

Expected behavior

The C4 train and validation files should load without error. Instead, the ExpectedMoreSplits: {'validation'} error shown above is raised.

Environment info

I'm using CUDA 12.4, so I installed PyTorch with pip instead of the conda command provided in install.md.

Also, I've tried another environment set up with the same commands from install.md, but the same bug occurred.

billwang485 · Mar 21 '24

Hi! We updated the allenai/c4 repository to allow people to specify which language to load easily (see the c4 dataset page).

To fix this issue, you can update datasets and remove the mention of the legacy configuration name "allenai--c4":

    traindata = load_dataset('allenai/c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
    valdata = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')
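
As a sanity check before calling load_dataset, here is a minimal sketch that fails fast when an older datasets release is still installed. It assumes Python 3.8+ and the packaging library (which datasets itself depends on); the 2.17.0 floor is taken from a report later in this thread, not from official documentation:

    # Minimal sketch: fail fast if the installed `datasets` is too old to resolve
    # allenai/c4 without the legacy "allenai--c4" configuration name.
    # The 2.17.0 floor is an assumption based on a report later in this thread.
    from importlib.metadata import version
    from packaging.version import Version

    if Version(version("datasets")) < Version("2.17.0"):
        raise RuntimeError(
            "Please run `pip install --upgrade datasets` before loading allenai/c4"
        )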

lhoestq · Mar 21 '24

Did you solve this problem? I have the same bug. Deleting "allenai--c4" does not help.

K-THU · Apr 01 '24

Did you solve it? I ran into this problem too.

xuChenSJTU · Apr 09 '24

But after I remove allenai--c4, it still fails.

ssy-small-white · Apr 21 '24

For me it works this way. I'm using datasets version 2.17.0.

davidbhoffmann · Apr 22 '24

First, run pip install --upgrade datasets. Second, update the following two lines of code in data.py (in lib):

    traindata = load_dataset('allenai/c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
    valdata = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')

chaosright · Jul 25 '24

The error is in the Wanda repository: https://github.com/locuslab/wanda

  • https://github.com/locuslab/wanda/issues/57

Concretely, in these code lines: https://github.com/locuslab/wanda/blob/8e8fc87b4a2f9955baa7e76e64d5fce7fa8724a6/lib/data.py#L43-L44

Please report there and/or make the fix in their code.
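
For reference, a minimal sketch of how the linked lines in the Wanda repository's lib/data.py could be patched; the surrounding function name and signature (get_c4) are assumptions for illustration and may not exactly match the actual file:

    # Hedged sketch of a patched C4 loader for lib/data.py in the Wanda repo.
    # The function name/signature below is an assumption for illustration only.
    from datasets import load_dataset

    def get_c4(nsamples, seed, seqlen, tokenizer):
        # Pass only the repo id and data_files; drop the legacy 'allenai--c4'
        # configuration name, per the maintainer's suggestion earlier in this thread.
        traindata = load_dataset(
            'allenai/c4',
            data_files={'train': 'en/c4-train.00000-of-01024.json.gz'},
            split='train',
        )
        valdata = load_dataset(
            'allenai/c4',
            data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'},
            split='validation',
        )
        # ... downstream sampling/tokenization logic unchanged ...
        return traindata, valdata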

albertvillanova · Jul 29 '24

    traindata = load_dataset('allenai/c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
    valdata = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')

Solved for me! Thanks!

SimWangArizona · Sep 18 '24