
ExpectedMoreSplits error when loading C4 dataset

billwang485 opened this issue 1 year ago • 5 comments

Describe the bug

I encountered a bug when running the example command line:

    python main.py \
    --model decapoda-research/llama-7b-hf \
    --prune_method wanda \
    --sparsity_ratio 0.5 \
    --sparsity_type unstructured \
    --save out/llama_7b/unstructured/wanda/ 

The error is raised at these lines of code (when loading the C4 dataset):

    traindata = load_dataset('allenai/c4', 'allenai--c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
    valdata = load_dataset('allenai/c4', 'allenai--c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')

The error message states:

    raise ExpectedMoreSplits(str(set(expected_splits) - set(recorded_splits)))
    datasets.utils.info_utils.ExpectedMoreSplits: {'validation'}

Steps to reproduce the bug

  1. Run the example command line above; the error is raised while loading the C4 dataset.

Expected behavior

The C4 train and validation files should load without error. Instead, the ExpectedMoreSplits: {'validation'} error shown above is raised.

Environment info

I'm using CUDA 12.4, so I installed PyTorch with pip instead of the conda command provided in install.md.

Also, I've tried another environment set up with the same commands from install.md, but the same bug occurred.

billwang485 · Mar 21 '24

Hi! We updated the allenai/c4 repository to allow people to specify which language to load easily (see the c4 dataset page).

To fix this issue, you can update datasets and remove the mention of the legacy configuration name "allenai--c4":

    traindata = load_dataset('allenai/c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
    valdata = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')
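
As a sanity check before calling load_dataset, here is a minimal sketch that fails fast when an older datasets release is still installed. It assumes Python 3.8+ and the packaging library (which datasets itself depends on); the 2.17.0 floor is taken from a report later in this thread, not from official documentation:

    # Minimal sketch: fail fast if the installed `datasets` is too old to resolve
    # allenai/c4 without the legacy "allenai--c4" configuration name.
    # The 2.17.0 floor is an assumption based on a report later in this thread.
    from importlib.metadata import version
    from packaging.version import Version

    if Version(version("datasets")) < Version("2.17.0"):
        raise RuntimeError(
            "Please run `pip install --upgrade datasets` before loading allenai/c4"
        )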

lhoestq · Mar 21 '24

Did you solve this problem? I have the same bug. Deleting "allenai--c4" does not help.

K-THU · Apr 01 '24

Did you solve it? I ran into this problem too.

xuChenSJTU · Apr 09 '24

But after I remove allenai--c4, it still fails.

ssy-small-white · Apr 21 '24

For me it works this way. I'm using datasets version 2.17.0.

davidbhoffmann · Apr 22 '24

First, run pip install --upgrade datasets. Second, update the following two lines of code in data.py (in lib):

    traindata = load_dataset('allenai/c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
    valdata = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')

chaosright · Jul 25 '24

The error is in the Wanda repository: https://github.com/locuslab/wanda

  • https://github.com/locuslab/wanda/issues/57

Concretely, in these code lines: https://github.com/locuslab/wanda/blob/8e8fc87b4a2f9955baa7e76e64d5fce7fa8724a6/lib/data.py#L43-L44

Please report there and/or make the fix in their code.
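
For reference, a minimal sketch of how the linked lines in the Wanda repository's lib/data.py could be patched; the surrounding function name and signature (get_c4) are assumptions for illustration and may not exactly match the actual file:

    # Hedged sketch of a patched C4 loader for lib/data.py in the Wanda repo.
    # The function name/signature below is an assumption for illustration only.
    from datasets import load_dataset

    def get_c4(nsamples, seed, seqlen, tokenizer):
        # Pass only the repo id and data_files; drop the legacy 'allenai--c4'
        # configuration name, per the maintainer's suggestion earlier in this thread.
        traindata = load_dataset(
            'allenai/c4',
            data_files={'train': 'en/c4-train.00000-of-01024.json.gz'},
            split='train',
        )
        valdata = load_dataset(
            'allenai/c4',
            data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'},
            split='validation',
        )
        # ... downstream sampling/tokenization logic unchanged ...
        return traindata, valdata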

albertvillanova · Jul 29 '24

    traindata = load_dataset('allenai/c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
    valdata = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')

Solved for me! Thanks!

SimWangArizona · Sep 18 '24