LLM-Pruner

builder_cls is None while loading bookcorpus

Open Charlly-D opened this issue 1 year ago • 6 comments

At the step traindata = load_dataset('bookcorpus', split='train'), the call builder_cls = get_dataset_builder_class(dataset_module, dataset_name=dataset_name) returns None for builder_cls, so it fails with the error: builder_instance: DatasetBuilder = builder_cls( TypeError: 'NoneType' object is not callable

Charlly-D avatar Nov 16 '24 01:11 Charlly-D

You can download the data from this link (https://storage.googleapis.com/huggingface-nlp/datasets/bookcorpus/bookcorpus.tar.bz2) and extract it to a folder named "bookcorpus". I solved the issue by doing this, and I hope it helps you as well. @Charlly-D
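
A minimal sketch of the workaround described above, assuming the archive has already been downloaded next to the script as bookcorpus.tar.bz2. The helper names extract_archive and load_local_text are hypothetical and are not part of LLM-Pruner. Loading through the generic "text" builder of the datasets library is an assumption made here for illustration; the thread itself only says that extracting into a folder named "bookcorpus" resolved the issue.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, target_dir="bookcorpus"):
    """Extract a .tar.bz2 archive into target_dir and return the .txt files found.

    Hypothetical helper: the folder name "bookcorpus" follows the comment above.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # "r:bz2" makes tarfile decompress the bzip2 stream while reading the tar.
    with tarfile.open(archive_path, "r:bz2") as tar:
        tar.extractall(target)
    # Search recursively in case the archive contains a subfolder.
    return sorted(target.rglob("*.txt"))


def load_local_text(txt_files):
    """Load plain-text files with the generic "text" builder (assumed approach)."""
    # Imported lazily so the extraction step works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "text",
        data_files={"train": [str(p) for p in txt_files]},
        split="train",
    )


# Example usage (not executed here):
# files = extract_archive("bookcorpus.tar.bz2")
# traindata = load_local_text(files)
```

The "text" builder produces one example per line in a column named "text", so code that reads traindata["text"] should keep working, though that has not been checked against LLM-Pruner itself.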

pennyLuo-hub avatar Nov 25 '24 12:11 pennyLuo-hub

Oh, thank you very much. May I ask if you know why it doesn't work? I also find that when a dataset has to be loaded through a .py script, it reports the error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte". @pennyLuo-hub

Charlly-D avatar Nov 26 '24 02:11 Charlly-D

Maybe it's because the file is bookcorpus.tar.bz2 and hasn't been extracted. After extraction, the data consists of books_large_p1.txt and books_large_p2.txt. @Charlly-D
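
One way to check this kind of guess is to look at the first bytes of the file being read. A gzip stream starts with the bytes 0x1f 0x8b, and a bzip2 stream starts with the ASCII characters "BZh". The byte 0x8b at position 1 in the error quoted earlier in the thread matches the gzip signature, so the file that was decoded as UTF-8 may have been gzip-compressed; which file that was is not stated in the thread. The sketch below uses a hypothetical helper named sniff_compression.

```python
def sniff_compression(path):
    """Guess whether a file is gzip- or bzip2-compressed from its first bytes.

    Hypothetical helper for diagnosis only; returns "gzip", "bzip2", or None.
    """
    with open(path, "rb") as f:
        head = f.read(3)
    # gzip magic number: 0x1f 0x8b
    if head[:2] == b"\x1f\x8b":
        return "gzip"
    # bzip2 magic number: the ASCII characters "BZh"
    if head == b"BZh":
        return "bzip2"
    return None


# Example usage (not executed here):
# print(sniff_compression("bookcorpus.tar.bz2"))
```

If the helper reports a compressed format, the file needs to be decompressed before anything tries to read it as UTF-8 text.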

pennyLuo-hub avatar Nov 26 '24 03:11 pennyLuo-hub

Thank you very much. @pennyLuo-hub

Charlly-D avatar Nov 26 '24 03:11 Charlly-D

Very useful solution!! Thanks @pennyLuo-hub

deadlykitten4 avatar Jan 13 '25 11:01 deadlykitten4

@pennyLuo-hub I get the same error for the dataset used in the PPL evaluation. May I ask where that dataset can be obtained? traindata = load_dataset('ptb_text_only', 'penn_treebank', split='train') valdata = load_dataset('ptb_text_only', 'penn_treebank', split='validation')

fxnie avatar Apr 23 '25 02:04 fxnie