LLM-Pruner: builder_cls is None while loading bookcorpus

At the step traindata = load_dataset('bookcorpus', split='train'), the call builder_cls = get_dataset_builder_class(dataset_module, dataset_name=dataset_name) returns None for builder_cls, so it fails with: builder_instance: DatasetBuilder = builder_cls( TypeError: 'NoneType' object is not callable

Nov 16 '24 01:11 Charlly-D

You can download the data from this link (https://storage.googleapis.com/huggingface-nlp/datasets/bookcorpus/bookcorpus.tar.bz2) and extract it to a folder named "bookcorpus". I solved the issue by doing this, and I hope it helps you as well. @Charlly-D

Nov 25 '24 12:11 pennyLuo-hub
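
The download-and-extract workaround described above could be sketched roughly as follows. This is only a minimal sketch: the helper names `extract_bookcorpus` and `load_local_bookcorpus` are hypothetical, the archive is assumed to have been downloaded already, and loading the extracted files through the generic "text" builder is an assumption rather than something confirmed in this thread.

```python
import tarfile
from pathlib import Path


def extract_bookcorpus(archive_path, dest_dir="bookcorpus"):
    """Extract a .tar.bz2 archive into dest_dir and return the .txt files found there.

    Hypothetical helper: assumes bookcorpus.tar.bz2 was already downloaded
    from the link given above.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:bz2") as tar:
        tar.extractall(dest)
    # Sorted so the order of the returned files is deterministic.
    return sorted(dest.rglob("*.txt"))


def load_local_bookcorpus(txt_files):
    """Load the extracted plain-text files, one example per line.

    Assumption: the generic "text" builder of the `datasets` library is an
    acceptable substitute for the hub loading script.
    """
    from datasets import load_dataset  # imported lazily; needs `datasets` installed

    return load_dataset("text", data_files=[str(p) for p in txt_files], split="train")
```

Usage would look like files = extract_bookcorpus("bookcorpus.tar.bz2") followed by traindata = load_local_bookcorpus(files).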

Oh, thank you very much. May I ask whether you know why it doesn't work? I also find that when a dataset has to be loaded through a .py script, it reports the error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte". @pennyLuo-hub

Nov 26 '24 02:11 Charlly-D

Maybe it's because the file is bookcorpus.tar.bz2 and hasn't been extracted. After extraction, the data consists of books_large_p1.txt and books_large_p2.txt. @Charlly-D

Nov 26 '24 03:11 pennyLuo-hub
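
One way to check whether a file that triggers the UnicodeDecodeError above is still compressed is to look at its first bytes. The sketch below is not part of LLM-Pruner or `datasets`, and `sniff_compression` is a hypothetical helper name. As a hint, a byte 0x8b at position 1 matches the gzip signature (0x1f 0x8b), whereas bzip2 files start with the ASCII characters "BZh", so the file being decoded may be gzip-compressed rather than plain text.

```python
def sniff_compression(path):
    """Guess a file's compression format from its leading magic bytes.

    Returns "gzip", "bzip2", or None if neither signature matches.
    """
    with open(path, "rb") as f:
        head = f.read(3)
    if head[:2] == b"\x1f\x8b":  # gzip signature
        return "gzip"
    if head == b"BZh":  # bzip2 signature
        return "bzip2"
    return None
```

If this returns something other than None for a file that is supposed to be a .py script or a .txt file, the file is a compressed payload and not the text the loader expects.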

Thank you very much. @pennyLuo-hub

Nov 26 '24 03:11 Charlly-D

Very useful solution!! Thanks @pennyLuo-hub

Jan 13 '25 11:01 deadlykitten4

@pennyLuo-hub May I ask where this dataset can be obtained? The same error occurs for it in the PPL evaluation: traindata = load_dataset('ptb_text_only', 'penn_treebank', split='train') valdata = load_dataset('ptb_text_only', 'penn_treebank', split='validation')

Apr 23 '25 02:04 fxnie

builder_cls is None while loading bookcorpus