builder_cls is None when loading bookcorpus
At the step traindata = load_dataset('bookcorpus', split='train'), the call builder_cls = get_dataset_builder_class(dataset_module, dataset_name=dataset_name) returns None for builder_cls, so the following line builder_instance: DatasetBuilder = builder_cls(...) raises TypeError: 'NoneType' object is not callable.
You can download the data from this link (https://storage.googleapis.com/huggingface-nlp/datasets/bookcorpus/bookcorpus.tar.bz2) and extract it to a folder named "bookcorpus". I solved the issue by doing this, and I hope it helps you as well. @Charlly-D
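In case it helps, here is a minimal sketch of that manual download-and-extract step in Python. The URL and the "bookcorpus" folder name come from the comment above; using urllib and tarfile here is just one possible way to do it.

```python
import tarfile
import urllib.request

url = "https://storage.googleapis.com/huggingface-nlp/datasets/bookcorpus/bookcorpus.tar.bz2"
archive_path = "bookcorpus.tar.bz2"

urllib.request.urlretrieve(url, archive_path)      # save the compressed archive locally
with tarfile.open(archive_path, "r:bz2") as tar:   # bookcorpus.tar.bz2 is a bzip2-compressed tar
    tar.extractall("bookcorpus")                   # extract everything into a bookcorpus/ folder
```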
Oh, thank you very much. But may I ask if you know why it doesn't work? I also find that when the dataset has to be loaded through a .py script, it reports "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte". @pennyLuo-hub
Maybe it's because the file is bookcorpus.tar.bz2 and hasn't been extracted. After extraction the data consists of books_large_p1.txt and books_large_p2.txt. @Charlly-D
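For reference, once those two txt files are extracted you can load them with the generic "text" builder instead of the bookcorpus loading script. This is a minimal sketch assuming both files sit under a local bookcorpus/ folder:

```python
from datasets import load_dataset

# Load the extracted plain-text files directly; each line becomes one example
# with a single "text" field, so no dataset-specific builder script is needed.
traindata = load_dataset(
    "text",
    data_files={"train": ["bookcorpus/books_large_p1.txt",
                          "bookcorpus/books_large_p2.txt"]},
    split="train",
)
print(traindata[0]["text"])
```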
Thank you very much. @pennyLuo-hub
Very useful solution!! Thanks @pennyLuo-hub
@pennyLuo-hub I get the same error with this dataset used for PPL evaluation; may I ask where its data can be obtained? traindata = load_dataset('ptb_text_only', 'penn_treebank', split='train') valdata = load_dataset('ptb_text_only', 'penn_treebank', split='validation')