BertWithPretrained

Pretraining on the songci and wiki2 datasets fails with an error; the generated mask cache file wiki_train_mlNone_rs2022_mr15_mtr8_mtur5.pt is only 1k

Open Phil-521 opened this issue 2 years ago • 4 comments

Note: using the MyMultiHeadAttention implementation from the local MyTransformer

[2022-11-27 15:03:35] - INFO: ## Using the weight matrix from the token embedding as the output layer weights! torch.Size([30522, 768])

[2022-11-27 15:03:38] - INFO: Cache file /home/********/博一/my_explore/BERT_learn/BertWithPretrained-main/data/WikiText/wiki_test_mlNone_rs2022_mr15_mtr8_mtur5.pt does not exist; reprocessing and caching!

Reading raw data: 100%|██████████████| 4358/4358 [00:00<00:00, 11122.89it/s]

Building NSP and MLM samples (test): 100%|██| 1847/1847 [00:00<00:00, 1681180.44it/s]

[2022-11-27 15:03:38] - INFO: Cache file /home/********/博一/my_explore/BERT_learn/BertWithPretrained-main/data/WikiText/wiki_train_mlNone_rs2022_mr15_mtr8_mtur5.pt does not exist; reprocessing and caching!

Reading raw data: 100%|████████████| 36718/36718 [00:03<00:00, 11100.30it/s]

Building NSP and MLM samples (train): 100%|█| 15496/15496 [00:00<00:00, 1615704.25it/

Traceback (most recent call last):
  File "TaskForPretraining.py", line 300, in <module>
    train(config)
  File "TaskForPretraining.py", line 105, in train
    val_file_path=config.val_file_path)
  File "../utils/create_pretraining_data.py", line 334, in load_train_val_test_data
    collate_fn=self.generate_batch)
  File "/home/pgrad/.conda/envs/wmc_transformer/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 213, in __init__
    sampler = RandomSampler(dataset)
  File "/home/pgrad/.conda/envs/wmc_transformer/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 94, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0

Phil-521, Nov 27 '22 08:11
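
The `num_samples=0` at the end of the traceback indicates that the dataset handed to the sampler contained no samples, which would also fit the 1k cache file mentioned in the title. A minimal sketch of a guard that fails earlier with a clearer message; `ensure_non_empty` and its arguments are hypothetical names used for illustration, not part of BertWithPretrained or PyTorch:

```python
from typing import Sequence


def ensure_non_empty(samples: Sequence, name: str = "train") -> Sequence:
    """Raise a descriptive error if a data split has no samples.

    Hypothetical helper: call it on the processed samples before they are
    passed to a DataLoader, so the failure points at preprocessing rather
    than at the sampler.
    """
    if len(samples) == 0:
        raise ValueError(
            f"The '{name}' split is empty after preprocessing; "
            "check the sentence separators and the input file."
        )
    return samples


if __name__ == "__main__":
    # A split with one (sentence, next sentence) pair passes through unchanged.
    print(len(ensure_non_empty([("sentence a", "sentence b")])))
    # An empty split is reported before any sampler is constructed.
    try:
        ensure_non_empty([], name="train")
    except ValueError as err:
        print(err)
```

A check like this does not fix the underlying cause; it only moves the error closer to the step that produced the empty split.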

I ran into the same problem. Have you solved it?

haidequanbu, Dec 02 '22 02:12

Solved. See my other issue, issue 15: https://github.com/moon-hotel/BertWithPretrained/issues/15 (copy the link into your browser to open it).

haidequanbu, Dec 03 '22 13:12

You will have to troubleshoot this one yourself; I have not run into it. It looks as though your PyTorch version has an extra num_samples parameter?

moon-hotel, Dec 04 '22 00:12

This is probably a dataset-splitting problem. When running on the wiki dataset, set ModelConfig.seps='.'

yfzsj01, May 19 '23 07:05
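
To illustrate why the separator setting could leave the dataset empty, here is a rough sketch. It rests on two assumptions that the thread does not confirm: that the separator in effect during the failed run never occurs in the English wiki text (the Chinese full stop is used as an example), and that a paragraph needs at least two sentences to yield a next-sentence-prediction pair. `split_sentences` and `count_usable_paragraphs` are hypothetical helpers, not the repository's functions:

```python
from typing import List


def split_sentences(paragraph: str, seps: str) -> List[str]:
    """Split a paragraph after every character that appears in `seps`."""
    sentences, current = [], ""
    for ch in paragraph:
        current += ch
        if ch in seps:
            if current.strip():
                sentences.append(current.strip())
            current = ""
    if current.strip():
        sentences.append(current.strip())
    return sentences


def count_usable_paragraphs(paragraphs: List[str], seps: str) -> int:
    """Count paragraphs with at least two sentences.

    Assumption: shorter paragraphs cannot form a sentence pair and are skipped.
    """
    return sum(1 for p in paragraphs if len(split_sentences(p, seps)) >= 2)


if __name__ == "__main__":
    wiki_like = [
        "The tower was built in 1889 . It is made of iron .",
        "The river flows north . It joins the sea near the port .",
    ]
    print(count_usable_paragraphs(wiki_like, seps="。"))  # 0: separator never matches
    print(count_usable_paragraphs(wiki_like, seps="."))   # 2: both paragraphs split in two
```

Under these assumptions, a mismatched separator leaves every paragraph as a single sentence, so no pairs are built and the sampler sees zero samples. If the small cache file from the failed run is still on disk, it presumably has to be deleted after changing the setting so that the data is processed again, since the log only rebuilds the cache when the file does not exist.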