ValueError: num_samples should be a positive integer value, but got num_samples=0
I'm trying to run unified fine-tuning on bge-m3 following the steps in the README, but it fails with an error.
Error message:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/binbin.zeng/FlagEmbedding/FlagEmbedding/BGE_M3/run.py", line 155, in <module>
    main()
  File "/home/binbin.zeng/FlagEmbedding/FlagEmbedding/BGE_M3/run.py", line 146, in main
    trainer.train()
  File "/home/binbin.zeng/miniforge3/envs/py311/lib/python3.11/site-packages/transformers/trainer.py", line 1948, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/binbin.zeng/miniforge3/envs/py311/lib/python3.11/site-packages/transformers/trainer.py", line 1977, in _inner_training_loop
    train_dataloader = self.get_train_dataloader()
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/binbin.zeng/miniforge3/envs/py311/lib/python3.11/site-packages/transformers/trainer.py", line 915, in get_train_dataloader
    dataloader_params["sampler"] = self._get_train_sampler()
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/binbin.zeng/miniforge3/envs/py311/lib/python3.11/site-packages/transformers/trainer.py", line 885, in _get_train_sampler
    return RandomSampler(self.train_dataset)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/binbin.zeng/miniforge3/envs/py311/lib/python3.11/site-packages/torch/utils/data/sampler.py", line 143, in __init__
    raise ValueError(f"num_samples should be a positive integer value, but got num_samples={self.num_samples}")
ValueError: num_samples should be a positive integer value, but got num_samples=0
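The exception itself is just PyTorch's RandomSampler rejecting a dataset of length zero; a minimal reproduction (using a plain TensorDataset as a stand-in for the training dataset):

```python
import torch
from torch.utils.data import RandomSampler, TensorDataset

# A dataset with zero samples, mimicking what Trainer ends up sampling from
empty = TensorDataset(torch.empty(0, 4))

try:
    RandomSampler(empty)  # what Trainer._get_train_sampler effectively calls
    msg = None
except ValueError as e:
    msg = str(e)

print(msg)  # num_samples should be a positive integer value, but got num_samples=0
```

So the question is not why the sampler raises, but why the dataset it receives reports length 0.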
Launch command:
torchrun --nproc_per_node 8 \
-m FlagEmbedding.BGE_M3.run \
--output_dir model/bge-m3-ft \
--model_name_or_path model/bge-m3 \
--train_data data \
--learning_rate 1e-5 \
--fp16 \
--num_train_epochs 5 \
--per_device_train_batch_size 1 \
--dataloader_drop_last True \
--normlized True \
--temperature 0.02 \
--query_max_len 64 \
--passage_max_len 256 \
--train_group_size 2 \
--negatives_cross_device \
--logging_steps 10 \
--same_task_within_batch True \
--unified_finetuning True \
--use_self_distill True
The data is the toy_data1.jsonl provided by the project, placed in the data directory.
How can I fix this?
While debugging, I found that the train_dataset in run.py looks wrong:
len(train_dataset.dataset) is 10, but train_dataset.batch_datas is empty.
@Zeng-B-B , set --train_data to the full path, ./data/toy_data1.jsonl
I've now solved the problem. I believe there is a bug in FlagEmbedding/BGE_M3/data.py, shown below:
def refresh_epoch(self):
    print(f'---------------------------*Rank {self.process_index}: refresh data---------------------------')
    self.deterministic_generator.shuffle(self.datasets_inxs)
    # Dynamically adjust batch size
    batch_datas = []
    for dataset_inx in self.datasets_inxs:
        self.deterministic_generator.shuffle(self.each_data_inxs[dataset_inx])
        cur_batch_size = self.batch_size_inxs[dataset_inx]*self.num_processes
        flag = self.pqloss_flag[dataset_inx]
        for start_index in range(0, len(self.each_data_inxs[dataset_inx]), cur_batch_size):
            # judge the last batch's length
            if len(self.each_data_inxs[dataset_inx]) - start_index < 2 * self.num_processes:  # this one
                break
            batch_datas.append((self.each_data_inxs[dataset_inx][start_index:start_index+cur_batch_size], flag))
    self.deterministic_generator.shuffle(batch_datas)
    self.batch_datas = batch_datas
    self.step = 0
Changing the marked line, if len(self.each_data_inxs[dataset_inx]) - start_index < 2 * self.num_processes:, to if len(self.each_data_inxs[dataset_inx]) - start_index < cur_batch_size: makes the problem go away.
Explanation: the trailing samples should be dropped only when the number of remaining samples is smaller than the total batch size, where total_batch_size = per_device_train_batch_size * self.num_processes.
My error occurred because the code hard-codes the threshold as 2 * self.num_processes, while my actual total_batch_size is 1 * self.num_processes (per_device_train_batch_size is 1 on each of my 8 GPUs). With the 10-sample toy dataset, 10 - 0 < 2 * 8 is already true on the first iteration, so the inner loop breaks immediately and train_dataset.batch_datas stays empty.
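The effect of the two thresholds can be checked with a simplified sketch of the batching loop (build_batches is a hypothetical helper, not part of FlagEmbedding; shuffling is omitted since it doesn't affect the count):

```python
def build_batches(n_samples, per_device_bs, num_processes, threshold):
    """Simplified model of refresh_epoch's inner loop: slice the index list
    into batches of cur_batch_size, dropping a trailing batch whose
    remaining-sample count falls below `threshold`."""
    idxs = list(range(n_samples))
    cur_batch_size = per_device_bs * num_processes
    batches = []
    for start in range(0, n_samples, cur_batch_size):
        if n_samples - start < threshold(cur_batch_size, num_processes):
            break  # drop the incomplete trailing batch
        batches.append(idxs[start:start + cur_batch_size])
    return batches

# Original threshold: hard-coded 2 * num_processes (10 < 16 on the first pass)
old = build_batches(10, 1, 8, lambda bs, nproc: 2 * nproc)
# Fixed threshold: the actual total batch size (10 >= 8, so one batch survives)
new = build_batches(10, 1, 8, lambda bs, nproc: bs)
print(len(old), len(new))  # 0 1
```

With 10 samples, per_device_train_batch_size=1, and 8 processes, the original condition drops every batch, while the fixed condition keeps one full batch of 8 and correctly drops only the 2 leftover samples.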