Multiple training errors in the pre-training code
Hi, I found several errors in the pre-training code (the file run.sh) and the corresponding code. I have mentioned one of them in the pull request. Furthermore, it seems that $PATH_TO_DATA_DICT in the shell script should be set to the actual data path.
After correcting the path and file name, I found another error in the training stage:
```
=41667/41667=Iterations/Batches
Iteration:   0%|          | 0/41667 [00:00<?, ?it/s]Finish Epoch: 0
Iteration:   0%|          | 0/41667 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/gpfs/radev/scratch/ying_rex/tl688/dnaberts/DNABERT_S/train/pretrain/main.py", line 85, in <module>
    run(args)
  File "/gpfs/radev/scratch/ying_rex/tl688/dnaberts/DNABERT_S/train/pretrain/main.py", line 44, in run
    trainer.val()
  File "/gpfs/radev/scratch/ying_rex/tl688/dnaberts/DNABERT_S/train/pretrain/training.py", line 189, in val
    self.model.module.dnabert2.load_state_dict(torch.load(load_dir+'/pytorch_model.bin'))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/radev/project/ying_rex/tl688/llm/lib/python3.11/site-packages/torch/serialization.py", line 998, in load
    with _open_file_like(f, 'rb') as opened_file:
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/radev/project/ying_rex/tl688/llm/lib/python3.11/site-packages/torch/serialization.py", line 445, in _open_file_like
    return _open_file(name_or_buffer, mode)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/radev/project/ying_rex/tl688/llm/lib/python3.11/site-packages/torch/serialization.py", line 426, in __init__
    super().__init__(open(name, mode))
                     ^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './results/epoch1.train_2w.csv.lr3e-06.lrscale100.bs48.maxlength2000.tmp0.05.seed1.con_methodsame_species.mixTrue.mix_layer_num-1.curriculumTrue/10000/pytorch_model.bin'
```
Would you please share your thoughts about how to address it? Thanks.
Hi @HelloWorldLTY, I also encountered the same problem after finishing the first epoch, and I am still waiting for an answer.
It looks like the code never actually saves pytorch_model.bin at that checkpoint step, but tries to load it directly in val().
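As a stopgap while waiting for an upstream fix, one can guard the load call so a missing checkpoint is skipped instead of crashing the run. This is only a sketch: `safe_load_checkpoint` is a hypothetical helper, not part of the DNABERT_S repository, and skipping the load means validation runs with the in-memory weights rather than the saved checkpoint.

```python
import os

def safe_load_checkpoint(load_dir, filename="pytorch_model.bin"):
    """Return the checkpoint path if it exists, else None.

    Hypothetical helper for training.py's val(): avoids the
    FileNotFoundError when the checkpoint was never written.
    """
    ckpt = os.path.join(load_dir, filename)
    if not os.path.isfile(ckpt):
        print(f"[warn] checkpoint not found, skipping load: {ckpt}")
        return None
    return ckpt

# In training.py's val(), the unguarded call
#   self.model.module.dnabert2.load_state_dict(torch.load(load_dir + '/pytorch_model.bin'))
# could then become:
#   ckpt = safe_load_checkpoint(load_dir)
#   if ckpt is not None:
#       self.model.module.dnabert2.load_state_dict(torch.load(ckpt))
```

Alternatively, making the training loop call `torch.save(model.state_dict(), ...)` before `val()` runs would fix the root cause, since the error shows the load happens before any save.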
Hi, I finally dropped DNABERT-S and switched to DNABERT-2, which seems more feasible.