FlagEmbedding

Problem when saving the model

Open passionate11 opened this issue 11 months ago • 1 comment

Traceback (most recent call last):
  File "/opt/tiger/FlagEmbedding/FlagEmbedding/baai_general_embedding/finetune/run.py", line 114, in <module>
    main()
  File "/opt/tiger/FlagEmbedding/FlagEmbedding/baai_general_embedding/finetune/run.py", line 103, in main
    trainer.train()
  File "/usr/local/lib/python3.9/dist-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.9/dist-packages/transformers/trainer.py", line 2049, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/usr/local/lib/python3.9/dist-packages/transformers/trainer.py", line 2423, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/usr/local/lib/python3.9/dist-packages/transformers/trainer.py", line 2555, in _save_checkpoint
    shutil.rmtree(staging_output_dir)
  File "/usr/lib/python3.9/shutil.py", line 722, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python3.9/shutil.py", line 720, in rmtree
    os.rmdir(path)
FileNotFoundError: [Errno 2] No such file or directory: '/checkpoints_tmp/tmp-checkpoint-1'

My actual model save path is /checkpoints_tmp/checkpoint-1. Saving works fine when I train on a single node, but this error appears as soon as I switch to multiple nodes. How can I fix this? The exact command I run is:

torchrun --nproc_per_node --nnodes
--node_rank=--master_addr
--master_port $PORT
run.py
--output_dir $model_output_dir
--model_name_or_path $model_path
--train_data $data_dir
--learning_rate 1e-5
--save_strategy epoch
--per_device_train_batch_size $per_device_train_batch_size
--fp16
--num_train_epochs $num_train_epochs
--dataloader_drop_last True
--normlized True
--temperature 0.02
--query_max_len 64
--passage_max_len 64
--train_group_size $train_group_size
--negatives_cross_device
--logging_steps 10

passionate11 • Mar 20 '24 10:03

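This may be because you do not have permission to create folders in the root directory. Try saving somewhere else, for example under the current directory: ./checkpoints_tmp/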

staoxiao • Mar 20 '24 14:03
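
The permission hypothesis in the reply above can be checked on each node before launching training. The following is only a minimal sketch and not part of FlagEmbedding or transformers: the helper name `can_write_checkpoints` is hypothetical, and it assumes that being able to create the directory and a throwaway file inside it is a good enough proxy for what the trainer needs when it writes a checkpoint. Note that it creates the directory if it is missing.

```python
import os
import tempfile


def can_write_checkpoints(output_dir: str) -> bool:
    """Return True if this process can create output_dir and write inside it.

    Hypothetical helper for illustration; it creates output_dir if missing.
    """
    try:
        # Fails with PermissionError (an OSError) if the parent is not writable.
        os.makedirs(output_dir, exist_ok=True)
        # Creating a throwaway file proves write access, not just existence.
        with tempfile.TemporaryFile(dir=output_dir):
            pass
        return True
    except OSError:
        return False
```

Running `can_write_checkpoints("/checkpoints_tmp")` and `can_write_checkpoints("./checkpoints_tmp")` on every node would show whether the root-level path is usable everywhere. If it returns True on all nodes and the error persists, the cause is probably something other than permissions.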