[Bug]: When continuing ERNIE 3.0 pretraining on custom data and resuming after an interrupted run, save_steps reports that vocab.txt cannot be found
Software environment
- paddlepaddle:
- paddlepaddle-gpu: 2.3.2
- paddlenlp: 2.4.2
Using the latest develop version, in an AI Studio environment.
Duplicate check
- [X] I have searched the existing issues
Bug description
vocab.txt goes missing when saving at save_steps, so the save fails.
Steps to reproduce & code
Resume training command: I resumed training from step 13000, and at step 15000 the save failed because vocab.txt could not be found. I checked model_last and the file was indeed missing, so I copied a vocab.txt over and resumed again from step 14000. The save at step 15000 then worked, but the same problem happened again at the next save at step 20000 (a scripted version of that manual copy is sketched after the command below).
```bash
!python3 -u -m paddle.distributed.launch \
--gpus "0,1,2,3" \
--log_dir "/home/aistudio/work" \
/home/aistudio/PaddleNLP-develop/model_zoo/ernie-1.0/run_pretrain.py \
--model_type "ernie" \
--model_name_or_path "/home/aistudio/work/out/model_last" \
--tokenizer_name_or_path "/home/aistudio/work/out/model_last" \
--continue_training True \
--input_dir "/home/aistudio/work/data/" \
--output_dir "/home/aistudio/work/out" \
--split 800,100,100 \
--max_seq_len 512 \
--binary_head true \
--micro_batch_size 32 \
--max_lr 0.0001 \
--min_lr 0.00001 \
--max_steps 200000 \
--save_steps 5000 \
--checkpoint_steps 1000 \
--decay_steps 190000 \
--weight_decay 0.01 \
--warmup_rate 0.01 \
--grad_clip 1.0 \
--logging_freq 100 \
--num_workers 3 \
--eval_freq 1000 \
--device "gpu" \
--share_folder true \
--hidden_dropout_prob 0.1 \
--attention_probs_dropout_prob 0.1 \
--seed 222
```
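For reference, the manual copy I describe above can be scripted before resuming. This is only a rough sketch, assuming the run was originally started from the ernie-3.0-base-zh tokenizer (adjust the name and path if that assumption does not match your setup):

```python
# Hedged workaround sketch: regenerate the tokenizer resource files that went
# missing from the resume directory, so that the next tokenizer.save_pretrained()
# call at a save_steps boundary can find vocab.txt again.
from paddlenlp.transformers import AutoTokenizer

BASE_TOKENIZER = "ernie-3.0-base-zh"               # assumed base tokenizer
RESUME_DIR = "/home/aistudio/work/out/model_last"  # directory from this report

tokenizer = AutoTokenizer.from_pretrained(BASE_TOKENIZER)
tokenizer.save_pretrained(RESUME_DIR)  # writes vocab.txt, tokenizer_config.json, ...
```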
Exception information
```
[2022-11-04 02:01:12,119] [ DEBUG] - saving models to /home/aistudio/work/out/model_20000
[2022-11-04 02:01:12,120] [ INFO] - tokenizer config file saved in /home/aistudio/work/out/model_20000/tokenizer_config.json
[2022-11-04 02:01:12,120] [ INFO] - Special tokens file saved in /home/aistudio/work/out/model_20000/special_tokens_map.json
[2022-11-04 02:01:12,130] [ INFO] - added tokens file saved in /home/aistudio/work/out/model_20000/added_tokens.json
Traceback (most recent call last):
  File "/home/aistudio/PaddleNLP-develop/model_zoo/ernie-1.0/run_pretrain.py", line 790, in <module>
    do_train(config)
  File "/home/aistudio/PaddleNLP-develop/model_zoo/ernie-1.0/run_pretrain.py", line 749, in do_train
    args, global_step)
  File "/home/aistudio/PaddleNLP-develop/model_zoo/ernie-1.0/run_pretrain.py", line 719, in save_ckpt
    tokenizer.save_pretrained(output_dir)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 1796, in save_pretrained
    filename_prefix=filename_prefix,
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 1821, in _save_pretrained
    self.save_resources(save_directory)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 1837, in save_resources
    copyfile(src_path, dst_path)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/home/aistudio/work/out/model_last/vocab.txt'
INFO 2022-11-04 02:01:21,760 launch_utils.py:322] terminate process group gid:16691
INFO 2022-11-04 02:01:21,760 launch_utils.py:322] terminate process group gid:16691
INFO 2022-11-04 02:01:21,761 launch_utils.py:322] terminate process group gid:16696
```
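From the traceback, save_resources() appears to copy vocab.txt from the directory the tokenizer was loaded from (model_last here), which is why the save at step 20000 fails once that file has gone missing again. A small illustrative check, using the paths from this run, for the condition that breaks the save:

```python
# Illustrative pre-save check mirroring the copyfile() call that fails above.
import os

resume_dir = "/home/aistudio/work/out/model_last"  # --tokenizer_name_or_path
vocab_path = os.path.join(resume_dir, "vocab.txt")

if not os.path.isfile(vocab_path):
    # Exactly the condition that makes save_resources() raise
    # FileNotFoundError at the next save_steps save.
    print(f"missing: {vocab_path}")
```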
I wonder, do save_steps and checkpoint_steps need to be set to the same value?
--model_name_or_path "/home/aistudio/work/out/model_last" \
--tokenizer_name_or_path "/home/aistudio/work/out/model_last" \

For continued training, these two parameters do not need to be changed to model_last; just keep them set to "ernie-3.0-base-zh".
This issue is stale because it has been open for 60 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.