PaddleNLP

[Bug]: When continuing to pretrain ernie3 on custom data, after resuming an interrupted run, saving at save_steps fails because vocab.txt cannot be found

Open zouhan6806504 opened this issue 2 years ago • 1 comments

Software environment

- paddlepaddle:
- paddlepaddle-gpu: 2.3.2
- paddlenlp: 2.4.2
Using the latest develop version, in an AI Studio environment.

Duplicate issues

  • [X] I have searched the existing issues

Bug description

When the model is saved at save_steps, vocab.txt is missing.

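Steps to reproduce & code

Below is the command used to resume training. I resumed from step 13000, and the save at step 15000 failed because vocab.txt could not be found. I checked model_last and the file was missing there, so I copied one in and resumed again from step 14000. The save at step 15000 then succeeded, but the same problem occurred at the next save, step 20000.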

!python3 -u  -m paddle.distributed.launch \
    --gpus "0,1,2,3" \
    --log_dir "/home/aistudio/work" \
    /home/aistudio/PaddleNLP-develop/model_zoo/ernie-1.0/run_pretrain.py \
    --model_type "ernie" \
    --model_name_or_path "/home/aistudio/work/out/model_last" \
    --tokenizer_name_or_path "/home/aistudio/work/out/model_last" \
    --continue_training True \
    --input_dir "/home/aistudio/work/data/" \
    --output_dir "/home/aistudio/work/out" \
    --split 800,100,100 \
    --max_seq_len 512 \
    --binary_head true \
    --micro_batch_size 32 \
    --max_lr 0.0001 \
    --min_lr 0.00001 \
    --max_steps 200000 \
    --save_steps 5000 \
    --checkpoint_steps 1000 \
    --decay_steps 190000 \
    --weight_decay 0.01 \
    --warmup_rate 0.01 \
    --grad_clip 1.0 \
    --logging_freq 100 \
    --num_workers 3 \
    --eval_freq 1000 \
    --device "gpu" \
    --share_folder true \
    --hidden_dropout_prob 0.1 \
    --attention_probs_dropout_prob 0.1 \
    --seed 222 

Error output

[2022-11-04 02:01:12,119] [   DEBUG] - saving models to /home/aistudio/work/out/model_20000
[2022-11-04 02:01:12,120] [    INFO] - tokenizer config file saved in /home/aistudio/work/out/model_20000/tokenizer_config.json
[2022-11-04 02:01:12,120] [    INFO] - Special tokens file saved in /home/aistudio/work/out/model_20000/special_tokens_map.json
[2022-11-04 02:01:12,130] [    INFO] - added tokens file saved in /home/aistudio/work/out/model_20000/added_tokens.json
Traceback (most recent call last):
  File "/home/aistudio/PaddleNLP-develop/model_zoo/ernie-1.0/run_pretrain.py", line 790, in <module>
    do_train(config)
  File "/home/aistudio/PaddleNLP-develop/model_zoo/ernie-1.0/run_pretrain.py", line 749, in do_train
    args, global_step)
  File "/home/aistudio/PaddleNLP-develop/model_zoo/ernie-1.0/run_pretrain.py", line 719, in save_ckpt
    tokenizer.save_pretrained(output_dir)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 1796, in save_pretrained
    filename_prefix=filename_prefix,
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 1821, in _save_pretrained
    self.save_resources(save_directory)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 1837, in save_resources
    copyfile(src_path, dst_path)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/home/aistudio/work/out/model_last/vocab.txt'
INFO 2022-11-04 02:01:21,760 launch_utils.py:322] terminate process group gid:16691
INFO 2022-11-04 02:01:21,760 launch_utils.py:322] terminate process group gid:16691
INFO 2022-11-04 02:01:21,761 launch_utils.py:322] terminate process group gid:16696

Could it be that save_steps and checkpoint_steps need to be set to the same value?
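The traceback shows save_resources calling copyfile with a source path inside model_last, so the tokenizer appears to copy vocab.txt from the directory it was loaded from each time it saves. The following is a minimal sketch of that failure mode in plain Python. ToyTokenizer and reproduce are hypothetical names, not PaddleNLP code, and the step where model_last is rewritten without vocab.txt is an assumption about what checkpointing might do, not something the log confirms.

```python
import os
import shutil
import tempfile


class ToyTokenizer:
    """Hypothetical stand-in: remembers the path its vocab was loaded from."""

    def __init__(self, vocab_path):
        # Only the path is kept; the file is read again at save time.
        self.vocab_path = vocab_path

    def save_resources(self, save_directory):
        os.makedirs(save_directory, exist_ok=True)
        dst_path = os.path.join(save_directory, "vocab.txt")
        # Same kind of call that fails in the traceback: the source must still exist.
        shutil.copyfile(self.vocab_path, dst_path)


def reproduce(root):
    """Return the missing source path if saving fails, otherwise None."""
    model_last = os.path.join(root, "model_last")
    os.makedirs(model_last)
    with open(os.path.join(model_last, "vocab.txt"), "w") as f:
        f.write("[PAD]\n[UNK]\n")

    tokenizer = ToyTokenizer(os.path.join(model_last, "vocab.txt"))

    # Assumption: a later checkpoint rewrites model_last without vocab.txt.
    shutil.rmtree(model_last)
    os.makedirs(model_last)

    try:
        tokenizer.save_resources(os.path.join(root, "model_20000"))
    except FileNotFoundError as err:
        return err.filename
    return None


with tempfile.TemporaryDirectory() as root:
    print("missing source file:", reproduce(root))
```

If this model is right, the error depends on whether the source directory still holds vocab.txt at save time, rather than on save_steps and checkpoint_steps being equal.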

zouhan6806504 avatar Nov 03 '22 18:11 zouhan6806504

    --model_name_or_path "/home/aistudio/work/out/model_last" \
    --tokenizer_name_or_path "/home/aistudio/work/out/model_last" \

To continue training, these two parameters do not need to be changed to model_last. Just keep them set to ernie-3.0-base-zh.
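For anyone who still needs to point the tokenizer at a local checkpoint directory, the manual fix described in the report (copying vocab.txt back into model_last) could be scripted as a check run before resuming. This is only a sketch: restore_missing_files is a hypothetical helper, the backup directory is an assumed location holding a known-good copy, and it patches the symptom rather than replacing the advice above.

```python
import os
import shutil


def restore_missing_files(checkpoint_dir, backup_dir, required=("vocab.txt",)):
    """Copy any required file absent from checkpoint_dir out of backup_dir.

    Returns the list of file names that were restored.
    """
    restored = []
    for name in required:
        target = os.path.join(checkpoint_dir, name)
        if os.path.exists(target):
            continue  # already present, nothing to do
        source = os.path.join(backup_dir, name)
        if not os.path.exists(source):
            raise FileNotFoundError(
                f"{name} is missing from both {checkpoint_dir} and {backup_dir}"
            )
        shutil.copyfile(source, target)
        restored.append(name)
    return restored
```

A call such as restore_missing_files("/home/aistudio/work/out/model_last", backup_dir) would run before the launch command, where backup_dir is whichever directory holds the original tokenizer files.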

ZHUI avatar Nov 04 '22 05:11 ZHUI

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] avatar Jan 04 '23 00:01 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Jan 18 '23 00:01 github-actions[bot]