Missing keys in checkpoint when resuming training from last_checkpoint
Describe the problem in detail
Please describe the problem as specifically as possible; this helps us locate it faster.
Screenshots or logs
(If necessary) Please provide text logs or screenshots so we can better understand the details of the problem.
Required checks
- [x] Which model: LLaMA / Alpaca
- [x] Issue type:
  - Other
- [x] Since the related dependencies are updated frequently, I have followed the relevant steps in the Wiki
- [x] I have read the FAQ section and searched existing issues, and found no similar problem or solution
When I run the pre-training script with the overwrite_output_dir flag removed, so that training automatically resumes from the last checkpoint, the following warning appears:

[WARNING|trainer.py:2247] 2023-05-12 12:33:29,228 >> There were missing keys in the checkpoint model loaded: ......

In some other scripts I have seen, the resume logic manually calls set_peft_model_state_dict. Does a similar change need to be made here? (The launch command is below; essentially the only change is removing the overwrite_output_dir flag.)
torchrun \
--nnodes 1 \
--nproc_per_node $num_gpus \
run_clm_pt_with_peft.py \
--preprocessing_num_workers 256 \
--model_name_or_path ${pretrained_model} \
--tokenizer_name_or_path ${chinese_tokenizer_path} \
--dataset_dir ${data_dir} \
--data_cache_dir ${data_cache_dir} \
--validation_split_percentage 0.001 \
--per_device_train_batch_size ${per_device_batch_size} \
--per_device_eval_batch_size ${per_device_batch_size} \
--do_train \
--seed $RANDOM \
--fp16 \
--max_steps ${training_steps} \
--lr_scheduler_type cosine \
--learning_rate ${lr} \
--warmup_ratio 0.05 \
--weight_decay 0.01 \
--logging_strategy steps \
--logging_steps 10 \
--save_strategy steps \
--save_total_limit 3 \
--save_steps 50 \
--gradient_accumulation_steps ${gradient_accumulation_steps} \
--block_size 512 \
--output_dir ${output_dir} \
--ddp_timeout 30000 \
--logging_first_step True \
--lora_rank ${lora_rank} \
--trainable ${lora_trainable} \
--modules_to_save ${modules_to_save} \
--lora_dropout ${lora_dropout} \
--torch_dtype float16 \
--report_to wandb
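For reference, the manual resume pattern mentioned above looks roughly like this. This is a sketch, not the repo's actual code: the function name `resume_lora_adapter` is hypothetical, and it assumes `torch` and `peft` are installed and that the checkpoint directory contains the standard `adapter_model.bin` file.

```python
# Sketch of the manual LoRA resume pattern seen in some other training
# scripts (hypothetical helper; requires torch and peft to actually run).

def resume_lora_adapter(model, checkpoint_dir):
    """Load only the saved LoRA adapter weights back into a PEFT-wrapped model."""
    import os
    import torch
    from peft import set_peft_model_state_dict

    # PEFT checkpoints conventionally store the adapter in adapter_model.bin.
    adapter_path = os.path.join(checkpoint_dir, "adapter_model.bin")
    adapter_weights = torch.load(adapter_path, map_location="cpu")

    # Copy the LoRA tensors into the wrapped model; the frozen base-model
    # weights are left untouched, so no strict-loading error is raised.
    set_peft_model_state_dict(model, adapter_weights)
    return model
```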
This is not a problem. The saved checkpoint only contains the LoRA weights, so those other keys are not in it and are expected to be reported as missing.
Using the pre-training script with the overwrite_output_dir flag removed, loading the checkpoint-100 folder saved in the output directory raises RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
Hi @chensongcan @puppyapple, have you fixed it? I get this error:
File "/home/tupk/anaconda3/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
Missing key(s) in state_dict: "base_model.model.model.embed_tokens.weight",