[QA] Can InternEvo load pretrained llama2 weights?
Describe the question

Can InternEvo load pretrained llama2 weights and then continue pretraining from them? Should the checkpoint be in HF format or the original (Meta) format?
```
Traceback (most recent call last):
  File "/root/dataDisk/internlm/openpoet/train.py", line 335, in <module>
    main(args)
  File "/root/dataDisk/internlm/openpoet/train.py", line 149, in main
    ckpt_manager.try_resume_training(train_state, current_time)
  File "/root/dataDisk/internlm/internlm/checkpoint/checkpoint_manager.py", line 551, in try_resume_training
    load_content_str = load_func(self, self.load_ckpt_info, train_state)
  File "/root/dataDisk/internlm/internlm/checkpoint/checkpoint_manager.py", line 208, in try_load_internlm_ckpt_func
    func(folder=load_ckpt_folder, model=ckpt_mm.model)
  File "/root/dataDisk/internlm/internlm/checkpoint/load_funcs.py", line 179, in load_hf_llama_pretrained_weights
    missing_keys, unexpected_keys = model.load_state_dict(current_states, strict=False)
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Llama2:
    size mismatch for layers.0.feed_forward.w2.weight: copying a param with shape torch.Size([11008, 4096]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
```
@zigzagcai could you take a look?
Yes, it's supported.

- The logic for loading pretrained llama2 weights is here: https://github.com/InternLM/InternEvo/blob/develop/internlm/checkpoint/load_funcs.py#L74-L187
- Loading pretrained llama2 weights uses the HF format (a minimal config sketch follows below).
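For reference, this is roughly what the checkpoint section of a training config looks like. It's a minimal sketch, not the definitive setup: the ckpt_type value "hf_llama" is assumed to be the key registered for load_hf_llama_pretrained_weights in LOAD_FUNC_DICT, and the paths are placeholders; check internlm/checkpoint/load_funcs.py in your InternEvo version for the exact key name.

```python
# Sketch of the ckpt section of an InternEvo training config (a plain Python file).
# Assumption: ckpt_type="hf_llama" selects load_hf_llama_pretrained_weights;
# verify against LOAD_FUNC_DICT in your version of load_funcs.py.
LOAD_CKPT_FOLDER = "local:/path/to/Llama-2-7b-hf"  # directory holding the HF-format shards

ckpt = dict(
    enable_save_ckpt=True,
    save_ckpt_folder="local:/path/to/save_ckpts",
    # content=("model",) loads model weights only, so training continues
    # from the llama2 weights rather than resuming a previous InternEvo run.
    load_ckpt_info=dict(path=LOAD_CKPT_FOLDER, content=("model",), ckpt_type="hf_llama"),
    checkpoint_every=1000,
)
```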
I can reproduce this error; it is fixed in this PR: https://github.com/InternLM/InternEvo/pull/276

The root cause of this bug is that in early versions of InternEvo's LLaMA implementation, the ffn w2 and w3 layers were swapped relative to Meta's released LLaMA. After the model was later aligned with Meta's LLaMA, the load func was not updated in sync, so it still assigned an up-projection-shaped weight ([11008, 4096]) to w2, which in the aligned model is the down projection ([4096, 11008]); hence the size mismatch in the traceback above.
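A simplified sketch of the kind of key remapping involved. The weight names mirror the HF LLaMA and InternEvo conventions, but the helper itself is hypothetical for illustration, not the actual code from PR #276:

```python
# Hypothetical helper illustrating the corrected HF -> InternEvo FFN key mapping.
# After aligning with Meta's LLaMA: w1 = gate proj, w2 = down proj, w3 = up proj.
# The old load func effectively mapped the up projection onto w2 (the
# pre-alignment convention), producing the [11008, 4096] vs [4096, 11008]
# size mismatch shown in the traceback.
FFN_KEY_MAP = {
    "mlp.gate_proj": "feed_forward.w1",  # weight shape [11008, 4096]
    "mlp.down_proj": "feed_forward.w2",  # weight shape [4096, 11008]
    "mlp.up_proj": "feed_forward.w3",    # weight shape [11008, 4096]
}

def remap_ffn_keys(hf_state_dict):
    """Rename HF LLaMA FFN keys to InternEvo's post-alignment names."""
    remapped = {}
    for key, tensor in hf_state_dict.items():
        for hf_name, internevo_name in FFN_KEY_MAP.items():
            if hf_name in key:
                key = key.replace(hf_name, internevo_name)
                break
        remapped[key] = tensor
    return remapped
```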