[QA] Can InternEvo load pretrained llama2 weights?
Describe the question

Can InternEvo load pretrained llama2 weights and then continue pretraining from them? Should the checkpoint be in HF format or the original (Meta) format?
```
Traceback (most recent call last):
  File "/root/dataDisk/internlm/openpoet/train.py", line 335, in <module>
    main(args)
  File "/root/dataDisk/internlm/openpoet/train.py", line 149, in main
    ckpt_manager.try_resume_training(train_state, current_time)
  File "/root/dataDisk/internlm/internlm/checkpoint/checkpoint_manager.py", line 551, in try_resume_training
    load_content_str = load_func(self, self.load_ckpt_info, train_state)
  File "/root/dataDisk/internlm/internlm/checkpoint/checkpoint_manager.py", line 208, in try_load_internlm_ckpt_func
    func(folder=load_ckpt_folder, model=ckpt_mm.model)
  File "/root/dataDisk/internlm/internlm/checkpoint/load_funcs.py", line 179, in load_hf_llama_pretrained_weights
    missing_keys, unexpected_keys = model.load_state_dict(current_states, strict=False)
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Llama2:
    size mismatch for layers.0.feed_forward.w2.weight: copying a param with shape torch.Size([11008, 4096]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
```
@zigzagcai could you take a look?
Yes, it's supported.

- The logic for loading pretrained llama2 weights is here: https://github.com/InternLM/InternEvo/blob/develop/internlm/checkpoint/load_funcs.py#L74-L187
- Loading pretrained llama2 weights uses the HF format (a minimal config sketch follows below).
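For reference, this is roughly what the checkpoint section of a training config looks like. It's a minimal sketch, not the definitive setup: the ckpt_type value "hf_llama" is assumed to be the key registered for load_hf_llama_pretrained_weights in LOAD_FUNC_DICT, and the paths are placeholders; check internlm/checkpoint/load_funcs.py in your InternEvo version for the exact key name.

```python
# Sketch of the ckpt section of an InternEvo training config (a plain Python file).
# Assumption: ckpt_type="hf_llama" selects load_hf_llama_pretrained_weights;
# verify against LOAD_FUNC_DICT in your version of load_funcs.py.
LOAD_CKPT_FOLDER = "local:/path/to/Llama-2-7b-hf"  # directory holding the HF-format shards

ckpt = dict(
    enable_save_ckpt=True,
    save_ckpt_folder="local:/path/to/save_ckpts",
    # content=("model",) loads model weights only, so training continues
    # from the llama2 weights rather than resuming a previous InternEvo run.
    load_ckpt_info=dict(path=LOAD_CKPT_FOLDER, content=("model",), ckpt_type="hf_llama"),
    checkpoint_every=1000,
)
```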
I can reproduce this error; it is fixed in this PR: https://github.com/InternLM/InternEvo/pull/276

The root cause of this bug is that in early versions of InternEvo's LLaMA implementation, the ffn w2 and w3 layers were swapped relative to Meta's released LLaMA. After the model was later aligned with Meta's LLaMA, the load func was not updated in sync, so it still assigned an up-projection-shaped weight ([11008, 4096]) to w2, which in the aligned model is the down projection ([4096, 11008]); hence the size mismatch in the traceback above.
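A simplified sketch of the kind of key remapping involved. The weight names mirror the HF LLaMA and InternEvo conventions, but the helper itself is hypothetical for illustration, not the actual code from PR #276:

```python
# Hypothetical helper illustrating the corrected HF -> InternEvo FFN key mapping.
# After aligning with Meta's LLaMA: w1 = gate proj, w2 = down proj, w3 = up proj.
# The old load func effectively mapped the up projection onto w2 (the
# pre-alignment convention), producing the [11008, 4096] vs [4096, 11008]
# size mismatch shown in the traceback.
FFN_KEY_MAP = {
    "mlp.gate_proj": "feed_forward.w1",  # weight shape [11008, 4096]
    "mlp.down_proj": "feed_forward.w2",  # weight shape [4096, 11008]
    "mlp.up_proj": "feed_forward.w3",    # weight shape [11008, 4096]
}

def remap_ffn_keys(hf_state_dict):
    """Rename HF LLaMA FFN keys to InternEvo's post-alignment names."""
    remapped = {}
    for key, tensor in hf_state_dict.items():
        for hf_name, internevo_name in FFN_KEY_MAP.items():
            if hf_name in key:
                key = key.replace(hf_name, internevo_name)
                break
        remapped[key] = tensor
    return remapped
```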