
[BUG]: inference.py error

Open qq31682216 opened this issue 2 years ago • 4 comments

🐛 Describe the bug

I ran the example as documented and inference fails with a state-dict loading error.

Command:

```
python inference.py --model_path ./actor_checkpoint_prompts.pt --pretrain bigscience/bloom-560m --model bloom
```

Output:

```
Traceback (most recent call last):
  File "inference.py", line 59, in <module>
    eval(args)
  File "inference.py", line 23, in eval
    actor.model.load_state_dict(state_dict)
  File "/home/sre/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for BloomForCausalLM:
	Missing key(s) in state_dict: "transformer.word_embeddings.weight", "transformer.word_embeddings_layernorm.weight", "transformer.word_embeddings_layernorm.bias" ...
```

Environment

```
$ nvcc -V
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
```

qq31682216 avatar Mar 09 '23 10:03 qq31682216

Example: https://github.com/hpcaitech/ColossalAI/tree/main/applications/ChatGPT/examples

qq31682216 avatar Mar 09 '23 10:03 qq31682216

Try adding strict=False to this line.
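The suggestion can be sketched as follows (a minimal standalone example, not the actual ChatGPT `inference.py`; the tiny `nn.Linear` model and its state dict are made up for illustration). With `strict=False`, PyTorch tolerates missing and unexpected keys and reports them in the returned named tuple instead of raising:

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for BloomForCausalLM.
model = nn.Linear(4, 2)

# Checkpoint that deliberately omits "bias" and carries an extra key.
state_dict = {"weight": torch.zeros(2, 4), "extra.key": torch.zeros(1)}

# strict=False: missing/unexpected keys are tolerated and reported
# instead of triggering a RuntimeError.
result = model.load_state_dict(state_dict, strict=False)
print(result.missing_keys)     # ['bias']
print(result.unexpected_keys)  # ['extra.key']
```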

JThh avatar Mar 10 '23 03:03 JThh

The referenced line cannot take a `strict=False` argument; I checked the source and it indeed does not support it. I changed line 23 to `actor.model.load_state_dict(state_dict, strict=False)` and it still fails:

```
Traceback (most recent call last):
  File "inference.py", line 60, in <module>
    eval(args)
  File "inference.py", line 23, in eval
    actor.model.load_state_dict(state_dict, strict=False)
  File "/home/sre/anaconda3/envs/gpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for BloomForCausalLM:
	size mismatch for transformer.ln_f.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for transformer.ln_f.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for lm_head.weight: copying a param with shape torch.Size([50257, 768]) from checkpoint, the shape in current model is torch.Size([250880, 1024]).
```
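Worth noting: `strict=False` only tolerates missing or unexpected keys; `load_state_dict` still raises when a key exists in both sides but the shapes differ, which is exactly what this second traceback shows. The shapes in the error suggest the checkpoint was saved from a smaller model than bigscience/bloom-560m. A quick way to list such conflicts before loading is sketched below (illustrative, not from the repository; plain tuples copied from the traceback stand in for `tensor.shape`, so the sketch runs without torch):

```python
# Shapes copied from the traceback above; in real code you would use
# checkpoint[k].shape and model.state_dict()[k].shape instead.
checkpoint_shapes = {
    "transformer.ln_f.weight": (768,),
    "transformer.ln_f.bias": (768,),
    "lm_head.weight": (50257, 768),
}
model_shapes = {
    "transformer.ln_f.weight": (1024,),
    "transformer.ln_f.bias": (1024,),
    "lm_head.weight": (250880, 1024),
}

# Keys present in both dicts but with different shapes: these are the
# ones load_state_dict refuses to copy even with strict=False.
mismatched = {
    k: (checkpoint_shapes[k], model_shapes[k])
    for k in checkpoint_shapes
    if k in model_shapes and checkpoint_shapes[k] != model_shapes[k]
}
for name, (ckpt, model) in mismatched.items():
    print(f"{name}: checkpoint {ckpt} vs model {model}")
```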

qq31682216 avatar Mar 10 '23 06:03 qq31682216

I suggest we merge this issue with #3061 and request @ht-zhou's help on this.

JThh avatar Mar 10 '23 06:03 JThh