InternEvo
[Bug] NaN during fp16 inference after bf16 training
Describe the bug
To restate my problem: I trained with InternEvo in bf16, converted the checkpoint to HF format, and then hit the following error when running inference in fp16:
```
Traceback (most recent call last):
  File "/InternLM/hf_test.py", line 15, in <module>
    output = model.generate(**inputs, **gen_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1592, in generate
    return self.sample(
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2734, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```
This error depends on the torch_dtype used when loading the model: with torch_dtype=torch.bfloat16 or torch.float32 everything works, but torch.float16 triggers the failure. My understanding is that training in bf16 and then inferring in fp16 inherently introduces a precision mismatch: bf16 has more exponent bits than fp16 (8 vs. 5), so values that bf16 represents fine can overflow to inf in fp16, and a matrix multiply in, say, the attention computation then produces inf/nan that eventually reaches the sampling step. However, I see that the official internlm code also loads with torch.float16, so I would like to ask about this issue.
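To make the suspected mechanism concrete, here is a minimal sketch (not taken from the report itself) showing how a value that is fine in bf16 overflows to inf under an fp16 cast, and how an inf in the logits becomes the nan that torch.multinomial rejects in the traceback above:

```python
import torch

# fp16 keeps 5 exponent bits (max finite value 65504), while bf16 keeps 8,
# matching fp32's dynamic range.
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38

# A value that bf16 represents without trouble overflows to inf
# once the checkpoint is cast to fp16.
x = torch.tensor(1e5, dtype=torch.bfloat16)
print(x.to(torch.float16))              # inf

# An inf in the logits turns into nan after softmax, which is exactly
# what torch.multinomial then rejects.
logits = torch.tensor([float("inf"), 1.0])
print(torch.softmax(logits, dim=-1))    # tensor([nan, nan])
```

And a possible workaround sketch, assuming a standard transformers loading path (the checkpoint path below is a placeholder): load the model in the dtype it was trained in.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "./internlm-hf"  # placeholder: path to the converted HF checkpoint
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Loading in the training dtype (bf16) avoids the fp16 overflow;
# float32 also works, at the cost of doubling the memory footprint.
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()
```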
Environment
Official Docker image
Other information
No response