InternEvo
[Bug] NaN during fp16 inference after bf16 training
Describe the bug
To restate my problem: I trained with InternEvo in bf16, converted the checkpoint to HF format, and then hit the following error when running inference in fp16:
```
Traceback (most recent call last):
  File "/InternLM/hf_test.py", line 15, in <module>
    output = model.generate(**inputs, **gen_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1592, in generate
    return self.sample(
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2734, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```
This error depends on the torch_dtype used when loading the model: with torch_dtype=torch.bfloat16 or torch.float32 everything works, but torch.float16 triggers the failure. My understanding is that training in bf16 and then inferring in fp16 inherently introduces a precision mismatch: bf16 has more exponent bits than fp16 (8 vs. 5), so values that bf16 represents fine can overflow to inf in fp16, and a matrix multiply in, say, the attention computation then produces inf/nan that eventually reaches the sampling step. However, I see that the official internlm code also loads with torch.float16, so I would like to ask about this issue.
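To make the suspected mechanism concrete, here is a minimal sketch (not taken from the report itself) showing how a value that is fine in bf16 overflows to inf under an fp16 cast, and how an inf in the logits becomes the nan that torch.multinomial rejects in the traceback above:

```python
import torch

# fp16 keeps 5 exponent bits (max finite value 65504), while bf16 keeps 8,
# matching fp32's dynamic range.
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38

# A value that bf16 represents without trouble overflows to inf
# once the checkpoint is cast to fp16.
x = torch.tensor(1e5, dtype=torch.bfloat16)
print(x.to(torch.float16))              # inf

# An inf in the logits turns into nan after softmax, which is exactly
# what torch.multinomial then rejects.
logits = torch.tensor([float("inf"), 1.0])
print(torch.softmax(logits, dim=-1))    # tensor([nan, nan])
```

And a possible workaround sketch, assuming a standard transformers loading path (the checkpoint path below is a placeholder): load the model in the dtype it was trained in.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "./internlm-hf"  # placeholder: path to the converted HF checkpoint
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Loading in the training dtype (bf16) avoids the fp16 overflow;
# float32 also works, at the cost of doubling the memory footprint.
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()
```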
Environment
Official Docker image
Other information
No response