
[question] error message while finetuning

Open nevermet opened this issue 2 years ago • 2 comments

Dear all, I ran finetuning and while validating, I encountered this error message:

    iter 3198: loss nan, time: 123.08ms
    Validating ...
    .......
    lit-llama/generate.py", line 74, in generate
    idx_next = torch.multinomial(probs, num_samples=1).to(dtype=dtype)
    RuntimeError: probability tensor contains either inf, nan or element < 0

Could you tell me how I can solve this problem?

Thanks in advance.

nevermet, Oct 21 '23 04:10
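
The log above already reports `loss nan` at iter 3198, before validation starts, so the `torch.multinomial` error is probably a downstream symptom rather than the root cause: once the loss is NaN, the updated weights, the logits, and the sampled probabilities can all become NaN too. A minimal sketch of a guard that stops as soon as the loss stops being finite is shown here. The names `check_loss` and `NonFiniteLossError` are hypothetical and are not part of lit-llama.

```python
import math


class NonFiniteLossError(RuntimeError):
    """Raised when the training loss becomes NaN or infinite (hypothetical helper)."""


def check_loss(iter_num: int, loss: float) -> float:
    """Return the loss unchanged, or raise if it is NaN or +/-inf."""
    if not math.isfinite(loss):
        raise NonFiniteLossError(
            f"iter {iter_num}: loss is {loss}; stopping before validation"
        )
    return loss


# A finite loss passes through untouched.
check_loss(3197, 1.25)

# A NaN loss, like the one in the log, is caught at the step where it appears.
try:
    check_loss(3198, float("nan"))
except NonFiniteLossError as err:
    print(err)
```

In a PyTorch training loop this could be called as `check_loss(iter_num, loss.item())` right after the loss is computed, which makes it easier to see which iteration first diverged.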

It may or may not be related, but are you using --precision 16-true? I noticed that with some models it results in NaNs during training. If your GPU supports it, can you try bfloat16 ("brain float") precision, i.e. --precision bf16-true?

rasbt, Oct 21 '23 13:10
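
One reason the two 16-bit formats can behave differently: float16 has 5 exponent bits and a largest finite value of 65504, while bfloat16 keeps the 8 exponent bits of float32 and so covers roughly the same range, at the cost of precision. The sketch below illustrates this with the standard library only. The helper names are hypothetical, and the bfloat16 conversion simply truncates a float32 to its top 16 bits, whereas real hardware usually rounds to nearest.

```python
import struct


def fits_in_fp16(x: float) -> bool:
    """Return True if x can be stored as a finite IEEE 754 half-precision value."""
    try:
        struct.pack("<e", x)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


def to_bf16_truncated(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding.

    Assumption: plain truncation, used here for simplicity.
    """
    bits = struct.pack(">f", x)  # big-endian float32: sign and exponent come first
    return struct.unpack(">f", bits[:2] + b"\x00\x00")[0]


value = 70000.0
print(fits_in_fp16(65504.0))     # largest finite float16 value still fits
print(fits_in_fp16(value))       # 70000 is outside the float16 range
print(to_bf16_truncated(value))  # stays finite in bfloat16, but loses precision
```

With these helpers, 70000.0 cannot be stored in float16 at all, while the truncated bfloat16 value is the finite but coarser 69632.0. Intermediate values that overflow in float16 are one common route to `inf` and then `nan` during training.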

No, I did not use --precision 16-true.

nevermet, Oct 23 '23 00:10