litgpt RuntimeError: probability tensor contains either inf, nan or element < 0

I was able to do finetuning before but after the recent update I am getting this error:

RuntimeError: probability tensor contains either inf, nan or element < 0

The same error happens when I try to run inference with a previously fine-tuned model.

Jun 23 '23 23:06 sajjadriaj

Did you pull the latest changes? Which script did you run, and what arguments did you pass? Did you make any changes to the script?

Jun 24 '23 17:06 carmocca

Sorry for not providing the details. I pulled the latest changes and ran the adapter_v2.py script. I did not change the script other than the number of epochs. Fine-tuning runs, but after optimizer.step() the loss becomes NaN. I tried reducing the learning rate, but it does not help. My dataset is very small.

Also, I am training on four V100s, so there is no bfloat16 support. I am running in 16 bit.

Jun 26 '23 21:06 sajjadriaj

I am running in 16 bit.

--precision 16-mixed or --precision 16-true?

Jun 30 '23 01:06 carmocca

16-true. The training runs fine and I am able to use the generate script. However, I want to experiment with multiple prompts and really want to try the chat/interactive mode. Since there is no script for trying the fine-tuned model in chat mode, I tried to add the adapter to the model in the chat script, but it throws an error saying that the probability tensor has inf or nan values. I tried to make the generate script interactive as well, but the same thing happens :(

Jun 30 '23 17:06 sajjadriaj

bf16-true will most likely fix it

Aug 14 '23 12:08 carmocca
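
For anyone landing here with the same error, here is a minimal sketch of how one overflowing float16 logit can turn the whole sampling distribution into NaN. It is plain Python, not litgpt or PyTorch code. The helpers to_fp16_range and is_invalid are hypothetical names made up for illustration, and the float16 cast is only roughly imitated by clamping anything beyond 65504 (the largest finite float16 value) to infinity.

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value


def to_fp16_range(x):
    """Rough stand-in for a float16 cast (assumption): out-of-range values become +/-inf."""
    if x > FP16_MAX:
        return math.inf
    if x < -FP16_MAX:
        return -math.inf
    return x


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]  # inf - inf yields nan here
    total = sum(exps)
    return [e / total for e in exps]


def is_invalid(probs):
    """Hypothetical check that mirrors the wording of the error message."""
    return any(math.isnan(p) or math.isinf(p) or p < 0 for p in probs)


ok = softmax([to_fp16_range(v) for v in [1.0, 2.0, 3.0]])
bad = softmax([to_fp16_range(v) for v in [1.0, 70000.0, 3.0]])

print(is_invalid(ok))   # False
print(is_invalid(bad))  # True: one inf logit makes every probability nan
```

bfloat16 keeps the exponent range of float32, so the same logit would stay finite there. That is presumably why bf16-true is suggested above, though it does require hardware with bfloat16 support.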