Loss is NaN when finetuning with FSDP
The same script and settings on a single device do not produce NaNs.
Same issue here. Any suggestions on the finetuning learning rate or other settings to fix it?
iter 0 step 0: loss 1.7089, train time: 1409.17ms
iter 1 step 0: loss 2.1155, train time: 1410.42ms
iter 2 step 0: loss 1.9496, train time: 1410.41ms
iter 3 step 0: loss 1.7309, train time: 1220.40ms
iter 4 step 0: loss 1.7632, train time: 1410.42ms
iter 5 step 0: loss 1.8084, train time: 1410.44ms
iter 6 step 0: loss 1.8845, train time: 1410.35ms
iter 7 step 0: loss 2.3051, train time: 1410.54ms
iter 8 step 0: loss 1.7964, train time: 1410.41ms
iter 9 step 0: loss 2.4444, train time: 940.32ms
iter 10 step 0: loss 1.9692, train time: 1218.30ms
iter 11 step 0: loss 2.3963, train time: 1397.22ms
iter 12 step 0: loss 2.7146, train time: 1410.54ms
iter 13 step 0: loss 1.9210, train time: 1410.36ms
iter 14 step 0: loss 1.8052, train time: 1410.43ms
iter 15 step 1: loss 2.0308, train time: 1489.61ms (optimizer.step)
iter 16 step 1: loss nan, train time: 1405.01ms
iter 17 step 1: loss nan, train time: 1222.38ms
iter 18 step 1: loss nan, train time: 720.17ms
iter 19 step 1: loss nan, train time: 1410.33ms
iter 20 step 1: loss nan, train time: 1411.09ms
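Since the loss only turns NaN on the iteration right after the first optimizer.step, one thing worth checking is whether the fp16 gradients are already overflowing before that step (with 16-true there is no loss scaler to catch this). A minimal sketch, assuming a plain PyTorch-style loop where `model` and `optimizer` are placeholder names rather than the script's own variables:

```python
import torch

def grads_are_finite(model: torch.nn.Module) -> bool:
    # Return False as soon as any local gradient contains inf or NaN.
    # With FSDP the parameters/gradients are sharded, so this only inspects
    # the shard owned by the current rank.
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"non-finite gradient in {name}")
            return False
    return True

# Inside the training loop, right before stepping:
# if grads_are_finite(model):
#     optimizer.step()
# else:
#     optimizer.zero_grad()  # skip this step rather than corrupting the weights
```

If the gradients are still finite before the step and the loss goes NaN right after it, the learning rate or the fp16 optimizer state would be the next things to look at.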
I still have this problem using the newest source code.
I am using INCITE-3b with Dolly (taking about 8+ GB) on 1 GPU (2080 Ti) with 16-true precision.
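16-true runs the model and its gradients entirely in fp16 with no loss scaler, which makes an overflow on the first optimizer step much more likely. One thing to try is the mixed-precision setting instead; a hedged sketch using Lightning Fabric directly (the finetune script may expose this through a flag instead, and since the 2080 Ti has no native bf16 support, fp16 mixed precision is the practical choice here):

```python
from lightning.fabric import Fabric

# "16-mixed" keeps fp32 master weights and uses automatic loss scaling,
# unlike "16-true", which runs everything in fp16 with no scaler.
fabric = Fabric(accelerator="cuda", devices=1, precision="16-mixed")
fabric.launch()

# Then wrap the model and optimizer exactly as the script already does:
# model, optimizer = fabric.setup(model, optimizer)
```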
I'm also having this issue when running the Alpaca dataset example code from this tutorial: https://lightning.ai/pages/blog/falcon-a-guide-to-finetune-and-inference/
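For anyone trying to narrow this down, PyTorch's anomaly detection can report which backward op first produces a NaN (it slows every iteration, so only enable it for a handful of steps). A small sketch where `model`, `input_ids`, and `targets` are placeholders for whatever the finetune script already has in its loop:

```python
import torch
import torch.nn.functional as F

# Enable only while debugging: anomaly mode records forward traces and checks
# every backward op for NaN/inf, which makes each iteration noticeably slower.
with torch.autograd.detect_anomaly():
    logits = model(input_ids)  # model, input_ids, targets come from the existing loop
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    loss.backward()
```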