Loss is NaN while fine-tuning Falcon 7B
By following the instructions provided for fine-tuning Falcon 7B and leaving all parameters at their defaults, I could start fine-tuning, but after 60 iterations the loss is NaN. Could anyone explain what the issue might be? URGENT
See my reply here https://github.com/Lightning-AI/lit-parrot/issues/140#issuecomment-1590337763
TL;DR: use --precision bf16-mixed
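For anyone unsure what that flag changes: it ends up as the precision setting of Lightning Fabric, which the fine-tuning scripts use under the hood. A minimal sketch of the difference (a generic Fabric snippet, not the repo's exact code; the default precision is an assumption here):

```python
from lightning.fabric import Fabric

# bf16 keeps the fp32 exponent range, so large activations don't overflow to
# inf/NaN the way they can under plain fp16 mixed precision.
fabric = Fabric(devices=1, precision="bf16-mixed")  # instead of e.g. "16-mixed"
fabric.launch()
```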
Hi @carmocca - Another potential issue (which I ran into) is that the loss is NaN if all the tokens are masked out, which occurs when the input is >= 2048 tokens and MASK_INPUTS=True.
It might be helpful to trigger a warning in the script for this case, or to add a note to the README, in case people are confused about why the loss is still NaN after changing the precision.
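To illustrate the fully-masked case: with PyTorch's cross-entropy, if every target position equals the ignore index there is nothing to average over, and the mean-reduced loss comes out as NaN. A minimal reproduction (the shapes and vocab size below are purely illustrative, not taken from the repo):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 32000)            # 8 token positions, toy vocab size
targets = torch.full((8,), -100)          # every position masked out (ignore_index)

loss = F.cross_entropy(logits, targets, ignore_index=-100)
print(loss)  # tensor(nan) -- mean over zero unmasked tokens is 0/0
```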
I'm also following the Falcon 7B fine-tuning guide.
Using --precision bf16-mixed fixed my NaN loss problem when training on an NVIDIA RTX A6000. However, after 80 steps it then runs into an OOM error. I peeked at my VRAM usage: it started at about 29409 MiB / 49140 MiB and then jumped to 48205 MiB / 49140 MiB before finally dying from OOM.
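If you want to track that growth from inside the training loop rather than watching nvidia-smi, something along these lines works (a generic PyTorch snippet, not part of the repo's scripts):

```python
import torch

# Call after each training step to compare currently allocated vs. peak CUDA memory.
def log_cuda_memory(step: int) -> None:
    allocated = torch.cuda.memory_allocated() / 1024**2
    peak = torch.cuda.max_memory_allocated() / 1024**2
    print(f"step {step}: allocated {allocated:.0f} MiB, peak {peak:.0f} MiB")
```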
Closing, see https://github.com/Lightning-AI/lit-gpt/issues/159#issuecomment-1601122245 for more context on the memory usage