Loss is NaN while fine-tuning Falcon 7B
By following the instructions provided for fine-tuning Falcon 7B and leaving all parameters at their defaults, I could start fine-tuning, but after 60 iterations the loss is NaN. Could anyone explain what the issue might be? URGENT
See my reply here https://github.com/Lightning-AI/lit-parrot/issues/140#issuecomment-1590337763
TL;DR: use --precision bf16-mixed
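For anyone unsure what that flag changes: it ends up as the precision setting of Lightning Fabric, which the fine-tuning scripts use under the hood. A minimal sketch of the difference (a generic Fabric snippet, not the repo's exact code; the default precision is an assumption here):

```python
from lightning.fabric import Fabric

# bf16 keeps the fp32 exponent range, so large activations don't overflow to
# inf/NaN the way they can under plain fp16 mixed precision.
fabric = Fabric(devices=1, precision="bf16-mixed")  # instead of e.g. "16-mixed"
fabric.launch()
```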
Hi @carmocca - Another potential issue (which I ran into) is that the loss is NaN if all the tokens are masked out, which occurs when the input is >= 2048 tokens and MASK_INPUTS=True.
It might be helpful to trigger a warning in the script for this case, or to add a note to the README, in case people are confused about why the loss is still NaN after changing the precision.
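To illustrate the fully-masked case: with PyTorch's cross-entropy, if every target position equals the ignore index there is nothing to average over, and the mean-reduced loss comes out as NaN. A minimal reproduction (the shapes and vocab size below are purely illustrative, not taken from the repo):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 32000)            # 8 token positions, toy vocab size
targets = torch.full((8,), -100)          # every position masked out (ignore_index)

loss = F.cross_entropy(logits, targets, ignore_index=-100)
print(loss)  # tensor(nan) -- mean over zero unmasked tokens is 0/0
```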
I'm also following the Falcon 7B fine-tuning guide.
Using --precision bf16-mixed fixed my NaN loss problem when training on an NVIDIA RTX A6000. However, after 80 steps it then runs into an OOM error. I peeked at my VRAM usage: it started at about 29409 MiB / 49140 MiB and then jumped to 48205 MiB / 49140 MiB before finally dying from OOM.
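If you want to track that growth from inside the training loop rather than watching nvidia-smi, something along these lines works (a generic PyTorch snippet, not part of the repo's scripts):

```python
import torch

# Call after each training step to compare currently allocated vs. peak CUDA memory.
def log_cuda_memory(step: int) -> None:
    allocated = torch.cuda.memory_allocated() / 1024**2
    peak = torch.cuda.max_memory_allocated() / 1024**2
    print(f"step {step}: allocated {allocated:.0f} MiB, peak {peak:.0f} MiB")
```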
Closing, see https://github.com/Lightning-AI/lit-gpt/issues/159#issuecomment-1601122245 for more context on the memory usage