Fine-tuning error after the training: probability tensor contains either `inf`, `nan` or element < 0

Open cosmin-z opened this issue 1 year ago • 2 comments

Hi,

I am trying to do a fine-tuning based on this guide: https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/finetune_adapter.md.

I launched this command:

    python finetune/adapter.py \
      --data_dir data/mydata/ \
      --checkpoint_dir checkpoints/stabilityai/stablelm-base-alpha-3b \
      --out_dir data/mydata-finetuned

with the precision flag: --precision 16-true
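A note on the precision setting, as a possible (unconfirmed) contributor to the `loss nan` reported in the log: float16 has a much narrower range than float32, so large intermediate values can overflow to `inf`, and `inf - inf` then yields `nan`. The snippet below is only a minimal sketch of that numeric behavior with a made-up value. It does not reproduce the training run, and the thread does not confirm that this is the cause.

```python
import torch

# Hypothetical large activation; the value is chosen only to exceed the float16 range.
x = torch.tensor([70000.0], dtype=torch.float32)

half = x.to(torch.float16)    # overflows: float16 cannot represent values this large
bf16 = x.to(torch.bfloat16)   # stays finite: bfloat16 keeps the float32 exponent range

print("float16 :", half)
print("bfloat16:", bf16)

# Once an inf appears, ordinary arithmetic can turn it into nan.
diff = half.float() - half.float()
print("inf - inf:", diff)
```

If the hardware supports it, trying a bfloat16 or mixed-precision setting is a common first experiment when a float16 run diverges.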

And I got this error:

iter 9599 step 600: loss nan, iter time: 65.33ms (optimizer.step)
Validating ...
Recommend a movie for me to watch during the weekend and explain the reason.
Traceback (most recent call last):
  File "/home/c_zaharia/lit-gpt/finetune/adapter.py", line 299, in <module>
    CLI(setup)
  File "/opt/conda/lib/python3.10/site-packages/jsonargparse/_cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/opt/conda/lib/python3.10/site-packages/jsonargparse/_cli.py", line 147, in _run_component
    return component(**cfg)
  File "/home/c_zaharia/lit-gpt/finetune/adapter.py", line 77, in setup
    fabric.launch(main, data_dir, checkpoint_dir, out_dir)
  File "/opt/conda/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 781, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 863, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 868, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/home/c_zaharia/lit-gpt/finetune/adapter.py", line 115, in main
    train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir, speed_monitor)
  File "/home/c_zaharia/lit-gpt/finetune/adapter.py", line 202, in train
    val_loss = validate(fabric, model, val_data, tokenizer, longest_seq_length)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/c_zaharia/lit-gpt/finetune/adapter.py", line 233, in validate
    output = generate(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/c_zaharia/lit-gpt/generate/base.py", line 75, in generate
    idx_next = torch.multinomial(probs, num_samples=1).to(dtype=dtype)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
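For readers hitting the same traceback: the last frame fails because `torch.multinomial` refuses to sample from probabilities that are not finite and non-negative. Since the log already shows `loss nan` before validation starts, the sampling error is most likely a symptom of the earlier divergence rather than a separate bug. The sketch below reproduces the message with made-up logits. `safe_sample` is a hypothetical helper for illustration and is not part of lit-gpt.

```python
import torch

# Made-up logits standing in for model output after the loss has become nan.
logits = torch.tensor([1.0, float("nan"), 0.5])
probs = torch.softmax(logits, dim=-1)  # a single nan makes every probability nan


def safe_sample(probs: torch.Tensor) -> torch.Tensor:
    """Sample one index, failing early with a more descriptive message.

    Illustrative helper only; it is not a lit-gpt function.
    """
    if not torch.isfinite(probs).all() or (probs < 0).any():
        raise ValueError(
            "probabilities are non-finite or negative; "
            "check for nan in the loss or the model weights"
        )
    return torch.multinomial(probs, num_samples=1)


# Calling torch.multinomial directly reproduces the error from the traceback.
try:
    torch.multinomial(probs, num_samples=1)
except RuntimeError as err:
    print(f"torch.multinomial failed: {err}")
```

A check like this only surfaces the problem earlier. The actual fix has to address why the loss became nan during training.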

I don't understand how to solve this problem. Thanks for your help.

cosmin-z, Jun 30 '23 12:06

Fixed by #1185

ansh, Sep 11 '23 08:09