
probability tensor contains either `inf`, `nan` or element < 0

Open · shuwang127 opened this issue 1 year ago · 1 comment

```
Traceback (most recent call last):
  File "/finetune/adapter.py", line 305, in <module>
    CLI(setup)
  File ".../anaconda3/envs/inspect/lib/python3.10/site-packages/jsonargparse/cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File ".../anaconda3/envs/inspect/lib/python3.10/site-packages/jsonargparse/cli.py", line 147, in _run_component
    return component(**cfg)
  File "finetune/adapter.py", line 74, in setup
    fabric.launch(main, data_dir, checkpoint_dir, out_dir)
  File ".../anaconda3/envs/inspect/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 759, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File ".../anaconda3/envs/inspect/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 841, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File ".../anaconda3/envs/inspect/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 846, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/finetune/adapter.py", line 111, in main
    train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir, max_seq_length, speed_monitor)
  File "/finetune/adapter.py", line 195, in train
    val_loss = validate(fabric, model, val_data, tokenizer, max_seq_length)
  File ".../anaconda3/envs/inspect/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/finetune/adapter.py", line 227, in validate
    output = generate(
  File ".../anaconda3/envs/inspect/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Inspect/generate/base.py", line 75, in generate
    idx_next = torch.multinomial(probs, num_samples=1).to(dtype=dtype)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```

I hit this error when finetuning RedPajama-INCITE-Base-3B-v1 on the dolly dataset. Do you know how to fix it? My PyTorch version is `2.1.0.dev20230523+cu117`.

shuwang127 · Jun 21 '23 14:06
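For context on the error itself: `torch.multinomial` raises this exact `RuntimeError` whenever the probability tensor it receives contains `inf`, `nan`, or a negative entry, which typically happens when logits overflow in float16 before the softmax. A minimal standalone sketch (illustrative values, not litgpt code) that reproduces the failure mode:

```python
import torch

# float16 can only represent values up to ~65504, so an overflowing logit
# becomes inf already at construction time.
logits = torch.tensor([70000.0, 1.0, 2.0], dtype=torch.float16)
print(logits)  # tensor([inf, 1., 2.], dtype=torch.float16)

# The numerically stable softmax subtracts the max, so inf - inf = nan,
# and the nan sum poisons the whole distribution.
probs = torch.softmax(logits, dim=-1)
print(probs)  # tensor([nan, nan, nan], dtype=torch.float16)

# Sampling from it raises:
# RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
idx_next = torch.multinomial(probs.float(), num_samples=1)
```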

Try passing `--precision bf16-mixed` or `--precision 16-mixed`. I just made this switch in the defaults with #175.

carmocca · Jun 21 '23 16:06
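For reference, the flag maps onto Fabric's `precision` argument; roughly (a sketch of the setting only, not the adapter script itself):

```python
from lightning.fabric import Fabric

# "bf16-mixed" keeps parameters in float32 and autocasts ops to bfloat16,
# whose much wider exponent range avoids the float16 overflow shown above.
# "16-mixed" autocasts to float16 but applies loss scaling to limit overflow.
fabric = Fabric(devices=1, precision="bf16-mixed")
fabric.launch()
```

Note that `bf16-mixed` only helps on hardware with native bfloat16 support (Ampere-class GPUs such as the A100), which may also explain why switching GPUs resolves the issue later in this thread.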

I ran into a similar problem, and after passing `--precision 16-mixed` the problem still exists @carmocca

MasterEndless · Jun 21 '23 22:06

What about your loss values? Do they become `nan` after step 1? I believe this might be related to the nan-loss problem.

shuwang127 · Jun 22 '23 02:06
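One quick way to check this is to assert that the loss stays finite at every step, so training fails fast instead of sampling from a `nan` probability tensor later during validation. A minimal sketch (the `check_loss` helper is hypothetical, not part of `adapter.py`):

```python
import torch

def check_loss(loss: torch.Tensor, step: int) -> None:
    # Fail at the first non-finite loss rather than much later in generate().
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss at step {step}: {loss.item()}")

check_loss(torch.tensor(0.83), step=0)           # passes silently
check_loss(torch.tensor(float("nan")), step=1)   # raises RuntimeError
```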

@shuwang127 The first loss is `nan`, and training soon runs OOM.

MasterEndless · Jun 22 '23 03:06

Solved when I switched to an A100.

shuwang127 · Jun 22 '23 16:06

@shuwang127 You used an A100 to fine-tune a 3B model?

kannangce · Nov 01 '23 05:11

Yes.

shuwang127 · Nov 01 '23 06:11