Traceback (most recent call last):
File "/finetune/adapter.py", line 305, in
CLI(setup)
File ".../anaconda3/envs/inspect/lib/python3.10/site-packages/jsonargparse/cli.py", line 85, in CLI
return _run_component(component, cfg_init)
File ".../anaconda3/envs/inspect/lib/python3.10/site-packages/jsonargparse/cli.py", line 147, in _run_component
return component(**cfg)
File "finetune/adapter.py", line 74, in setup
fabric.launch(main, data_dir, checkpoint_dir, out_dir)
File ".../anaconda3/envs/inspect/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 759, in launch
return self._wrap_and_launch(function, self, *args, **kwargs)
File ".../anaconda3/envs/inspect/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 841, in _wrap_and_launch
return to_run(*args, **kwargs)
File ".../anaconda3/envs/inspect/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 846, in _wrap_with_setup
return to_run(*args, **kwargs)
File "/finetune/adapter.py", line 111, in main
train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir, max_seq_length, speed_monitor)
File "/finetune/adapter.py", line 195, in train
val_loss = validate(fabric, model, val_data, tokenizer, max_seq_length)
File ".../anaconda3/envs/inspect/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/finetune/adapter.py", line 227, in validate
output = generate(
File ".../anaconda3/envs/inspect/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/Inspect/generate/base.py", line 75, in generate
idx_next = torch.multinomial(probs, num_samples=1).to(dtype=dtype)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
I hit this error when finetuning RedPajama-INCITE-Base-3B-v1 on the Dolly dataset. Do you know how to fix it?
My PyTorch version is "2.1.0.dev20230523+cu117".
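For context, `torch.multinomial` raises this error whenever the probability tensor it receives contains a non-finite value, which usually means the logits coming out of the model already contain NaN/inf (for example after an fp16 overflow). A minimal, self-contained sketch of the failure mode (illustrative only, not the actual finetuning code):

```python
import torch

# NaN logits (e.g. from an fp16 overflow in the model) propagate through
# softmax into the probability tensor that torch.multinomial then rejects.
logits = torch.tensor([[1.0, float("nan"), 2.0]])
probs = torch.softmax(logits, dim=-1)
print(torch.isnan(probs).any())  # tensor(True)

try:
    torch.multinomial(probs, num_samples=1)
except RuntimeError as err:
    print(err)  # probability tensor contains either `inf`, `nan` or element < 0
```

So the multinomial call in `generate` is typically only where the NaNs surface; the root cause is upstream in the forward pass.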
Try passing `--precision bf16-mixed` or `--precision 16-mixed`. I just switched the default in #175.
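For anyone reading along: the flag maps onto Fabric's `precision` argument. A minimal sketch, assuming the standalone Fabric API:

```python
from lightning.fabric import Fabric

# "bf16-mixed" keeps the weights in fp32 and runs the forward pass in
# bfloat16, which has the same dynamic range as fp32 and is therefore far
# less prone to overflowing into inf/nan than plain fp16 ("16-mixed").
fabric = Fabric(devices=1, precision="bf16-mixed")
fabric.launch()
```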
I ran into a similar problem, and after passing `--precision 16-mixed` the problem still exists @carmocca
What do your loss values look like? Do they become NaN after step 1? I believe this might be related to the NaN-loss problem.
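A quick way to check is to assert on the loss right after it is computed. A small illustrative helper (hypothetical, not part of the repo):

```python
import torch

def check_finite(loss: torch.Tensor, step: int) -> None:
    """Fail fast before a NaN loss propagates into the weights and,
    later, into the sampling step inside generate()."""
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss at step {step}: {loss.item()}")

check_finite(torch.tensor(0.7), step=0)             # passes silently
# check_finite(torch.tensor(float("nan")), step=1)  # would raise RuntimeError
```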
@shuwang127 The first loss is NaN, and the run soon goes OOM.
Solved when I switched to an A100.
@shuwang127 You used an A100 to fine-tune a 3B model?