llama-cookbook
Model Checkpoint NOT saved, Eval Loss "Inf"
System Info
torchrun --nproc_per_node 3 llama_finetuning.py \
  --enable_fsdp \
  --low_cpu_fsdp \
  --use_peft \
  --dataset custom_dataset \
  --peft_method lora \
  --model_name /Llama-2-13b-chat-hf \
  --pure_bf16 \
  --dist_checkpoint_root_folder /dist_chkpt_root_folder \
  --output_dir /llama-recipes/models
Information
- [ ] The official example scripts
- [ ] My own modified scripts
🐛 Describe the bug
I fine-tuned llama-2-13b-chat on a custom dataset. Training finished, but the eval loss is reported as inf, the model checkpoint was not saved, and no error message was printed. The last few lines of the training log are shown under Error logs below. Any help would be appreciated; this run took four days of compute and the result was lost.
Error logs
........
Epoch 3: train_perplexity=1.3212, train_epoch_loss=0.2786, epcoh time 112076.98777786875s
Key: avg_train_prep, Value: 1.387495994567871
Key: avg_train_loss, Value: 0.3265116214752197
Key: avg_eval_prep, Value: nan
Key: avg_eval_loss, Value: inf
Key: avg_epoch_time, Value: 110888.28061801738
Key: avg_checkpoint_time, Value: 0.00011723674833774567
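For context, these perplexity values are consistent with perplexity being computed as the exponential of the epoch loss, which is the convention in the llama-recipes training utilities (treat the exact code as an assumption). A minimal sketch:

```python
import torch

# Values taken from the log above.
train_epoch_loss = torch.tensor(0.2786)
avg_eval_loss = torch.tensor(float("inf"))

# Perplexity as exp(loss): exp(0.2786) ~ 1.32, matching train_perplexity=1.3212.
print(torch.exp(train_epoch_loss))  # tensor(1.3213)

# An inf eval loss makes the derived eval perplexity non-finite as well,
# so neither avg_eval_loss nor avg_eval_prep carries a usable value.
print(torch.exp(avg_eval_loss))  # tensor(inf)
```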
Expected behavior
The model checkpoint should have been saved to the specified output directory.
Hi, an eval loss of inf will prevent the checkpoint from being saved, because the new eval loss is compared against a best eval loss that is initialized to inf (see the comparison and the initialization of the best value in the training source). To investigate, you can disable the comparison with the best value in the source file so that a checkpoint is written, and then run a manual eval to figure out why your eval loss is inf on your data.
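A minimal sketch of the gating pattern described above (the names best_val_loss, eval_epoch_loss, and save_checkpoint below are illustrative assumptions, not copied from the llama-recipes source):

```python
import math

def save_checkpoint():
    # Hypothetical stand-in for the PEFT/FSDP checkpoint-saving code path.
    print("we are about to save the PEFT modules")

def maybe_save(eval_epoch_loss, best_val_loss, save_model=True):
    """Save only when the new eval loss improves on the best one seen so far."""
    if save_model and eval_epoch_loss < best_val_loss:
        save_checkpoint()
        best_val_loss = eval_epoch_loss
    return best_val_loss

best_val_loss = float("inf")

# A finite eval loss beats the initial inf, so a checkpoint is written.
best_val_loss = maybe_save(0.433, best_val_loss)

# An inf eval loss does not: `inf < inf` is False, so the save step is skipped
# silently, which matches the behaviour reported in this issue.
best_val_loss = maybe_save(math.inf, float("inf"))
```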
I tried commenting out the eval loss check as suggested, and training finished; however, an OOM error was raised while saving the checkpoint:
evaluating Epoch: 100%|██████████| 1/1 [00:12<00:00, 12.43s/it]
eval_ppl=tensor(1.5420, device='cuda:0') eval_epoch_loss=tensor(0.4330, device='cuda:0')
we are about to save the PEFT modules
Traceback (most recent call last):
File "/../llama-recipes/llama_finetuning.py", line 253, in
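For rough context on why the save step itself can OOM (this is back-of-the-envelope reasoning and an assumption about the failure mode, not something the truncated traceback above confirms): if the checkpoint path materializes a full unsharded copy of the 13B base model on one GPU, that copy alone is on the order of 26 GB in bf16.

```python
# Back-of-the-envelope sketch; the parameter count and the assumption that the
# save path gathers a full unsharded bf16 copy on a single device are both
# approximations, not facts taken from the traceback.
n_params = 13e9        # approximate Llama-2-13b parameter count
bytes_per_param = 2    # bf16
print(f"~{n_params * bytes_per_param / 1e9:.0f} GB for one unsharded bf16 copy")  # ~26 GB
# On a GPU that already holds its weight shard, optimizer state, and activations,
# an extra copy of that size can easily trigger a CUDA OOM at checkpoint time.
```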
Any solution yet? I am hitting the same problem.