
Model Checkpoint NOT saved, Eval Loss "Inf"

Open · shailja-thakur opened this issue 2 years ago · 3 comments

System Info

torchrun \
  --nproc_per_node 3 \
  llama_finetuning.py \
  --enable_fsdp \
  --low_cpu_fsdp \
  --use_peft \
  --dataset custom_dataset \
  --peft_method lora \
  --model_name /Llama-2-13b-chat-hf \
  --pure_bf16 \
  --dist_checkpoint_root_folder /dist_chkpt_root_folder \
  --output_dir /llama-recipes/models

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

🐛 Describe the bug

The following is a snippet of the last few lines of the log file for Llama-2-13b-chat fine-tuned on a custom dataset. Training finished; however, the eval loss is inf.

The model checkpoint was also not saved, and I did not see any error message. Any help would be appreciated; I spent four days of compute on this training run and it was wasted.

........

Epoch 3: train_perplexity=1.3212, train_epoch_loss=0.2786, epcoh time 112076.98777786875s
Key: avg_train_prep, Value: 1.387495994567871
Key: avg_train_loss, Value: 0.3265116214752197
Key: avg_eval_prep, Value: nan
Key: avg_eval_loss, Value: inf
Key: avg_epoch_time, Value: 110888.28061801738
Key: avg_checkpoint_time, Value: 0.00011723674833774567
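For context on how these numbers relate, here is a minimal sketch, assuming perplexity is computed as exp(loss), which is consistent with the epoch-level numbers above; it is not the exact llama-recipes code:

```python
import math

# Epoch 3 values from the log above: exp(loss) reproduces the reported perplexity.
train_epoch_loss = 0.2786
print(math.exp(train_epoch_loss))  # ~1.3212, matches train_perplexity

# A non-finite eval loss propagates into every derived metric, so the eval
# perplexity and any "best loss" bookkeeping downstream become unusable too.
print(math.exp(float("inf")))  # inf
print(math.exp(float("nan")))  # nan
```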

Error logs

Same as the snippet above; no error message or traceback was printed.

Expected behavior

The model checkpoint should have been saved in the specified output directory.

shailja-thakur avatar Sep 04 '23 03:09 shailja-thakur

Hi, an eval loss of inf will prevent the checkpoint from being saved, because it is compared against an initial best eval loss of inf (the comparison and the initial best eval value are in the training loop source). To investigate, you can deactivate the comparison with the best value in the source file so a checkpoint gets written, and then run an eval manually to figure out why your eval loss is inf on your data. A sketch of this guard logic is below.
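A minimal sketch of that guard, assuming the usual "save only if the eval loss improved" pattern (identifiers here are illustrative, not the exact llama-recipes code):

```python
# Illustrative checkpoint guard: the initial "best" eval loss is inf, so an
# eval loss that is itself inf can never satisfy the strict comparison below.
best_val_loss = float("inf")

def maybe_save_checkpoint(eval_epoch_loss: float, save_fn) -> None:
    global best_val_loss
    # inf < inf is False, so the save branch is skipped every epoch
    if eval_epoch_loss < best_val_loss:
        best_val_loss = eval_epoch_loss
        save_fn()

# Debugging workaround described above: drop the comparison and call save_fn()
# unconditionally, then evaluate the saved adapter manually to see why the
# eval loss is non-finite.
```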

mreso avatar Sep 06 '23 18:09 mreso

I tried commenting out the eval-loss comparison, and training finished; however, when saving the checkpoint, a CUDA OOM error was raised:

evaluating Epoch: 100%|██████████| 1/1 [00:12<00:00, 12.43s/it]
eval_ppl=tensor(1.5420, device='cuda:0') eval_epoch_loss=tensor(0.4330, device='cuda:0')
we are about to save the PEFT modules
Traceback (most recent call last):
  File "/../llama-recipes/llama_finetuning.py", line 253, in fire.Fire(main)
  .................................................
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacty of 79.15 GiB of which 2.06 MiB is free. Including non-PyTorch memory, this process has 79.13 GiB memory in use. Of the allocated memory 77.91 GiB is allocated by PyTorch, and 187.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at ../c10/cuda/CUDACachingAllocator.cpp:1438 (most recent call first):
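For reference, a common mitigation for this kind of save-time OOM under FSDP is to gather the full state dict to CPU on rank 0 before writing it out, so the gather does not land on an already-full GPU 0. A minimal sketch, assuming a plain FSDP-wrapped model; the function name and output filename are illustrative, not the llama-recipes API:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    StateDictType,
)

def save_full_state_dict_cpu(model: FSDP, output_dir: str) -> None:
    """Gather sharded weights to CPU on rank 0 only, then save from rank 0."""
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        state_dict = model.state_dict()  # materialized on CPU; empty on other ranks
    if dist.get_rank() == 0:
        os.makedirs(output_dir, exist_ok=True)
        torch.save(state_dict, os.path.join(output_dir, "model_state.pt"))
    dist.barrier()  # keep the other ranks from racing ahead of the save
```

If only the LoRA adapter needs saving, the same idea applies: whatever state is gathered at save time should be offloaded to CPU rather than materialized on GPU 0.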

shailja-thakur avatar Sep 20 '23 04:09 shailja-thakur

Any solution yet? Same problem here.

iisleepalot avatar Nov 01 '23 08:11 iisleepalot