
Stored PEFT checkpoint is corrupted

Open hassanzadeh opened this issue 1 year ago • 6 comments

System Info

Setup: PyTorch nightly, CUDA 11.8, GPU: 8x A100

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

🐛 Describe the bug

Hey guys, I'm using the finetuning script to train a Llama 7B model on a simple task with PEFT, in bfloat16. At the end of each epoch, model.save_pretrained() stores the PEFT adapter. I see that save_pretrained is called on all ranks; does that mean it takes care of distributed saving? If so, why do we get a single file (i.e., adapter_model.bin) rather than multiple shards?

Regardless, this seems to produce a corrupted binary file more often than not: out of, say, 10 epochs, only the file stored for one epoch is healthy and loads with PeftModel.from_pretrained; the others crash with the error below. Any thoughts on why this is happening?
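For reference, this is roughly how I load the adapter back; the model id and adapter path below are placeholders, not my exact values:

```python
# Rough sketch of my loading code; the model id and adapter path are placeholders.
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM

base = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)
# Loads fine for roughly one epoch out of ten; the rest crash with the error below.
model = PeftModel.from_pretrained(base, "output_dir/epoch_7")
```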

Error logs

[Screenshot: error traceback from loading the saved adapter]

Expected behavior

The stored PEFT weight file should load correctly every time; instead it is corrupted more often than not.

hassanzadeh avatar Sep 14 '23 17:09 hassanzadeh

@hassanzadeh thanks for opening this issue. If you are using PEFT+FSDP, we need to save on all ranks. However, we always save the checkpoint with the best eval score, so at the end you will have a single checkpoint. (We may reconsider this and also save every n epochs.)
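In other words, the saving policy is roughly the following (an illustrative sketch, not the actual llama-recipes training loop; the callables are stand-ins):

```python
# Illustrative sketch of the "keep only the best eval checkpoint" policy
# described above; the callables are stand-ins, not the real llama-recipes code.
def train_and_save_best(model, num_epochs, train_one_epoch, evaluate, save_peft_checkpoint):
    best_eval_loss = float("inf")
    for epoch in range(num_epochs):
        train_one_epoch(model, epoch)
        eval_loss = evaluate(model)
        if eval_loss < best_eval_loss:
            best_eval_loss = eval_loss
            save_peft_checkpoint(model)  # overwrites the single saved adapter
    return best_eval_loss
```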

Could you please elaborate a bit more? I couldn't reproduce this issue. You train with FSDP+PEFT, and then loading the PEFT model back fails with the above error?

HamidShojanazeri avatar Sep 17 '23 21:09 HamidShojanazeri

Thanks @HamidShojanazeri :) Yes, I'm indeed using PEFT+FSDP. One thing made me wonder whether something goes wrong when we save the model: it frequently happens that the config file is corrupted (see the screenshot below). That suggests all (say, 8) processes write the same content to the same file at once, which made me wonder whether the same thing happens to the stored weights. The other thing I was thinking about: if each process stores its own part of the PEFT weights, why don't we have them sharded (i.e., 8 weight files instead of 1)? How is it possible to have 8 processes writing into the same file at the same time (though I understand that is more of a PyTorch question)? Thanks again.

[Screenshot: corrupted PEFT config file contents]
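To make the sharding question concrete, this is the kind of per-rank sharded save I had in mind, a sketch using the standard PyTorch distributed checkpoint API rather than the actual llama-recipes saving code:

```python
# Sketch of a sharded FSDP save (one shard file per rank) using the standard
# torch.distributed.checkpoint API; not the llama-recipes saving code.
import torch.distributed.checkpoint as dist_cp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType

def save_sharded(model, checkpoint_dir):
    # Each rank writes only its own shards of the FSDP-wrapped model.
    with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
        state_dict = {"model": model.state_dict()}
        dist_cp.save_state_dict(
            state_dict=state_dict,
            storage_writer=dist_cp.FileSystemWriter(checkpoint_dir),
        )
```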

hassanzadeh avatar Sep 18 '23 11:09 hassanzadeh

Besides that, I don't think the way this code stores the model is right. PyTorch has a bug in storing FSDP checkpoints, and as far as I know it hasn't been resolved, so the checkpoints being stored are wrong. There is a workaround, but for a PEFT model we would need to merge and unload the model, which causes a crash. I'm not sure how that can be resolved.
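The workaround I mean is PEFT's merge_and_unload(); roughly this, assuming peft_model is a LoRA PeftModel that is not wrapped in FSDP at merge time:

```python
# Sketch of the merge-and-unload workaround mentioned above; assumes `peft_model`
# is a LoRA PeftModel that is NOT wrapped in FSDP at merge time.
def merge_and_save(peft_model, output_dir):
    merged = peft_model.merge_and_unload()  # fold the LoRA deltas into the base weights
    merged.save_pretrained(output_dir)      # writes a plain transformers checkpoint, no adapter
    return merged
```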

hassanzadeh avatar Sep 18 '23 19:09 hassanzadeh

I am having a similar issue. I am using 2 A10 GPUs. The scripts work fine with the 7B model: I can train, save, and load it. But the 13B model does not save correctly with PEFT (LoRA) and FSDP. If I save the model before any training, it creates a 25.6 GB adapter_model.bin file that loads correctly. But once I start training, adapter_model.bin comes out at 14.3 GB and can't be loaded. I have never gotten a valid file with the 13B model.

pcardill0 avatar Oct 04 '23 14:10 pcardill0

My issue was solved by the fix in issue #113, which requires enough CPU memory to hold the entire model before saving.
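For anyone else hitting this, the fix boils down to gathering the full, unsharded state dict onto CPU on rank 0 before calling save_pretrained, which is why it needs enough host memory for the whole model. A rough sketch using the standard FSDP API (not the exact patch from #113); here `model` is assumed to be the FSDP-wrapped PeftModel:

```python
# Rough sketch of the idea behind the fix: gather the full state dict onto CPU
# on rank 0, then write the adapter from that single rank. Not the exact #113 patch.
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import FullStateDictConfig, StateDictType

def save_full_adapter(model, output_dir):
    # offload_to_cpu=True is what makes this need CPU RAM for the full model.
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        state_dict = model.state_dict()
    if dist.get_rank() == 0:
        # PEFT's save_pretrained accepts a precomputed state_dict keyword.
        model.save_pretrained(output_dir, state_dict=state_dict)
    dist.barrier()
```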

pcardill0 avatar Oct 04 '23 17:10 pcardill0

I am having the same issue as well, @HamidShojanazeri.

hithesh-sankararaman avatar Oct 27 '23 10:10 hithesh-sankararaman