
Corrupted model when saving checkpoints

Open clechristophe opened this issue 1 year ago • 9 comments

System Info

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.25.0
[pip3] torch==2.0.1+cu118
[conda] cudatoolkit 11.8.0 h6a678d5_0
[conda] numpy 1.25.0 pypi_0 pypi
[conda] torch 2.0.1+cu118 pypi_0 pypi

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

🐛 Describe the bug

I have fine-tuned Llama 2 on some datasets for several epochs using LoRA, and I modified the script to save a checkpoint after each epoch with:

model.save_pretrained(checkpoint_dir)

To evaluate my model, I wanted to load each checkpoint and run some inference with it. I tried the Hugging Face PEFT library to load the model and the adapter: model = PeftModel.from_pretrained(model_dir, lora_apply_dir)

as well as torch.load(lora_apply_dir + "adapter_model.bin")

For the majority of the checkpoints (not all of them; some load fine), I get this error:

RuntimeError: PytorchStreamReader failed reading zip archive: invalid header or archive is corrupted

The .bin files created for each checkpoint have exactly the same size and seem complete, so I don't understand why loading works correctly for some checkpoints and not for others...
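
For what it's worth, torch.save has written checkpoints as zip archives since PyTorch 1.6, so one rough diagnostic for a truncated or partially overwritten adapter_model.bin is to check whether the file is still a valid zip archive (a minimal sketch; lora_apply_dir is the adapter directory from the report above):

import os
import zipfile

path = os.path.join(lora_apply_dir, "adapter_model.bin")
# A truncated or partially overwritten torch.save file usually fails this check.
print(path, "is a valid zip archive:", zipfile.is_zipfile(path))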

Error logs

File [~/miniconda3/envs/megatron/lib/python3.10/site-packages/peft/peft_model.py:203] in PeftModel.from_pretrained(cls, model, model_id, adapter_name, is_trainable, **kwargs)
    201 else:
    202     model = MODEL_TYPE_TO_PEFT_MODEL_MAPPING[config.task_type](model, config, adapter_name)
--> 203 model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs)
    204 return model
...
File [~/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/serialization.py:283] in _open_zipfile_reader.__init__(self, name_or_buffer)
    282 def __init__(self, name_or_buffer) -> None:
--> 283     super().__init__(torch._C.PyTorchFileReader(name_or_buffer))

RuntimeError: PytorchStreamReader failed reading zip archive: invalid header or archive is corrupted

Expected behavior

Load the adapter correctly

clechristophe avatar Aug 10 '23 06:08 clechristophe

@clechristophe a couple of notes: first, make sure that after fine-tuning your save_dir contains adapter_config.json and adapter_model.bin. Also, make sure you load the base model separately first and then attach the adapter:

from transformers import LlamaForCausalLM
from peft import PeftModel
model = LlamaForCausalLM.from_pretrained(model_name)
model = PeftModel.from_pretrained(model, lora_apply_dir)
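
As a quick sanity check before loading, you can verify that both files are actually present in the checkpoint directory (a minimal sketch; lora_apply_dir is the adapter directory from the report above):

import os

expected = ["adapter_config.json", "adapter_model.bin"]
missing = [f for f in expected if not os.path.isfile(os.path.join(lora_apply_dir, f))]
if missing:
    print(f"Missing adapter files in {lora_apply_dir}: {missing}")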

HamidShojanazeri avatar Aug 10 '23 16:08 HamidShojanazeri

I've got the same issue. Has anyone been able to pin down the root cause of this? I had no issues with saving/reloading PEFT models for the 7b chat model but am unable to reload PEFT models that I saved for the 13b chat model.

I also noticed that HF must be internalizing the logic for which rank is responsible for saving PEFT checkpoints (we call save_pretrained on all workers, not just rank 0). Is there something in there that's causing issues?

Quick update on this: I'm seeing no issues when I train on up to and including 10 GPUs. As soon as I bump up to 11 GPUs or higher, I see the issue reported above. This appears to be a consistent threshold across a few different configurations in which I varied the model (e.g. 7B vs. 13B, and even GPT-2 as a sanity check). It makes me wonder if the workers are stepping on each other's toes when saving, or if there is some issue with the divisibility of the parameter count by the number of workers.

Please let me know if you have any thoughts on this. I've got a 16-GPU machine on Google Cloud, I'm in the middle of a rather large experiment, and it's unfortunate to be idling 6 of these GPUs for my PEFT runs.

Edit 2: this may have to do with the number of attention heads in these models. For example, I get the following error when I try to load the 13b model into text-generation-inference using 16 GPUs.

ValueError: `num_heads` must be divisible by `num_shards` (got `num_heads`: 40 and `num_shards`: 16)
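
As a quick illustration of the divisibility constraint in that error message: Llama 2 13B has 40 attention heads, so only shard counts that divide 40 evenly can be used for tensor-parallel sharding (a small standalone check, not tied to any particular library):

num_heads = 40  # Llama 2 13B
for num_shards in range(2, 17):
    # 2, 4, 5, 8, and 10 shards divide 40 evenly; 11-16 do not.
    print(f"num_shards={num_shards:2d} divides num_heads evenly: {num_heads % num_shards == 0}")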

ToddMorrill avatar Sep 05 '23 02:09 ToddMorrill

import os

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import FullStateDictConfig, StateDictType
from peft import get_peft_model_state_dict

# Gather the full (unsharded) state dict, offloaded to CPU and kept on rank 0 only.
fullstate_save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(
    model, StateDictType.FULL_STATE_DICT, fullstate_save_policy
):
    # Select only the PEFT (LoRA) parameters from the full state dict.
    cpu_state = get_peft_model_state_dict(model, model.state_dict())

if rank == 0:
    print("Save peft model in checkpoint!")
    save_directory = os.path.join(ckpt_dir, "peft")
    os.makedirs(save_directory, exist_ok=True)
    torch.save(cpu_state, os.path.join(save_directory, "adapter_model.bin"))
    # Write adapter_config.json so the directory loads as a standard PEFT checkpoint.
    model.peft_config["default"].save_pretrained(save_directory)

Hi, I encountered a similar issue when trying to save the PEFT checkpoint. I have found that the above code can successfully save the PEFT checkpoint on my end, although it will consume more RAM. This is because it dumps all of the weights onto the CPU and then selects the trainable parameters to save.
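
For reference, a directory saved this way (adapter_model.bin plus adapter_config.json) should load like any other PEFT checkpoint. A sketch, assuming the same model_name and ckpt_dir as in the snippets above:

import os

from transformers import LlamaForCausalLM
from peft import PeftModel

# Load the frozen base model first, then attach the LoRA adapter saved above.
base_model = LlamaForCausalLM.from_pretrained(model_name)
model = PeftModel.from_pretrained(base_model, os.path.join(ckpt_dir, "peft"))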

newsbreakDuadua9 avatar Sep 06 '23 22:09 newsbreakDuadua9

Thanks so much @newsbreakDuadua9, this was a huge help, and clearly a nice solution that makes use of the FSDP saving facilities. I don't mind the CPU usage for now (I just want correctness!) though I see your point and it might be nice to be able to select which parameters you want to stream back to rank0, as opposed to recalling all parameters.

ToddMorrill avatar Sep 07 '23 03:09 ToddMorrill

Thanks, this solved my issue. I had to add the following import to my script:

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

pcardill0 avatar Oct 04 '23 17:10 pcardill0

@pcardill0 Thank you for reminding me of this missing import; I have updated my comment.

newsbreakDuadua9 avatar Oct 16 '23 21:10 newsbreakDuadua9

I was having a similar problem, but the code above was causing a timeout, so I changed it to:

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import FullStateDictConfig, StateDictType

if train_config.use_peft:
    if not train_config.enable_fsdp or rank == 0:
        print("we are about to save the PEFT modules")

    save_dir = f'{train_config.output_dir}/epoch_{epoch+1}'
    if train_config.enable_fsdp and train_config.fsdp_peft_cpu_offload_for_save:
        print("we are about to save using state_dict_type")
        # Every rank must enter this context; only rank 0 materializes the full state dict.
        full_state_dict_config = FullStateDictConfig(offload_to_cpu=False, rank0_only=True)
        with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, full_state_dict_config):
            model.save_pretrained(save_dir)
    else:
        model.save_pretrained(save_dir)

    if not train_config.enable_fsdp or rank == 0:
        print(f"PEFT modules are saved in {save_dir} directory")

kiamesdavies avatar Oct 31 '23 19:10 kiamesdavies

Note that the snippet above might not work for the 70B models.

kiamesdavies avatar Oct 31 '23 19:10 kiamesdavies

@kiamesdavies Could you specify the module generating the timeout error and provide the exact error message?

newsbreakDuadua9 avatar Oct 31 '23 20:10 newsbreakDuadua9

Hi! It seems that a solution has been provided for this issue and there has not been a follow-up conversation in a long time. I will close this issue for now; feel free to reopen it if you have any questions!

wukaixingxp avatar May 31 '24 16:05 wukaixingxp