Corrupted model when saving checkpoints
System Info
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.25.0
[pip3] torch==2.0.1+cu118
[conda] cudatoolkit 11.8.0 h6a678d5_0
[conda] numpy 1.25.0 pypi_0 pypi
[conda] torch 2.0.1+cu118 pypi_0 pypi
Information
- [ ] The official example scripts
- [ ] My own modified scripts
🐛 Describe the bug
I have fine-tuned Llama 2 on some datasets for several epochs using LoRA, and I modified the script to save a checkpoint after each epoch with:
model.save_pretrained(checkpoint_dir)
To evaluate my model, I wanted to load each checkpoint and run inference with it. I tried the Hugging Face PEFT library to load the model and the adapter:
model = PeftModel.from_pretrained(model_dir, lora_apply_dir)
as well as torch.load(lora_apply_dir + "adapter_model.bin").
For the majority of the checkpoints (not all, some load fine), I get this error:
RuntimeError: PytorchStreamReader failed reading zip archive: invalid header or archive is corrupted
The .bin files created for each checkpoint have exactly the same size and seem complete, so I don't understand why loading works correctly for some checkpoints and not for others...
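For reference, one quick way to tell the intact checkpoints from the corrupted ones without going through PEFT is to check whether each adapter_model.bin is still a valid zip archive, since torch.save uses a zip-based format by default. A minimal sketch (the checkpoint directories below are placeholders for my actual layout):

import os
import zipfile

# Placeholder paths: point these at the per-epoch checkpoint directories.
checkpoint_dirs = ["output/epoch_1", "output/epoch_2", "output/epoch_3"]

for ckpt in checkpoint_dirs:
    path = os.path.join(ckpt, "adapter_model.bin")
    size = os.path.getsize(path)
    # is_zipfile() looks for the zip end-of-archive record, so a truncated or
    # corrupted file will typically report False here.
    print(f"{path}: {size} bytes, valid zip: {zipfile.is_zipfile(path)}")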
Error logs
File [~/miniconda3/envs/megatron/lib/python3.10/site-packages/peft/peft_model.py:203] in PeftModel.from_pretrained(cls, model, model_id, adapter_name, is_trainable, **kwargs)
201 else:
202 model = MODEL_TYPE_TO_PEFT_MODEL_MAPPING[config.task_type](model, config, adapter_name)
--> 203 model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs)
204 return model
...
File [~/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/serialization.py:283] in _open_zipfile_reader.__init__(self, name_or_buffer)
282 def __init__(self, name_or_buffer) -> None:
--> 283 super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: invalid header or archive is corrupted
Expected behavior
Load the adapter correctly
@clechristophe A couple of notes: first, make sure that in your save_dir you see adapter_config.json and adapter_model.bin
after fine-tuning. Also, make sure you first load the base model separately and then use:
model = LlamaForCausalLM.from_pretrained(model_name)
model = PeftModel.from_pretrained(model, lora_apply_dir)
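For reference, a fuller (untested) sketch of that loading flow, including the imports it assumes; the model name and checkpoint path below are placeholders:

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

model_name = "meta-llama/Llama-2-7b-hf"  # base model used for fine-tuning (placeholder)
lora_apply_dir = "output/epoch_1"        # directory with adapter_config.json / adapter_model.bin

tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, lora_apply_dir)
model.eval()

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))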
I've got the same issue. Has anyone been able to pin down the root cause of this? I had no issues with saving/reloading PEFT models for the 7b chat model but am unable to reload PEFT models that I saved for the 13b chat model.
I also noticed that HF must be internalizing the logic for which rank is responsible for saving PEFT checkpoints (we call save_pretrained
on all workers, not just rank 0). Is there something in there that's causing issues?
Quick update on this: I'm seeing no issues when I train on up to and including 10 GPUs. As soon as I bump up to 11 GPUs or higher, I see the issue reported above. This threshold was consistent across a few different configurations I tried, varying the model (e.g. 7b vs. 13b, and even GPT-2 as a sanity check). It makes me wonder whether the workers are stepping on each other's toes when saving, or whether there is some issue with the divisibility of the parameter count by the number of workers.
Please let me know if you have any thoughts on this. I've got a 16-GPU machine on Google Cloud, I'm in the middle of a rather large experiment, and it's unfortunate to be idling 6 of these GPUs for my PEFT runs.
Edit 2: this may have to do with the number of attention heads in these models. For example, I get the following error when I try to load the 13b model into text-generation-inference using 16 GPUs.
ValueError: `num_heads` must be divisible by `num_shards` (got `num_heads`: 40 and `num_shards`: 16
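If that is indeed the constraint, here is a quick sketch of the arithmetic behind that error: the 13b model has 40 attention heads (the 7b has 32), so only shard counts that divide the head count evenly are accepted by text-generation-inference:

# The 13b model has 40 attention heads (the 7b has 32); text-generation-inference
# only accepts shard counts that divide the head count evenly.
num_heads = 40
valid_shard_counts = [n for n in range(1, 17) if num_heads % n == 0]
print(valid_shard_counts)  # [1, 2, 4, 5, 8, 10]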
import os
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import FullStateDictConfig, StateDictType
from peft import get_peft_model_state_dict

# Gather the full state dict onto rank 0 (offloaded to CPU), then keep only
# the trainable PEFT parameters.
fullstate_save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(
    model, StateDictType.FULL_STATE_DICT, fullstate_save_policy
):
    cpu_state = get_peft_model_state_dict(model, model.state_dict())

if rank == 0:
    print("Save peft model in checkpoint!")
    save_directory = os.path.join(ckpt_dir, "peft")
    os.makedirs(save_directory, exist_ok=True)
    torch.save(cpu_state, os.path.join(save_directory, "adapter_model.bin"))
    model.peft_config["default"].save_pretrained(save_directory)
Hi, I encountered a similar issue when trying to save the PEFT checkpoint. I have found that the above code successfully saves the PEFT checkpoint on my end, although it consumes more RAM: it dumps all the weights onto the CPU and then selects the trainable parameters to save.
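For completeness, one way the weights saved by the snippet above could be loaded back (untested; the path and the peft_model variable are placeholders). Since adapter_config.json is written alongside adapter_model.bin, PeftModel.from_pretrained(base_model, save_directory) should also work:

import torch
from peft import set_peft_model_state_dict

# peft_model: an already PEFT-wrapped base model; the path is a placeholder.
adapter_weights = torch.load("ckpt_dir/peft/adapter_model.bin", map_location="cpu")
set_peft_model_state_dict(peft_model, adapter_weights)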
Thanks so much @newsbreakDuadua9, this was a huge help, and clearly a nice solution that makes use of the FSDP saving facilities. I don't mind the CPU usage for now (I just want correctness!), though I see your point: it might be nice to be able to select which parameters to stream back to rank 0, as opposed to gathering all of them.
Thanks, this solved my issue. I had to add the following to my script:
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
@pcardill0 Thank you for reminding me of this missing import; I have updated my comment.
I was having a similar problem, but the above code was causing a timeout, so I changed it to:
if train_config.use_peft:
    if not train_config.enable_fsdp or rank == 0:
        print(f"we are about to save the PEFT modules")
    save_dir = f'{train_config.output_dir}/epoch_{epoch+1}'
    if train_config.enable_fsdp and train_config.fsdp_peft_cpu_offload_for_save:
        print(f"we are about to save using state_dict_type")
        full_state_dict_config = FullStateDictConfig(offload_to_cpu=False, rank0_only=True)
        with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, full_state_dict_config):
            model.save_pretrained(save_dir)
    else:
        model.save_pretrained(save_dir)
    if not train_config.enable_fsdp or rank == 0:
        print(f"PEFT modules are saved in {save_dir} directory")
This might not work for the 70B models
@kiamesdavies Could you specify the module generating the timeout error and provide the exact error message?
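In case the timeout comes from the other ranks waiting at the collective while rank 0 offloads and writes the full state dict, one thing that might help (a sketch I have not verified on the larger models) is to raise the process-group timeout when initializing distributed training:

from datetime import timedelta
import torch.distributed as dist

# Give the save-time collectives more headroom before NCCL times out.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))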
Hi! It seems that a solution has been provided for this issue and there has not been a follow-up conversation for a long time. I will close this issue for now; feel free to reopen it if you have any questions!