axolotl
RuntimeError: PytorchStreamReader failed reading file data/0: invalid header or archive is corrupted
Please check that this issue hasn't been reported before.
- [X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
Generate a correct LoRA adapter after training completes
Current behaviour
Running the following command to merge the model produces an error:
python3 -m axolotl.cli.merge_lora sft_34b.yml --lora_model_dir="/workspace/axolotl/output/Yi-34B/ljf-yi-34b-lora" --output_dir=/data1/ljf2/data-check-test
Steps to reproduce
The save_safetensors: true parameter does not take effect: training actually generated adapter_model.bin instead of adapter_model.safetensors.
Config yaml
No response
Possible solution
No response
Which Operating Systems are you using?
- [X] Linux
- [ ] macOS
- [ ] Windows
Python Version
3.10
axolotl branch-commit
main
Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
Is the issue that the merge doesn't work, or that specifying save_safetensors produces a pytorch_model.bin, or both?
Specifying save_safetensors produces a pytorch_model.bin.
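For reference, the behavior being questioned boils down to a single flag in the training config. A minimal sketch (the filename comment states the expected behavior, which is exactly what this report says does not happen):

```yaml
# In the training config (e.g. sft_34b.yml):
# expected to save the adapter as adapter_model.safetensors,
# but the reporter observes adapter_model.bin instead
save_safetensors: true
```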
Using the following command to merge models, there is an error message:
Hey, it seems like the PeftModel loading failed. Can you check that the files in lora_model_dir are valid (i.e., aren't just a few KBs)?
Did you run out of space during training?
Since the training had a few checkpoints, could you try pointing the model_dir to one of those and see what happens?
The save_safetensors: true parameter does not take effect
I'm a bit confused by this, as you said the merge failed.
Now I am resuming SFT training from a checkpoint and getting this error again.
I have configured these parameters: use_reentrant: true and resume_from_checkpoint: /workspace/axolotl-main/checkpoint-5865
resuming from a "peft checkpoint" is not the same as resuming from a regular checkpoint. You'll want to set lora_model_dir
to point to the checkpoint directory iirc. @NanoCode012 does that sound right?
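As a sketch of that suggestion (the checkpoint path comes from this thread; treat the exact behavior as an assumption pending confirmation):

```yaml
# Resume LoRA training by loading the adapter from the saved checkpoint:
# point lora_model_dir at the checkpoint directory rather than using
# resume_from_checkpoint, which expects a regular (non-PEFT) checkpoint.
lora_model_dir: /workspace/axolotl-main/checkpoint-5865
```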