alpaca-lora
fix issues to be compatible with latest peft
What does this PR do?
- Fixes the issues https://github.com/huggingface/peft/issues/286 and https://github.com/huggingface/peft/issues/317
- Adds a Callback to be used with the HF Trainer to make sure intermediate checkpoints save only the PEFT adapters, solving https://github.com/huggingface/peft/issues/286#issuecomment-1506850018 (see the sketch after this list)
- Adds a `debug_mode` arg to quickly test out the fine-tune script on a tiny subset of the dataset
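For illustration, here is a minimal sketch of what such a checkpoint callback could look like (the class name, the `adapter_model` subfolder, and the exact hook wiring are my assumptions, not necessarily what this PR ships):

```python
import os

from transformers import TrainerCallback, TrainingArguments, TrainerState, TrainerControl
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR


class SavePeftModelCallback(TrainerCallback):
    """Sketch: keep only the PEFT adapter weights in intermediate checkpoints."""

    def on_save(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        checkpoint_dir = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        # save_pretrained on a PeftModel writes only adapter_model.bin + adapter_config.json
        kwargs["model"].save_pretrained(os.path.join(checkpoint_dir, "adapter_model"))

        # Drop the full pytorch_model.bin the Trainer just wrote, keeping the checkpoint small.
        full_weights = os.path.join(checkpoint_dir, "pytorch_model.bin")
        if os.path.exists(full_weights):
            os.remove(full_weights)
        return control
```

Registered via `trainer.add_callback(...)` or the `callbacks=` argument of `Trainer`, a callback like this runs right after each intermediate checkpoint is written.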
Your pull request isn't working. It crashed when trying to save a checkpoint; I was training on 8x RTX 3090.
@lksysML if you're training in 8-bit, there's a separate bug with the latest bitsandbytes version that causes a massive VRAM spike when saving files. You can bypass it by downgrading: `pip install bitsandbytes==0.37.2`
Yeah, I ran into that issue yesterday and already fixed it. I think today's crash has something to do with peft; I rolled back to an old version and it didn't crash. WandB report of the crash with your pull request: https://wandb.ai/lksy/huggingface/runs/hxq051bg/log
Also, debug_mode doesn't work. I set it to True and it continued with the whole training set instead of the 1024 examples it is supposed to use.
Hello, this PR works fine for me. For debug mode you have to pass the `--debug_mode` flag.
wandb: https://wandb.ai/smangrul/huggingface/runs/3cxt4cnc?workspace=user-smangrul
With bnb==0.37, the VRAM is stable.
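On the `--debug_mode` point, here's roughly the kind of subsetting I'd expect the flag to do (a sketch only; the helper name and the `load_dataset` call are my assumptions, with the 1024-example figure taken from the comment above):

```python
from datasets import load_dataset


def load_finetune_data(data_path: str, debug_mode: bool = False, debug_samples: int = 1024):
    """Sketch: optionally shrink the training split so a full run finishes in minutes."""
    data = load_dataset("json", data_files=data_path)
    if debug_mode:
        n = min(debug_samples, len(data["train"]))
        # Shuffle first so the debug slice isn't just the head of the file.
        data["train"] = data["train"].shuffle(seed=42).select(range(n))
    return data
```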
@pacman100
I'm a bit confused about the intended functionality here, in that when attempting to train a new adapter after loading the weights of the old one, I get bad results (mainly because of my small dataset size, which is a requirement for my implementation).
I was able to get good results by continuing the training of a previous adapter instead (which was the functionality that broke).
Now:
- We save the adapter model in a new folder.
- We delete pytorch_model.bin (which is the full weights, not something the original code ever needed, which makes sense; it was saving adapter_model.bin as pytorch_model.bin before).
- But the resume_from_checkpoint code is still searching for pytorch_model.bin, which can no longer exist.
I can move the adapter out of the folder and rename it, in an attempt to get the previous functionality, but then I run into the previous error. So it's not actually fixed, and the conflicting nature of this PR's code tells me that this functionality wasn't tested anyway.
I think this is a pretty big issue, because the alternative is retraining the adapter in full every single time I get new data, which is quite time- and power-consuming by comparison.
@ElleLeonne
I did test it and it does work.
Continuing from the ckpt of the above run via:
python finetune.py --base_model 'path/to/it' --data_path 'yahma/alpaca-cleaned' --output_dir './lora-alpaca-2' --debug_mode --num_epochs=3 --cutoff_len=512 --lora_target_modules="[q_proj,k_proj,v_proj,o_proj]" --lora_r=16 --micro_batch_size=8 --resume_from_checkpoint ./lora-alpaca/checkpoint-20/adapter_model/
results in it being correctly loaded:
In the above run, the eval loss at the end was 1.70. In the run here using the above command, https://wandb.ai/smangrul/huggingface/runs/18ux1bhz?workspace=user-smangrul, the eval loss starts at 1.65 and ends at 1.48.
@pacman100
I realize this isn't your repo, so I appreciate you taking the time to investigate this and help improve our little project here.
Using his method will cause the code to perform the same functionality as loading the adapter and then starting a new training run. This is because this block will never execute, now that we're removing pytorch_model.bin. (Previously, there was a binding that would save the LoRA adapter's state dict as pytorch_model.bin; it was the same file, just named differently.) I can attempt to preserve this functionality by moving the adapter to the main folder and renaming it, but then I encounter the previous error.
As a result, the trainer will literally never try to load the state of the old training session. It will instantiate a new training session, with a fresh set of warmup steps, fresh training loss calculations, and a new starting point for the adapter. If I understand the process correctly, it functions more like merging the weights into the model and starting a new training run. Even playing with the hyperparameters, dropping the batch size, killing warmup steps, and tweaking the learning rate, the new adapter always overfits like mad due to my small dataset size.
Which I guess is just to say that this PR doesn't restore the old functionality of the code; it just cuts out half of what the old code used to do. It would be really nice to be able to load the trainer's state when continuing training, which I suppose is what broke in the first place.
Er... the logic of the code there is:
- if pytorch_model.bin exists, then:
  - load the checkpoint file and use it
  - `resume_from_checkpoint` is `True`, telling `trainer.train` to load the file again (erroneously loaded twice in pre-existing code)
- if it does not exist, then:
  - set `resume_from_checkpoint` to `False` to tell `trainer.train` that it should not load any file
  - check if `adapter_model.bin` exists; if it does:
    - load the adapter_model file and use it
    - train based on that
  - else:
    - train a new model

The `torch.load` + `set_peft_model_state_dict` lines are what enable resuming a prior checkpoint (roughly the flow sketched below). The `resume_from_checkpoint` being sent to `trainer.train` is a pre-existing code mistake.
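Something like the following captures that flow (a sketch that mirrors the finetune.py pattern; the helper name `maybe_resume_adapter` is mine):

```python
import os

import torch
from peft import set_peft_model_state_dict


def maybe_resume_adapter(model, resume_from_checkpoint):
    """Sketch of the branching above; returns the value to forward to trainer.train()."""
    if resume_from_checkpoint:
        # Full checkpoint (old behaviour): pytorch_model.bin exists.
        checkpoint_name = os.path.join(resume_from_checkpoint, "pytorch_model.bin")
        if not os.path.exists(checkpoint_name):
            # Adapter-only checkpoint: load it here ourselves and tell trainer.train()
            # not to try loading a full Trainer checkpoint.
            checkpoint_name = os.path.join(resume_from_checkpoint, "adapter_model.bin")
            resume_from_checkpoint = False
        if os.path.exists(checkpoint_name):
            adapters_weights = torch.load(checkpoint_name, map_location="cpu")
            set_peft_model_state_dict(model, adapters_weights)
    return resume_from_checkpoint
```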
For reference, `train`'s documentation for the `resume_from_checkpoint` arg:
resume_from_checkpoint (`str` or `bool`, optional): If a `str`, local path to a saved checkpoint as saved by a previous instance of [`Trainer`]. If a `bool` and equals `True`, load the last checkpoint in args.output_dir as saved by a previous instance of [`Trainer`]. If present, training will resume from the model/optimizer/scheduler states loaded here.
The loading of `adapter_model.bin` (i.e. the correct file that is intended to be used) means resuming from prior saves (of `adapter_model.bin`) does in fact work.
> `resume_from_checkpoint` is `True`, telling `trainer.train` to load the file again (erroneously loaded twice in pre-existing code)
I don't think this is true. There was a very clear, marked difference in quality between the old, working version of resume_from_checkpoint and the current version of resuming from just the adapter. It wasn't just a coding mistake.
> The loading of `adapter_model.bin` (i.e. the correct file that is intended to be used) means resuming from prior saves (of `adapter_model.bin`) does in fact work.
I know it works. My point was that this is only half the required solution. I was able to get this solution working myself without the PR. The results are just bad.
See the original PR for the update in question, too: https://github.com/tloen/alpaca-lora/pull/154#discussion_r1147950135
@ElleLeonne,
this gist https://gist.github.com/pacman100/8e7a6eedabf34e1a88dd74a96c3b619f should exhibit the behaviour that you are looking for.
But it doesn't make much sense to me; could you provide a concrete example of how the existing code with the previous peft was being used and how this PR fails to do that, with some concrete metrics?
- Due to linear decay, the final ckpt of the existing LoRA run will have a learning rate close to 0, so what you were doing earlier was essentially training with a very low lr (see the quick check below).
- If you were adding the new small dataset to the initial dataset to increase the number of steps, and loading the previous ckpt for continuation, then the combined data would have been randomly shuffled, and the steps up to the previous ckpt would have been skipped; those skipped steps would most likely contain many samples from the new small dataset.
Both of the above points describe a weird way of continuing training.
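To make the first point concrete, here is a quick check (illustrative numbers, not taken from this thread) that a linear schedule ends at essentially zero lr:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# 1000 optimizer steps, 100 of warmup, peak lr 3e-4 (made-up numbers for illustration).
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=3e-4)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=100, num_training_steps=1000)

for _ in range(999):
    optimizer.step()
    scheduler.step()

# By the last step the lr has decayed to roughly 3e-7, so "continuing" from the
# final checkpoint (optimizer/scheduler state included) trains at almost no lr.
print(scheduler.get_last_lr())
```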
I appreciate that. I'll play around with it and see how everything works out.
The idea was basically to introduce the new data as a new "epoch", which helped include the relevant data without needing to fully retrain the adapter every night (which takes about 7 hours for me). That lets me put off a full adapter retrain for a longer period of time while still showing noticeable improvements day-to-day. It also fixed a major problem where sometimes training would fail in the middle of the night and I'd need to use the old adapter until I could fix the problem.
I would increase the epochs from (let's say) 5 to 10, so the learning rate decay would still exist, but the learning rate wouldn't go all the way to 0.
I don't know if this is the RIGHT way, but this simple modification at L275 produces an adapter_model.bin with the right size:
- model.save_pretrained(output_dir)
+ model.save_pretrained(output_dir, state_dict=old_state_dict())
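For context, the `old_state_dict()` in the diff refers to the state_dict monkey-patch used in the stock finetune.py; roughly (a sketch, with the wrapper function being my own framing):

```python
from peft import PeftModel, get_peft_model_state_dict


def patch_state_dict_for_adapter_saves(model: PeftModel):
    """Sketch: keep a handle to the original state_dict, then patch the model so
    that anything calling model.state_dict() only sees the adapter weights."""
    old_state_dict = model.state_dict
    model.state_dict = (
        lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
    ).__get__(model, type(model))
    return old_state_dict


# Usage (paraphrasing finetune.py):
#   old_state_dict = patch_state_dict_for_adapter_saves(model)
#   ... trainer.train(...) ...
#   model.save_pretrained(output_dir, state_dict=old_state_dict())
#
# Passing the unpatched full state dict lets save_pretrained extract the adapter
# weights itself, instead of re-filtering an already-filtered dict, which seems
# to be what was producing the undersized adapter_model.bin.
```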