LLaVA
[Usage] `resume_from_checkpoint` fails when finetuning in the lora settings
Describe the issue
I think the code is trying to resume from the checkpoint as if it were a full-parameter fine-tuning checkpoint.
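For context, LLaVA's llava/train/train.py triggers the resume through the standard Hugging Face Trainer flow, roughly like this (a simplified sketch, not a verbatim copy of the source):

import pathlib

# If any checkpoint-* directory already exists in the output dir,
# ask the HF Trainer to resume from the latest one.
if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
    trainer.train(resume_from_checkpoint=True)
else:
    trainer.train()

With LoRA enabled, the checkpoint found there contains adapter weights plus DeepSpeed state rather than a full model state dict, which is where the failure occurs.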
I have the same issue. Can anyone tell me how to fix it?
+1
+1
I encountered this error while resuming a checkpoint from LoRA training. It is basically caused by the old version of Transformers that LLaVA uses. Please refer to this issue: https://github.com/huggingface/peft/issues/746.
As described in that issue, the key names in a checkpoint saved via DeepSpeed do not match those saved via Transformers: an extra ".default." segment is added to each key of the non-trainable parameters, which leads to errors when loading the checkpoint.
Here is a solution that I found. I have only tested it for LoRA training, where it works well; I haven't tested other features, so it may introduce further errors:
- This mismatch has been fixed in newer releases of Transformers, so upgrade the Transformers package:
pip install transformers==4.39.3
- Then update Accelerate to a version compatible with that Transformers release:
pip install accelerate==0.27.2
Again, this has so far only worked for me with LoRA training; I'm not sure whether it introduces other errors.
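If you want to confirm that a checkpoint is affected before upgrading, one quick check is to look for the extra ".default." segment in the saved keys. A minimal diagnostic sketch, assuming a ZeRO-1/2 style DeepSpeed layout (the checkpoint path and the "module" entry are illustrative assumptions; adjust them to your run):

import torch

# Illustrative path to a DeepSpeed model-states file inside a checkpoint dir.
state = torch.load(
    "checkpoints/checkpoint-1000/global_step1000/mp_rank_00_model_states.pt",
    map_location="cpu",
)
# Count keys carrying the spurious ".default." segment described above.
bad_keys = [k for k in state["module"] if ".default." in k]
print(f"{len(bad_keys)} keys contain '.default.', e.g. {bad_keys[:3]}")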
But I ran into some problems with pip when doing this. How can I solve it?
Just ignore it.
On my end, this compatibility issue only causes errors during testing. Therefore, I maintain two separate conda environments: one for training (with transformers==4.39.3) and one for testing (with transformers==4.37.1). While this setup may seem redundant, it is a quick way to sidestep the problem.
Thanks! This solved my issue. I had been struggling to save and load the LoRA checkpoints for a while.
I fixed this bug by modifying site-packages/deepspeed/runtime/engine.py: at line 2675, set load_module_strict=False.
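If you would rather not edit the installed package, the same effect can probably be had by wrapping DeepSpeedEngine.load_checkpoint before training so that load_module_strict defaults to False. A minimal sketch (load_checkpoint and its load_module_strict argument are part of DeepSpeed's public API; the wrapper itself is my own workaround):

from deepspeed.runtime.engine import DeepSpeedEngine

_orig_load_checkpoint = DeepSpeedEngine.load_checkpoint

def _load_checkpoint_non_strict(self, *args, **kwargs):
    # Load the module non-strictly so the ".default." key mismatch is tolerated.
    kwargs["load_module_strict"] = False
    return _orig_load_checkpoint(self, *args, **kwargs)

DeepSpeedEngine.load_checkpoint = _load_checkpoint_non_strict

Note that non-strict loading silently skips mismatched keys, so it is worth checking afterwards that the LoRA weights were actually restored.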
I am afraid that non_lora_trainables.bin will not be loaded just by setting trainer.train(resume_from_checkpoint=True), because non_lora_trainables.bin is a file name specific to LLaVA and is outside the scope of the Hugging Face APIs. Could anyone clarify this point?
Added: it seems that non_lora_trainables.bin is not even saved at intermediate saving steps (every args.save_steps iterations); it is saved only when the whole training schedule has finished. In any case, I am afraid that non_lora_trainables.bin will not be loaded via Hugging Face APIs, including through other approaches such as the one in #1027. Maybe we have to insert code that loads non_lora_trainables.bin in llava/train/train.py, just as is done, for example, in llava/eval/model_vqa.py. I would appreciate comments if I am misunderstanding.
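For reference, the evaluation side loads this file in llava/model/builder.py (which llava/eval/model_vqa.py goes through) with roughly the pattern below; adapting it inside train.py before resuming is one possible route. This is a sketch of the idea, with model_path and model assumed to be in scope, not a drop-in patch:

import os
import torch

# model_path points at the LoRA checkpoint directory; model is the base
# model with the LoRA adapter already attached.
non_lora_trainables = torch.load(
    os.path.join(model_path, "non_lora_trainables.bin"), map_location="cpu"
)
# Strip the "base_model." (and, if present, "model.") prefixes added at save
# time so the keys match the current module tree.
non_lora_trainables = {
    (k[11:] if k.startswith("base_model.") else k): v
    for k, v in non_lora_trainables.items()
}
if any(k.startswith("model.model.") for k in non_lora_trainables):
    non_lora_trainables = {
        (k[6:] if k.startswith("model.") else k): v
        for k, v in non_lora_trainables.items()
    }
model.load_state_dict(non_lora_trainables, strict=False)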