LLaVA
[Usage] `resume_from_checkpoint` fails when finetuning in the lora settings
Describe the issue
I think the code is trying to resume from the checkpoint as if it were a full-parameter fine-tuning checkpoint.
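For context, LLaVA's llava/train/train.py triggers the resume through the standard Hugging Face Trainer flow, roughly like this (a simplified sketch, not a verbatim copy of the source):

import pathlib

# If any checkpoint-* directory already exists in the output dir,
# ask the HF Trainer to resume from the latest one.
if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
    trainer.train(resume_from_checkpoint=True)
else:
    trainer.train()

With LoRA enabled, the checkpoint found there contains adapter weights plus DeepSpeed state rather than a full model state dict, which is where the failure occurs.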
I have the same issue. Can anyone tell me how to fix it?
+1
+1
I encountered this error while resuming a checkpoint from LoRA training. It is basically caused by the old version of Transformers that LLaVA uses. Please refer to this issue: https://github.com/huggingface/peft/issues/746.
As described in that issue, the key names in a checkpoint saved via DeepSpeed do not match those saved via Transformers: an extra ".default." segment is added to each key of the non-trainable parameters, which leads to errors when loading the checkpoint.
Here is a solution that I found. I have only tested it for LoRA training, where it works well; I haven't tested other features, so it may introduce further errors:
- This mismatch has been fixed in newer releases of Transformers, so upgrade the Transformers package:
pip install transformers==4.39.3
- Then update Accelerate to a version compatible with that Transformers release:
pip install accelerate==0.27.2
Again, this has so far only worked for me with LoRA training; I'm not sure whether it introduces other errors.
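If you want to confirm that a checkpoint is affected before upgrading, one quick check is to look for the extra ".default." segment in the saved keys. A minimal diagnostic sketch, assuming a ZeRO-1/2 style DeepSpeed layout (the checkpoint path and the "module" entry are illustrative assumptions; adjust them to your run):

import torch

# Illustrative path to a DeepSpeed model-states file inside a checkpoint dir.
state = torch.load(
    "checkpoints/checkpoint-1000/global_step1000/mp_rank_00_model_states.pt",
    map_location="cpu",
)
# Count keys carrying the spurious ".default." segment described above.
bad_keys = [k for k in state["module"] if ".default." in k]
print(f"{len(bad_keys)} keys contain '.default.', e.g. {bad_keys[:3]}")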
But I ran into some problems with pip when doing this. How can I solve it?
Just ignore it.
On my end, this compatibility issue only causes errors during testing. Therefore, I maintain two separate conda environments: one for training (with transformers==4.39.3) and one for testing (with transformers==4.37.1). While this setup may seem redundant, it is a quick way to sidestep the problem.
Thanks! This solved my issue. I had been struggling to save and load the LoRA checkpoints for a while.
I fixed this bug by modifying site-packages/deepspeed/runtime/engine.py: at line 2675, set load_module_strict=False.
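If you would rather not edit the installed package, the same effect can probably be had by wrapping DeepSpeedEngine.load_checkpoint before training so that load_module_strict defaults to False. A minimal sketch (load_checkpoint and its load_module_strict argument are part of DeepSpeed's public API; the wrapper itself is my own workaround):

from deepspeed.runtime.engine import DeepSpeedEngine

_orig_load_checkpoint = DeepSpeedEngine.load_checkpoint

def _load_checkpoint_non_strict(self, *args, **kwargs):
    # Load the module non-strictly so the ".default." key mismatch is tolerated.
    kwargs["load_module_strict"] = False
    return _orig_load_checkpoint(self, *args, **kwargs)

DeepSpeedEngine.load_checkpoint = _load_checkpoint_non_strict

Note that non-strict loading silently skips mismatched keys, so it is worth checking afterwards that the LoRA weights were actually restored.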
I am afraid that non_lora_trainables.bin will not be loaded just by setting trainer.train(resume_from_checkpoint=True), because non_lora_trainables.bin is a file name specific to LLaVA and is outside the scope of the Hugging Face APIs. Could anyone clarify this point?
Added: it seems that non_lora_trainables.bin is not even saved at intermediate saving steps (every args.save_steps iterations); it is saved only when the whole training schedule has finished. In any case, I am afraid that non_lora_trainables.bin will not be loaded via Hugging Face APIs, including through other approaches such as the one in #1027. Maybe we have to insert code that loads non_lora_trainables.bin in llava/train/train.py, just as is done, for example, in llava/eval/model_vqa.py. I would appreciate comments if I am misunderstanding.
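For reference, the evaluation side loads this file in llava/model/builder.py (which llava/eval/model_vqa.py goes through) with roughly the pattern below; adapting it inside train.py before resuming is one possible route. This is a sketch of the idea, with model_path and model assumed to be in scope, not a drop-in patch:

import os
import torch

# model_path points at the LoRA checkpoint directory; model is the base
# model with the LoRA adapter already attached.
non_lora_trainables = torch.load(
    os.path.join(model_path, "non_lora_trainables.bin"), map_location="cpu"
)
# Strip the "base_model." (and, if present, "model.") prefixes added at save
# time so the keys match the current module tree.
non_lora_trainables = {
    (k[11:] if k.startswith("base_model.") else k): v
    for k, v in non_lora_trainables.items()
}
if any(k.startswith("model.model.") for k in non_lora_trainables):
    non_lora_trainables = {
        (k[6:] if k.startswith("model.") else k): v
        for k, v in non_lora_trainables.items()
    }
model.load_state_dict(non_lora_trainables, strict=False)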