
Why add the following code snippet since it would bring some errors?

Open s1530129650 opened this issue 2 years ago • 20 comments

Why was the following code snippet added in this commit, since it causes some errors? https://github.com/tloen/alpaca-lora/blob/main/finetune.py#L256-L261

    old_state_dict = model.state_dict
    model.state_dict = (
        lambda self, *_, **__: get_peft_model_state_dict(
            self, old_state_dict()
        )
    ).__get__(model, type(model))

Error 1: Maximum recursion depth exceeded (#37)
Error 2: model.save_pretrained() produced a corrupted adapter_model.bin of only 443 B (alpaca-lora#286)

s1530129650 avatar Apr 11 '23 15:04 s1530129650

That code is a binding script that basically just says "Any time you try to get the model's state dict, get the peft_model's state dict instead".

I presume it's there due to some issue with the trainer not loading peft properly.
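For illustration, here is a minimal, self-contained sketch of what that rebinding does (a toy class stands in for the model, and a simple key filter stands in for get_peft_model_state_dict; none of this is the real alpaca-lora code):

    # Toy illustration of the state_dict rebinding trick; the class and the
    # key filter are stand-ins, not the real model or PEFT code.
    class ToyModel:
        def state_dict(self):
            return {"base.weight": 1.0, "lora_A.weight": 2.0}

    model = ToyModel()

    old_state_dict = model.state_dict  # keep a reference to the original bound method
    model.state_dict = (
        # __get__ turns the lambda into a method bound to this instance, so any
        # later call to model.state_dict() goes through it instead.
        lambda self, *_, **__: {
            k: v for k, v in old_state_dict().items() if "lora_" in k
        }
    ).__get__(model, type(model))

    print(model.state_dict())  # {'lora_A.weight': 2.0} -- only the adapter-style keys remain

In finetune.py the filter is get_peft_model_state_dict, which is presumably why, with the patch in place, only the LoRA part gets written to the checkpoints.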

ElleLeonne avatar Apr 11 '23 16:04 ElleLeonne

Ran into the same issue in my most recent training run: adapter_model.bin was only 443 B. Commenting out that section of code fixed the issue during training for me.

I dug into the git history, but it was not clear to me why it was added; removing it seems to fix finetune.py.

richardklafter avatar Apr 11 '23 17:04 richardklafter

@tloen Hi Eric, do you have any considerations or motivations for adding the above code snippet?

s1530129650 avatar Apr 12 '23 07:04 s1530129650

It's interesting: my alpaca run produced a 36 MB file and had really good results. Then, when I merged it and tried to finetune my own custom dataset, the model refused to improve, and my adapter_model.bin was down in the byte range.

Maybe the peft update broke this, too? I'll try to verify whether this is still necessary or not.

ElleLeonne avatar Apr 12 '23 12:04 ElleLeonne

It's interesting: my alpaca run produced a 36 MB file and had really good results. Then, when I merged it and tried to finetune my own custom dataset, the model refused to improve, and my adapter_model.bin was down in the byte range.

Maybe the peft update broke this, too? I'll try to verify whether this is still necessary or not.

+1. The performance is also poor when fine-tuning llama-7b with LoRA on alpaca_data.json. I think it is also caused by the peft update.

My results are the same as https://github.com/tloen/alpaca-lora/issues/326.

FYI, I ran it on 8x V100-32G and installed packages via:

pip uninstall peft -y
pip install git+https://github.com/huggingface/peft.git@e536616888d51b453ed354a6f1e243fecb02ea08
pip uninstall  transformers -y
pip install git+https://github.com/huggingface/transformers.git@fe1f5a639d93c9272856c670cff3b0e1a10d5b2b

s1530129650 avatar Apr 12 '23 12:04 s1530129650

Update on the training loss: I fine-tuned llama-7b with LoRA on alpaca_data.json three times. The training losses are shown below.

[Training loss curves from the three runs]

s1530129650 avatar Apr 12 '23 12:04 s1530129650

Trying to remove the script causes CUDA crashes on my end during training

ElleLeonne avatar Apr 12 '23 12:04 ElleLeonne

Trying to remove the script causes CUDA crashes on my end during training

Do you mean commenting out/removing the following code?

    old_state_dict = model.state_dict
    model.state_dict = (
        lambda self, *_, **__: get_peft_model_state_dict(
            self, old_state_dict()
        )
    ).__get__(model, type(model))

I am running some experiments with them commented out.

s1530129650 avatar Apr 12 '23 12:04 s1530129650

Yes, it's a binding script.

ElleLeonne avatar Apr 12 '23 12:04 ElleLeonne

Yes, it's a binding script.

Thanks!

I also saw some issues (https://github.com/tloen/alpaca-lora/issues/288, https://github.com/tloen/alpaca-lora/issues/170) discussing whether the GPU type affects the training process.

The root cause may be the peft update rather than the GPU type; I think it is worth verifying.
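One quick thing to check before comparing runs is which versions of peft and transformers are actually installed in the environment. A minimal sketch:

    # Minimal sketch: print the installed versions of the packages discussed in this thread.
    from importlib.metadata import PackageNotFoundError, version

    for pkg in ("peft", "transformers", "torch", "bitsandbytes"):
        try:
            print(f"{pkg}=={version(pkg)}")
        except PackageNotFoundError:
            print(f"{pkg} is not installed")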

s1530129650 avatar Apr 12 '23 13:04 s1530129650

Trying to remove the script causes CUDA crashes on my end during training

Do you mean commenting out/removing the following code?

    old_state_dict = model.state_dict
    model.state_dict = (
        lambda self, *_, **__: get_peft_model_state_dict(
            self, old_state_dict()
        )
    ).__get__(model, type(model))

I am running some experiments with them commented out.

Hi, I commented them out and model.save_pretrained() successfully saved adapter_model.bin. But at each eval, the code saved the complete model (including the frozen part, ~6.58 GB). Before commenting them out, the code only saved the LoRA part.

FYI:

peft == 0.3.0.dev0 
transformers == 4.28.0.dev0
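For reference, one way to check whether a saved checkpoint holds only the LoRA tensors or the full model is to look at its size and keys (the path below is just an example; point it at your own output_dir):

    import os

    import torch

    # Example path only; adjust to your own checkpoint location.
    ckpt_path = "lora-alpaca/adapter_model.bin"

    print(f"file size: {os.path.getsize(ckpt_path) / 1e6:.2f} MB")

    # A LoRA-only checkpoint should contain just lora_A / lora_B tensors;
    # if base-model weights show up here, the full model was saved.
    state_dict = torch.load(ckpt_path, map_location="cpu")
    for name, tensor in state_dict.items():
        print(name, tuple(tensor.shape))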

Awenbocc avatar Apr 12 '23 13:04 Awenbocc

Thanks for sharing! I think one quick solution is to pin peft and transformers to specific versions. For example,

pip install git+https://github.com/huggingface/peft.git@xxxxxxx
pip install git+https://github.com/huggingface/transformers.git@xxxx

We can try different combinations of package versions.

Trying to remove the script causes CUDA crashes on my end during training

Do you mean commenting out/removing the following code?

    old_state_dict = model.state_dict
    model.state_dict = (
        lambda self, *_, **__: get_peft_model_state_dict(
            self, old_state_dict()
        )
    ).__get__(model, type(model))

I am running some experiments with them commented out.

Hi, I commented them out and model.save_pretrained() successfully saved adapter_model.bin. But at each eval, the code saved the complete model (including the frozen part, ~6.58 GB). Before commenting them out, the code only saved the LoRA part.

FYI:

peft == 0.3.0.dev0 
transformers == 4.28.0.dev0

s1530129650 avatar Apr 12 '23 14:04 s1530129650

Installing old versions is not a permanent fix, especially with how new PEFT is. We're gonna want the new features

ElleLeonne avatar Apr 12 '23 14:04 ElleLeonne

I commented out the following code and fine-tuned llama-7b with LoRA on alpaca_data.json on 8x V100. The training loss still dropped to 0 at iteration 560.

Could it be a problem with multi-GPU runs? Is it incorrect to run the program with torchrun?

finetune.py

...
    # old_state_dict = model.state_dict
    # model.state_dict = (
    #     lambda self, *_, **__: get_peft_model_state_dict(
    #         self, old_state_dict()
    #     )
    # ).__get__(model, type(model))

    # if torch.__version__ >= "2" and sys.platform != "win32":
    #     model = torch.compile(model)
....

Training script

WORLD_SIZE=${number_card}  torchrun --nproc_per_node=${number_card} --master_port=13214  finetune.py \
    --base_model='decapoda-research/llama-7b-hf' \
    --num_epochs=10 \
    --data_path ${data_filename} \
    --cutoff_len=512 \
    --output_dir=${output_dir} \
    --lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
    --lora_r=16 \
    --micro_batch_size=8 \
    --wandb_project "Alpaca-Lora" 2>&1 |tee ${output_dir}/finetune.log
    #   --wandb_project "Alpaca-Lora-SE"

Training loss

[Training loss curve from this run]

s1530129650 avatar Apr 12 '23 14:04 s1530129650

It's interesting: my alpaca run produced a 36 MB file and had really good results. Then, when I merged it and tried to finetune my own custom dataset, the model refused to improve, and my adapter_model.bin was down in the byte range. Maybe the peft update broke this, too? I'll try to verify whether this is still necessary or not.

+1. The performance is also poor when fine-tuning llama-7b with LoRA on alpaca_data.json. I think it is also caused by the peft update.

My results are the same as #326.

FYI, I ran it on 8x V100-32G and installed packages via:

pip uninstall peft -y
pip install git+https://github.com/huggingface/peft.git@e536616888d51b453ed354a6f1e243fecb02ea08
pip uninstall  transformers -y
pip install git+https://github.com/huggingface/transformers.git@fe1f5a639d93c9272856c670cff3b0e1a10d5b2b

@s1530129650 Hello, I have the same problem. Which versions of peft and transformers should be used to solve the problem?

LutaoChu avatar Apr 13 '23 10:04 LutaoChu

@s1530129650 Hello, I have the same problem. Which versions of peft and transformers should be used to solve the problem?

I don't know yet.

s1530129650 avatar Apr 13 '23 11:04 s1530129650

Trying to remove the script causes CUDA crashes on my end during training

Do you mean commenting out/removing the following code?

    old_state_dict = model.state_dict
    model.state_dict = (
        lambda self, *_, **__: get_peft_model_state_dict(
            self, old_state_dict()
        )
    ).__get__(model, type(model))

I am running some experiments with them commented out.

Hi, I commented them out and model.save_pretrained() successfully saved adapter_model.bin. But at each eval, the code saved the complete model (including the frozen part, ~6.58 GB). Before commenting them out, the code only saved the LoRA part.

FYI:

peft == 0.3.0.dev0 
transformers == 4.28.0.dev0

I am running into the same problem.

dawnranger avatar Apr 13 '23 12:04 dawnranger

Hello, the correct way to save the intermediate checkpoints for PEFT when using Trainer would be to use Callbacks. An example is shown here: https://github.com/huggingface/peft/blob/main/examples/int8_training/peft_bnb_whisper_large_v2_training.ipynb

import os

from transformers import Seq2SeqTrainer, TrainerCallback, TrainingArguments, TrainerState, TrainerControl
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR


class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control


trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    # compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

This should solve the below issue:

Hi, I commented them out and model.save_pretrained() successfully saved adapter_model.bin. But at each eval, the code saved the complete model (including the frozen part, ~6.58 GB). Before commenting them out, the code only saved the LoRA part.
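Adapted to this repo, a rough sketch of wiring the callback into finetune.py's Trainer could look like the following (variable names such as model, train_data, val_data, training_args, and data_collator are assumed to come from finetune.py and may differ):

    import transformers

    # Sketch only: SavePeftModelCallback is the class defined above, and it
    # replaces the state_dict monkey-patch as the way to keep adapter-only checkpoints.
    trainer = transformers.Trainer(
        model=model,                  # the PEFT-wrapped LLaMA model
        train_dataset=train_data,     # assumed names from finetune.py
        eval_dataset=val_data,
        args=training_args,
        data_collator=data_collator,
        callbacks=[SavePeftModelCallback],
    )
    model.config.use_cache = False
    trainer.train()

    # Save the final adapter (LoRA-only) weights at the end of training.
    model.save_pretrained(training_args.output_dir)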

pacman100 avatar Apr 18 '23 07:04 pacman100

Is this a real problem or not? Why does this code still exist in finetune.py?

extrame avatar Jun 12 '23 08:06 extrame

Hello, the correct way to save the intermediate checkpoints for PEFT when using Trainer would be to use Callbacks. An example is shown here: https://github.com/huggingface/peft/blob/main/examples/int8_training/peft_bnb_whisper_large_v2_training.ipynb

import os

from transformers import Seq2SeqTrainer, TrainerCallback, TrainingArguments, TrainerState, TrainerControl
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR


class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control


trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    # compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

This should solve the below issue:

Hi, I commented them out and model.save_pretrained() successfully saved adapter_model.bin. But at each eval, the code saved the complete model (including the frozen part, ~6.58 GB). Before commenting them out, the code only saved the LoRA part.

Elegant solution!

Comment out the

    old_state_dict = model.state_dict
    model.state_dict = (
        lambda self, *_, **__: get_peft_model_state_dict(
            self, old_state_dict()
        )
    ).__get__(model, type(model))

and use the Callback to save the PEFT params.

mzhaoshuai avatar Oct 26 '23 06:10 mzhaoshuai