Qwen-VL
[BUG] ValueError: Cannot merge LORA layers when the model is gptq quantized | When merging a LORA-finetuned Qwen-VL-Chat-Int4
是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?
- [X] 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions
该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?
- [X] 我已经搜索过FAQ | I have searched FAQ
当前行为 | Current Behavior
I finetuned "Qwen-VL-Chat-Int4" with LoRA using the following command:
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`
MODEL="Qwen/Qwen-VL-Chat-Int4" # See the section for finetuning in README for more information.
DATA="/content/training_dataset.json"
export CUDA_VISIBLE_DEVICES=0
python finetune.py \
--model_name_or_path $MODEL \
--data_path $DATA \
--bf16 True \
--fix_vit True \
--output_dir output_qwen_v3 \
--num_train_epochs 5 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 10 \
--learning_rate 1e-5 \
--weight_decay 0.1 \
--adam_beta2 0.95 \
--warmup_ratio 0.01 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "none" \
--model_max_length 2048 \
--lazy_preprocess True \
--gradient_checkpointing \
--use_lora
And when trying to merge the model using the following code:
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    "output_qwen_v3",
    device_map="auto",
    trust_remote_code=True
).eval()
merged_model = model.merge_and_unload()
# max_shard_size and safe_serialization are not necessary.
# They shard the checkpoint and save the model as safetensors, respectively.
new_model_directory = "output_qwen_v3_merged"  # target directory for the merged model
merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True)
The following error arises:
/usr/local/lib/python3.10/dist-packages/peft/tuners/lora/model.py in _unload_and_optionally_merge(self, merge, progressbar, safe_merge, adapter_names)
    425         if merge:
    426             if getattr(self.model, "quantization_method", None) == "gptq":
--> 427                 raise ValueError("Cannot merge LORA layers when the model is gptq quantized")
    428
    429         self._unloading_checks(adapter_names)
ValueError: Cannot merge LORA layers when the model is gptq quantized
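For anyone hitting the same error: peft blocks the merge because GPTQ stores packed integer weights, so the LoRA delta cannot be folded back into them. One possible workaround (a minimal sketch, not official Qwen guidance, assuming the PEFT wrapper forwards Qwen-VL's chat() method as in Qwen's Q-LoRA inference example) is to skip merge_and_unload() and run inference with the adapter kept on top of the quantized base, which AutoPeftModelForCausalLM already loads in one call:

# Sketch: inference with the adapter left un-merged on top of the GPTQ base.
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_dir = "output_qwen_v3"  # directory written by finetune.py --use_lora

model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_dir,
    device_map="auto",
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat-Int4", trust_remote_code=True)

# Qwen-VL's remote-code chat interface, as shown in the model card.
query = tokenizer.from_list_format([
    {"image": "demo.jpeg"},            # hypothetical local image path
    {"text": "Describe this image."},
])
response, _ = model.chat(tokenizer, query=query, history=None)
print(response)

The adapter only adds a small extra matmul per adapted layer, so the inference overhead compared with a merged model is usually modest.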
期望行为 | Expected Behavior
As described at the end of the finetuning section of the README, this code should merge the LoRA adapter into the pre-trained model, producing a standalone model.
复现方法 | Steps To Reproduce
No response
运行环境 | Environment
- OS: Linux (Colab)
- peft==0.7
备注 | Anything else?
I need to merge the model to reduce latency during inference; currently, loading the base model and the LoRA adapter separately at inference time adds latency.
BTW, the latency issue is discussed here: Huggingface: merge-lora-weights-into-the-base-model
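If a single merged checkpoint is a hard requirement, one possible option (a sketch only, under the assumption that an adapter trained against the Int4 weights targets the same module names in the bf16 checkpoint and transfers acceptably; outputs may drift and this should be validated) is to apply the adapter to the non-quantized Qwen/Qwen-VL-Chat and merge there, since merge_and_unload() is only blocked for quantized backends:

# Sketch: merge the adapter into the full-precision base, which peft allows.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat",            # non-quantized base, so merging is permitted
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, "output_qwen_v3")
merged = model.merge_and_unload()
merged.save_pretrained("output_qwen_v3_merged",  # hypothetical output directory
                       max_shard_size="2048MB", safe_serialization=True)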
@AHMAD-DOMA could you please help me with how to finetune the INT4 model? I tried but failed. [email protected] please share the script with me if possible.
Have you solved this problem yet?