After SFT, merging LoRA weights noticeably degrades answer quality (not the cause in Issue #2505 or #4913)
Reminder
- [X] I have read the README and searched the existing issues.
System Info
- `llamafactory` version: 0.9.1.dev0
- Platform: Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.35
- Python version: 3.11.9
- PyTorch version: 2.4.0 (GPU)
- Transformers version: 4.43.4
- Datasets version: 3.0.2
- Accelerate version: 1.0.1
- PEFT version: 0.12.0
- TRL version: 0.9.6
- GPU type: NVIDIA A800 80GB PCIe
- DeepSpeed version: 0.14.4
- Bitsandbytes version: 0.43.1
- vLLM version: 0.5.4
Reproduction
I ran into the same problem as Issue #2505 and #4913, but after investigation I ruled out the causes described in those two issues.
Test sample before merging (tested on the SFT data; the model outputs well-formed JSON):
Test sample after merging (the model outputs garbage):
The fine-tuning, merge, and test scripts and parameters are provided below:
- Fine-tuning script and parameters
```
llamafactory-cli train examples/train_lora/gemma2_lora_sft_causality.yaml
```

```yaml
### model
model_name_or_path: /home/ckf/models/gemma-2-27b-it

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
lora_rank: 8
lora_alpha: 16
lora_dropout: 0
deepspeed: examples/deepspeed/ds_z2_config.json

### dataset
dataset: causality_train2_cot
template: gemma
cutoff_len: 8192
max_samples: null
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /home/lu/saves/gemma2-27b-it/lora/sft_causality_train2_cot
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
```
- Merge script and parameters
```
llamafactory-cli export examples/merge_lora/gemma2_lora_sft.yaml
```

```yaml
### model
model_name_or_path: /home/ckf/models/gemma-2-27b-it/
adapter_name_or_path: /home/lu/saves/gemma2-27b-it/lora/sft_causality_train2_cot/checkpoint-2000/
template: gemma
finetuning_type: lora

### export
export_dir: /home/lu/models/gemma2_lora_sft_causality_train2_cot
export_size: 5
export_device: auto
export_legacy_format: false
```
- Test script and parameters
```
llamafactory-cli chat examples/inference/gemma2_lora_sft_causality.yaml
```

Contents of gemma2_lora_sft_causality.yaml before merging:

```yaml
model_name_or_path: /home/ckf/models/gemma-2-27b-it
adapter_name_or_path: /home/lu/saves/gemma2-27b-it/lora/sft_causality_train2_cot/checkpoint-2000/
template: gemma
finetuning_type: lora
```

Contents of gemma2_lora_sft_causality.yaml after merging:

```yaml
model_name_or_path: /home/lu/models/gemma2_lora_sft_causality_train2_cot
template: gemma
```
Expected behavior
After merging, the model should produce the same well-formed JSON output as before merging.
Others
No response
So far I have only seen this problem on gemma-2-27b-it; other models such as qwen-2.5-32b-it are unaffected.
glm4 also shows the problem.
We hit a similar phenomenon on a classification task: loading each checkpoint after training and evaluating it separately gives clearly worse metrics than the eval run during training. We reproduced it with a minimal case using `do_eval: true`, `eval_steps: 30`, `save_steps: 30`, `max_steps: 32` (see the attached figure). We finally traced it to peft's `merge_and_unload()`: after commenting it out and re-running the standalone eval, the metrics matched the in-training eval.
Code change (in LLaMA-Factory's adapter-loading code):

```python
for adapter in adapter_to_merge:
    model: "LoraModel" = PeftModel.from_pretrained(model, adapter, **init_kwargs)
    # model = model.merge_and_unload()  # commenting out this line restores the metrics
```
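For context, `merge_and_unload()` folds each adapter into the base weights as W' = W + (alpha/r)·B·A, which is mathematically a no-op for the forward pass. One possible contributor to the divergence (not confirmed in this thread) is that the fold happens in half precision, where an update smaller than a weight's rounding step is partially lost. A minimal numpy sketch; only `r = 8` and `alpha = 16` come from the training config above, the shapes and scales are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
r, d = 8, 64          # lora_rank from the config above; d is an illustrative width
alpha = 16            # lora_alpha from the config above

W = rng.standard_normal((d, d)).astype(np.float16)           # base weight in half precision
A = (0.01 * rng.standard_normal((r, d))).astype(np.float32)  # LoRA down-projection
B = (0.01 * rng.standard_normal((d, r))).astype(np.float32)  # LoRA up-projection

delta = (alpha / r) * (B @ A)        # the LoRA update, computed in fp32

merged_fp32 = W.astype(np.float32) + delta   # merge carried out in full precision
merged_fp16 = W + delta.astype(np.float16)   # merge carried out directly in half precision

# The two merges disagree by rounding error; across hundreds of layers this can add up.
err = np.abs(merged_fp32 - merged_fp16.astype(np.float32)).max()
print(f"max per-weight merge discrepancy: {err:.2e}")
```

This only illustrates why a mathematically equivalent merge can still shift the logits slightly; on its own it does not explain a collapse into garbage output, so the actual root cause may lie elsewhere.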
Fine-tuning glm4-9b also shows this problem.
qwen2.5vl-7b-instruct is also affected, with a noticeable quality drop.
Has this been resolved? Is it a model-specific problem? I also hit it when merging weights after fine-tuning llada.