Serious misalignment in LLaVA implementation
Reminder
- [X] I have read the README and searched the existing issues.
System Info
- llamafactory version: 0.9.1.dev0
- Platform: Linux-6.5.0-35-generic-x86_64-with-glibc2.35
- Python version: 3.10.15
- PyTorch version: 2.5.1+cu124 (GPU)
- Transformers version: 4.46.1
- Datasets version: 3.0.2
- Accelerate version: 1.0.1
- PEFT version: 0.12.0
- TRL version: 0.9.6
- GPU type: NVIDIA RTX 6000 Ada Generation
- Bitsandbytes version: 0.44.1
Reproduction
```yaml
### model
model_name_or_path: llava-hf/llava-1.5-7b-hf

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: mllm_demo
template: llava
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llava1_5-7b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
```
Expected behavior
In the original LLaVA implementation (see, e.g., https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/finetune_lora.sh), visual instruction tuning uses `--learning_rate 2e-4` for the LoRA parameters but assigns a separate `--mm_projector_lr 2e-5` to the mm_projector.
However, your implementation seems to apply a single learning rate to all modules. This could be why several issues (e.g., https://github.com/hiyouga/LLaMA-Factory/issues/5890, https://github.com/hiyouga/LLaMA-Factory/issues/5824, https://github.com/hiyouga/LLaMA-Factory/issues/5672) report that the original results cannot be reproduced.
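To illustrate the difference, here is a small, self-contained PyTorch sketch (a toy two-module model stands in for the projector and the language model, not the real checkpoint): a single-lr optimizer treats every module the same, whereas the LLaVA-style recipe puts the projector in its own parameter group with a lower learning rate.

```python
import torch
from torch import nn

# Toy stand-in for an LLaVA-like model: one "projector" module, one "backbone" module.
model = nn.ModuleDict({
    "multi_modal_projector": nn.Linear(8, 8),
    "language_model": nn.Linear(8, 8),
})

# A single learning rate over all modules:
single_lr = torch.optim.AdamW(model.parameters(), lr=1e-4)

# What the original recipe does (--learning_rate 2e-4, --mm_projector_lr 2e-5):
proj = [p for n, p in model.named_parameters() if "multi_modal_projector" in n]
rest = [p for n, p in model.named_parameters() if "multi_modal_projector" not in n]
two_lrs = torch.optim.AdamW(
    [{"params": rest},               # falls back to the default lr below
     {"params": proj, "lr": 2e-5}],  # projector-specific learning rate
    lr=2e-4,
)

for g in two_lrs.param_groups:
    print(f"lr={g['lr']}, params={len(g['params'])}")  # 2e-4 for the backbone, 2e-5 for the projector
```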
Others
The misalignment may come from a difference in the `create_optimizer` method. In the LLaVA implementation (see https://github.com/haotian-liu/LLaVA/blob/main/llava/train/llava_trainer.py), the hyperparameters (weight_decay, lr) are assigned separately to different parameter groups; LLaMA-Factory does not seem to support this yet.
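For reference, here is a minimal sketch of what such a `create_optimizer` override could look like on top of the HF `Trainer`, mirroring the LLaVA trainer's approach of giving the projector its own learning rate. This is not LLaMA-Factory code: the class name `ProjectorLRTrainer`, the default `mm_projector_lr` value, and the `multi_modal_projector` name match (which follows the llava-hf checkpoint naming) are illustrative assumptions.

```python
from transformers import Trainer


class ProjectorLRTrainer(Trainer):
    """Sketch: create_optimizer gives the multimodal projector its own learning
    rate (as LLaVA's llava_trainer.py does with --mm_projector_lr), while every
    other trainable parameter keeps args.learning_rate. The
    "multi_modal_projector" substring follows the llava-hf checkpoint naming;
    adjust it for other models."""

    def __init__(self, *args, mm_projector_lr: float = 2e-5, **kwargs):
        super().__init__(*args, **kwargs)
        self.mm_projector_lr = mm_projector_lr

    def create_optimizer(self):
        if self.optimizer is None:
            decay_names = set(self.get_decay_parameter_names(self.model))
            named = [(n, p) for n, p in self.model.named_parameters() if p.requires_grad]
            is_proj = lambda n: "multi_modal_projector" in n

            param_groups = [
                # non-projector parameters, with and without weight decay (default lr)
                {"params": [p for n, p in named if not is_proj(n) and n in decay_names],
                 "weight_decay": self.args.weight_decay},
                {"params": [p for n, p in named if not is_proj(n) and n not in decay_names],
                 "weight_decay": 0.0},
                # projector parameters, with and without weight decay (dedicated lr)
                {"params": [p for n, p in named if is_proj(n) and n in decay_names],
                 "weight_decay": self.args.weight_decay, "lr": self.mm_projector_lr},
                {"params": [p for n, p in named if is_proj(n) and n not in decay_names],
                 "weight_decay": 0.0, "lr": self.mm_projector_lr},
            ]
            optim_cls, optim_kwargs = Trainer.get_optimizer_cls_and_kwargs(self.args)
            # Groups without an explicit "lr" fall back to optim_kwargs["lr"],
            # which get_optimizer_cls_and_kwargs sets to args.learning_rate.
            self.optimizer = optim_cls(param_groups, **optim_kwargs)
        return self.optimizer
```

Note that in a LoRA-only run the projector groups stay empty unless the projector is made trainable (see the `additional_target` suggestion below).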
Would you kindly fix this bug? In its current form, it would be incorrect to claim that "we supported fine-tuning the LLaVA-1.5 multimodal LLMs".
+1, looking forward to the author’s improvement @hiyouga
I think the mm_projector is not trained under this setting.
You need to add the line `additional_target: multi_modal_projector` to your YAML to make the multi_modal_projector trainable.
+1, looking forward to the code improvement