Serious misalignment in LLaVA implementation
Reminder
- [X] I have read the README and searched the existing issues.
System Info
- llamafactory version: 0.9.1.dev0
- Platform: Linux-6.5.0-35-generic-x86_64-with-glibc2.35
- Python version: 3.10.15
- PyTorch version: 2.5.1+cu124 (GPU)
- Transformers version: 4.46.1
- Datasets version: 3.0.2
- Accelerate version: 1.0.1
- PEFT version: 0.12.0
- TRL version: 0.9.6
- GPU type: NVIDIA RTX 6000 Ada Generation
- Bitsandbytes version: 0.44.1
Reproduction
```yaml
### model
model_name_or_path: llava-hf/llava-1.5-7b-hf

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: mllm_demo
template: llava
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llava1_5-7b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
```
Expected behavior
In the original LLaVA implementation (see, e.g., https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/finetune_lora.sh), visual instruction tuning uses `--learning_rate 2e-4` for the LoRA parameters but assigns a separate `--mm_projector_lr 2e-5` to the mm_projector.
However, your implementation seems to apply a single learning rate to all modules. This could be why several issues (e.g., https://github.com/hiyouga/LLaMA-Factory/issues/5890, https://github.com/hiyouga/LLaMA-Factory/issues/5824, https://github.com/hiyouga/LLaMA-Factory/issues/5672) report that the original results cannot be reproduced.
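To illustrate the difference, here is a small, self-contained PyTorch sketch (a toy two-module model stands in for the projector and the language model, not the real checkpoint): a single-lr optimizer treats every module the same, whereas the LLaVA-style recipe puts the projector in its own parameter group with a lower learning rate.

```python
import torch
from torch import nn

# Toy stand-in for an LLaVA-like model: one "projector" module, one "backbone" module.
model = nn.ModuleDict({
    "multi_modal_projector": nn.Linear(8, 8),
    "language_model": nn.Linear(8, 8),
})

# A single learning rate over all modules:
single_lr = torch.optim.AdamW(model.parameters(), lr=1e-4)

# What the original recipe does (--learning_rate 2e-4, --mm_projector_lr 2e-5):
proj = [p for n, p in model.named_parameters() if "multi_modal_projector" in n]
rest = [p for n, p in model.named_parameters() if "multi_modal_projector" not in n]
two_lrs = torch.optim.AdamW(
    [{"params": rest},               # falls back to the default lr below
     {"params": proj, "lr": 2e-5}],  # projector-specific learning rate
    lr=2e-4,
)

for g in two_lrs.param_groups:
    print(f"lr={g['lr']}, params={len(g['params'])}")  # 2e-4 for the backbone, 2e-5 for the projector
```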
Others
The misalignment may come from a difference in the `create_optimizer` method. In the LLaVA implementation (see https://github.com/haotian-liu/LLaVA/blob/main/llava/train/llava_trainer.py), the hyperparameters (weight_decay, lr) are assigned separately to different parameter groups; LLaMA-Factory does not seem to support this yet.
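For reference, here is a minimal sketch of what such a `create_optimizer` override could look like on top of the HF `Trainer`, mirroring the LLaVA trainer's approach of giving the projector its own learning rate. This is not LLaMA-Factory code: the class name `ProjectorLRTrainer`, the default `mm_projector_lr` value, and the `multi_modal_projector` name match (which follows the llava-hf checkpoint naming) are illustrative assumptions.

```python
from transformers import Trainer


class ProjectorLRTrainer(Trainer):
    """Sketch: create_optimizer gives the multimodal projector its own learning
    rate (as LLaVA's llava_trainer.py does with --mm_projector_lr), while every
    other trainable parameter keeps args.learning_rate. The
    "multi_modal_projector" substring follows the llava-hf checkpoint naming;
    adjust it for other models."""

    def __init__(self, *args, mm_projector_lr: float = 2e-5, **kwargs):
        super().__init__(*args, **kwargs)
        self.mm_projector_lr = mm_projector_lr

    def create_optimizer(self):
        if self.optimizer is None:
            decay_names = set(self.get_decay_parameter_names(self.model))
            named = [(n, p) for n, p in self.model.named_parameters() if p.requires_grad]
            is_proj = lambda n: "multi_modal_projector" in n

            param_groups = [
                # non-projector parameters, with and without weight decay (default lr)
                {"params": [p for n, p in named if not is_proj(n) and n in decay_names],
                 "weight_decay": self.args.weight_decay},
                {"params": [p for n, p in named if not is_proj(n) and n not in decay_names],
                 "weight_decay": 0.0},
                # projector parameters, with and without weight decay (dedicated lr)
                {"params": [p for n, p in named if is_proj(n) and n in decay_names],
                 "weight_decay": self.args.weight_decay, "lr": self.mm_projector_lr},
                {"params": [p for n, p in named if is_proj(n) and n not in decay_names],
                 "weight_decay": 0.0, "lr": self.mm_projector_lr},
            ]
            optim_cls, optim_kwargs = Trainer.get_optimizer_cls_and_kwargs(self.args)
            # Groups without an explicit "lr" fall back to optim_kwargs["lr"],
            # which get_optimizer_cls_and_kwargs sets to args.learning_rate.
            self.optimizer = optim_cls(param_groups, **optim_kwargs)
        return self.optimizer
```

Note that in a LoRA-only run the projector groups stay empty unless the projector is made trainable (see the `additional_target` suggestion below).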
Would you kindly fix this bug? In its current form, it would be incorrect to claim that "we supported fine-tuning the LLaVA-1.5 multimodal LLMs".
+1, looking forward to the author’s improvement @hiyouga
I think the mm_projector is not trained under this setting.
You need to add the line `additional_target: multi_modal_projector` to your YAML to make the multi_modal_projector trainable.
+1, looking forward to the code improvement