LLaMA-Factory
Gemma3 fine-tuning on a GPU without FlashAttention-2 support raises a warning and the training loss becomes 0
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
[2025-04-17 23:04:14,803] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 04-17 23:04:16 [init.py:239] Automatically detected platform cuda.
- llamafactory version: 0.9.3.dev0
- Platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.39
- Python version: 3.12.8
- PyTorch version: 2.6.0+cu124 (GPU)
- Transformers version: 4.51.3
- Datasets version: 3.5.0
- Accelerate version: 1.6.0
- PEFT version: 0.15.1
- TRL version: 0.9.6
- GPU type: Tesla V100S-PCIE-32GB
- GPU number: 3
- GPU memory: 31.73GB
- DeepSpeed version: 0.15.4
- vLLM version: 0.8.4
Reproduction
Training directly with LoRA + DeepSpeed Stage 3.
Relevant parameters (partial):
cutoff_len: 10240
deepspeed: cache/ds_z3_config.json
enable_liger_kernel: true
finetuning_type: lora
flash_attn: auto
fp16: true
gradient_accumulation_steps: 1
learning_rate: 5.0e-05
lora_alpha: 64
lora_dropout: 0
lora_rank: 32
lora_target: all
lr_scheduler_type: cosine
max_grad_norm: 1.0
optim: adamw_torch
packing: false
per_device_train_batch_size: 2
stage: sft
template: gemma3
trust_remote_code: true
warmup_steps: 30
The following warning appears:
It is strongly recommended to train Gemma3 models with the eager attention implementation instead of sdpa.
After that, the loss becomes 0 and training aborts with an error (the loss scale has already reached its minimum).
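One workaround worth trying, since the V100 cannot run FlashAttention-2 and `flash_attn: auto` therefore falls back to sdpa, is to request the eager implementation the warning asks for. The sketch below assumes that `flash_attn: disabled` in the LLaMA-Factory config maps to the eager attention implementation; whether this alone prevents the fp16 loss-scale collapse on this hardware is untested:

```yaml
# Minimal sketch of adjusted options (assumption: `flash_attn: disabled`
# selects the eager attention implementation recommended by the warning).
flash_attn: disabled   # avoid the sdpa fallback on GPUs without FlashAttention-2
fp16: true             # V100 (compute capability 7.0) has no native bf16, so fp16
                       # overflow remains a possible cause if the loss still hits 0
```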
Others
Is this related to the following issue?
https://github.com/huggingface/transformers/issues/33333
It's a warning, not an error, and it's safe to ignore. Consider using q_proj,v_proj,k_proj,o_proj,gate_proj,up_proj,down_proj in the LoRA modules field for Gemma fine-tuning. cosine_with_restarts is a much better choice for the scheduler, and set lora_dropout to 0.1 or 0.05.
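Expressed as LLaMA-Factory config options, those suggestions would look roughly like this sketch (the dropout value is one of the two suggested above; the module list is taken verbatim from the reply):

```yaml
# Sketch of the suggested adjustments from the reply above.
lora_target: q_proj,v_proj,k_proj,o_proj,gate_proj,up_proj,down_proj
lr_scheduler_type: cosine_with_restarts
lora_dropout: 0.05   # or 0.1
```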