
Gemma3 fine-tuning on GPUs without flashattn2 support raises an error (warning), and the training loss becomes 0

Open alumik opened this issue 8 months ago • 1 comment

Reminder

  • [x] I have read the above rules and searched the existing issues.

System Info

[2025-04-17 23:04:14,803] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 04-17 23:04:16 [__init__.py:239] Automatically detected platform cuda.

  • llamafactory version: 0.9.3.dev0
  • Platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.39
  • Python version: 3.12.8
  • PyTorch version: 2.6.0+cu124 (GPU)
  • Transformers version: 4.51.3
  • Datasets version: 3.5.0
  • Accelerate version: 1.6.0
  • PEFT version: 0.15.1
  • TRL version: 0.9.6
  • GPU type: Tesla V100S-PCIE-32GB
  • GPU number: 3
  • GPU memory: 31.73GB
  • DeepSpeed version: 0.15.4
  • vLLM version: 0.8.4

Reproduction

Training directly with LoRA + DeepSpeed ZeRO Stage 3.

Some of the relevant parameters:

cutoff_len: 10240
deepspeed: cache/ds_z3_config.json
enable_liger_kernel: true
finetuning_type: lora
flash_attn: auto
fp16: true
gradient_accumulation_steps: 1
learning_rate: 5.0e-05
lora_alpha: 64
lora_dropout: 0
lora_rank: 32
lora_target: all
lr_scheduler_type: cosine
max_grad_norm: 1.0
optim: adamw_torch
packing: false
per_device_train_batch_size: 2
stage: sft
template: gemma3
trust_remote_code: true
warmup_steps: 30

The following warning appears:

It is strongly recommended to train Gemma3 models with the eager attention implementation instead of sdpa.

After that, the loss becomes 0 and training exits with an error (the loss scale has already reached its minimum).
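If the goal is just to follow the warning and move off sdpa, the attention backend can be switched in the training config. A minimal sketch against the reproduction config above, assuming that LLaMA-Factory's flash_attn: disabled setting maps to the eager attention implementation (not a verified fix for the loss issue):

flash_attn: disabled  # assumed to select eager attention instead of sdpa / flash-attn2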

Others

Could this be related to the following issue?

https://github.com/huggingface/transformers/issues/33333

alumik • Apr 17 '25 15:04

It's only a warning, not that important, and it's safe to ignore. For Gemma fine-tuning, consider using q_proj,v_proj,k_proj,o_proj,gate_proj,up_proj,down_proj in the LoRA modules field. cosine_with_restarts is a much better scheduler choice, and set lora_dropout to 0.1 or 0.05. In config terms, those suggestions would look roughly like the sketch after this comment.
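A sketch of those suggestions using the same keys as the reproduction config above (the values are only the commenter's recommendations, not a verified fix):

lora_target: q_proj,v_proj,k_proj,o_proj,gate_proj,up_proj,down_proj
lr_scheduler_type: cosine_with_restarts
lora_dropout: 0.1  # or 0.05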

rzgarespo • Apr 20 '25 02:04