LLaMA-Factory
Gemma3 fine-tuning on a GPU without FlashAttention-2 support raises a warning and the training loss becomes 0
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
[2025-04-17 23:04:14,803] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 04-17 23:04:16 [init.py:239] Automatically detected platform cuda.
- llamafactory version: 0.9.3.dev0
- Platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.39
- Python version: 3.12.8
- PyTorch version: 2.6.0+cu124 (GPU)
- Transformers version: 4.51.3
- Datasets version: 3.5.0
- Accelerate version: 1.6.0
- PEFT version: 0.15.1
- TRL version: 0.9.6
- GPU type: Tesla V100S-PCIE-32GB
- GPU number: 3
- GPU memory: 31.73GB
- DeepSpeed version: 0.15.4
- vLLM version: 0.8.4
Reproduction
Training directly with LoRA + DeepSpeed Stage 3.
Relevant parameters (partial):
cutoff_len: 10240
deepspeed: cache/ds_z3_config.json
enable_liger_kernel: true
finetuning_type: lora
flash_attn: auto
fp16: true
gradient_accumulation_steps: 1
learning_rate: 5.0e-05
lora_alpha: 64
lora_dropout: 0
lora_rank: 32
lora_target: all
lr_scheduler_type: cosine
max_grad_norm: 1.0
optim: adamw_torch
packing: false
per_device_train_batch_size: 2
stage: sft
template: gemma3
trust_remote_code: true
warmup_steps: 30
The following warning appears:
It is strongly recommended to train Gemma3 models with the eager attention implementation instead of sdpa.
After that, the loss becomes 0 and training aborts with an error (the loss scale has already reached its minimum).
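One workaround worth trying, since the V100 cannot run FlashAttention-2 and `flash_attn: auto` therefore falls back to sdpa, is to request the eager implementation the warning asks for. The sketch below assumes that `flash_attn: disabled` in the LLaMA-Factory config maps to the eager attention implementation; whether this alone prevents the fp16 loss-scale collapse on this hardware is untested:

```yaml
# Minimal sketch of adjusted options (assumption: `flash_attn: disabled`
# selects the eager attention implementation recommended by the warning).
flash_attn: disabled   # avoid the sdpa fallback on GPUs without FlashAttention-2
fp16: true             # V100 (compute capability 7.0) has no native bf16, so fp16
                       # overflow remains a possible cause if the loss still hits 0
```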
Others
Is this related to the following issue?
https://github.com/huggingface/transformers/issues/33333
It's a warning, not an error, and it's safe to ignore. Consider using q_proj,v_proj,k_proj,o_proj,gate_proj,up_proj,down_proj in the LoRA modules field for Gemma fine-tuning. cosine_with_restarts is a much better choice for the scheduler, and set lora_dropout to 0.1 or 0.05.
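Expressed as LLaMA-Factory config options, those suggestions would look roughly like this sketch (the dropout value is one of the two suggested above; the module list is taken verbatim from the reply):

```yaml
# Sketch of the suggested adjustments from the reply above.
lora_target: q_proj,v_proj,k_proj,o_proj,gate_proj,up_proj,down_proj
lr_scheduler_type: cosine_with_restarts
lora_dropout: 0.05   # or 0.1
```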