Liger-Kernel fused_linear_cross_entropy caused bad performance

Open waynechu1021 opened this issue 4 months ago • 0 comments

🐛 Describe the bug

Hello, I'm fine-tuning Qwen-2.5-VL-3B with trl. When I turn on the Liger kernel, the resulting model performs poorly, whereas performance is normal without it. I later discovered that fused_linear_cross_entropy is what causes the difference. The loss looks almost the same in both cases, though, and this option is crucial for saving GPU memory, so simply disabling it is not a good option for me.
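For readers unfamiliar with the option: conceptually, a fused linear cross entropy computes the final projection and the loss chunk by chunk, so the full logits tensor never has to exist at once. The pure-Python sketch below illustrates only that idea. It is not Liger's Triton implementation, and every function name in it is hypothetical. Mathematically both paths give the same mean loss, which is consistent with the near-identical loss values reported above.

```python
import math


def project(hidden, weight):
    """Logits for one token; weight is [vocab][dim], hidden is [dim]."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def nll(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def unfused_loss(hiddens, weight, targets):
    """Materialize logits for every token, then reduce to a mean loss."""
    all_logits = [project(h, weight) for h in hiddens]
    return sum(nll(l, t) for l, t in zip(all_logits, targets)) / len(targets)


def chunked_loss(hiddens, weight, targets, chunk_size):
    """Same mean loss, but only one chunk of logits is alive at a time."""
    total = 0.0
    for start in range(0, len(hiddens), chunk_size):
        stop = start + chunk_size
        chunk_logits = [project(h, weight) for h in hiddens[start:stop]]
        total += sum(nll(l, t) for l, t in zip(chunk_logits, targets[start:stop]))
    return total / len(targets)
```

Because the two reductions differ only in summation order, any gap between them should be at floating-point rounding level. A matching loss does not by itself say anything about the gradients, which this sketch does not model.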

Here are my training args:

deepspeed --master_port 25410 src/open_r1/sft.py \
    --model_name_or_path .cache/Qwen2.5-VL-3B-Instruct \
    --dataset_name data/instructions_dynamic_combine_action.jsonl \
    --deepspeed scripts/zero3.json \
    --learning_rate 2.0e-5 \
    --num_train_epochs 1 \
    --packing \
    --max_seq_length 64800 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --dataloader_num_workers 4 \
    --gradient_checkpointing True \
    --torch_dtype bfloat16 \
    --bf16 \
    --logging_steps 5 \
    --eval_strategy no \
    --eval_steps 100 \
    --save_strategy no \
    --save_steps 6000 \
    --attn_implementation flash_attention_2 \
    --output_dir result/Qwen2.5-VL_3B_sft_r2r_dynamic_test_fix_liger \
    --report_to tensorboard
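To put the memory concern in numbers, the following back-of-the-envelope sketch estimates the size of the full logits tensor that an unfused loss would materialize per device with these args. It assumes a vocabulary size of 151936 (an assumption not stated in this report), that packing fills every sequence to max_seq_length, and 2 bytes per element for bfloat16. The helper name is hypothetical.

```python
def logits_bytes(batch_size, seq_len, vocab_size, bytes_per_element):
    """Bytes needed to hold a [batch, seq, vocab] logits tensor."""
    return batch_size * seq_len * vocab_size * bytes_per_element


# Values from the training args above; vocab_size is an assumption.
per_device = logits_bytes(
    batch_size=4,        # --per_device_train_batch_size
    seq_len=64800,       # --max_seq_length, assuming fully packed sequences
    vocab_size=151936,   # assumed vocabulary size
    bytes_per_element=2, # bfloat16
)
print(f"{per_device / 2**30:.1f} GiB")
```

Under these assumptions the logits alone come to roughly 73 GiB per device, before counting their gradient or any float32 upcast, which would explain why the fused option matters so much for this configuration.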

Reproduce

No response

Versions

Environment Report:

Operating System: Linux-6.8.0-54-generic-x86_64-with-glibc2.39
Python version: 3.10.16
Liger Kernel version: 0.5.10
PyTorch version: 2.7.1+cu126
CUDA version: 12.6
HIP(ROCm) version: Not available
Triton version: 3.3.1
Transformers version: 4.50.3
Trl version: 0.16.0
XPU version: XPU Not Available

Jul 21 '25 11:07 waynechu1021
