Qwen3-30B-A3B SFT with LoRA rank 16 is extremely slow
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
Qwen3-30B-A3B SFT with LoRA rank 16 is extremely slow: even a single LoRA SFT training step takes more than 5 minutes on this project's example data.
Reproduction
No response
Others
No response
Same issue here; GPU utilization stays below 20%.
Same problem here.
See https://github.com/QwenLM/Qwen3/issues/736#issuecomment-2207996348
Same issue; has anyone solved it?
I've made a fused Qwen3 MoE layer for faster fine-tuning; see the discussion in Unsloth: https://github.com/unslothai/unsloth/discussions/2890 . I think it could also be used in LLaMA-Factory.
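Roughly, the speedup comes from replacing the per-expert Python loop in the stock Transformers MoE block with a grouped/batched computation. Below is only a toy sketch of that difference, not the actual Unsloth kernel: it uses top-1 routing and a single up-projection for brevity, whereas Qwen3 MoE uses gated SwiGLU with weighted top-8 routing.

```python
# Illustrative sketch only -- NOT the Unsloth implementation.
# The slow path launches one tiny GEMM per expert; the batched path does the
# same math in a couple of large batched matmuls, which keeps the GPU busy.
import torch
import torch.nn.functional as F

def moe_ffn_looped(x, w_up, w_down, expert_idx):
    """x: (T, H); w_up: (E, H, D); w_down: (E, D, H); expert_idx: (T,) top-1 routing."""
    out = torch.zeros_like(x)
    for e in range(w_up.shape[0]):          # E small GEMMs -> poor GPU utilization
        sel = expert_idx == e
        if sel.any():
            h = F.silu(x[sel] @ w_up[e])
            out[sel] = h @ w_down[e]
    return out

def moe_ffn_batched(x, w_up, w_down, expert_idx):
    """Same math, but per-token expert weights are gathered and run as batched matmuls."""
    up = w_up[expert_idx]                   # (T, H, D), gathered per token
    down = w_down[expert_idx]               # (T, D, H)
    h = F.silu(torch.bmm(x.unsqueeze(1), up))   # (T, 1, D)
    return torch.bmm(h, down).squeeze(1)        # (T, H)

# quick check that the two paths agree
T, H, D, E = 8, 16, 32, 4
x = torch.randn(T, H)
w_up, w_down = torch.randn(E, H, D), torch.randn(E, D, H)
idx = torch.randint(0, E, (T,))
assert torch.allclose(moe_ffn_looped(x, w_up, w_down, idx),
                      moe_ffn_batched(x, w_up, w_down, idx), atol=1e-4)
```

A real fused/grouped-GEMM kernel avoids materializing the per-token weight copies that `w_up[expert_idx]` creates here; the point is only that a few large batched ops utilize the GPU far better than many tiny per-expert GEMMs.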
@woct0rdho Looking forward to its integration in LLaMA-Factory!
Can anyone share their training scripts? Mine won't run once I switch to the A3B model, and this is on 8x A800 GPUs with 80 GB of VRAM each. I'm at a loss.
accelerate launch \
    --main_process_port 25515 \
    --config_file ./scripts/config.yaml \
    ./src/train.py \
    --stage sft \
    --do_train True \
    --model_name_or_path ${model_path} \
    --dataset $train_ds \
    --dataset_dir /opt/nas/p/learning_platform/zouyapeng/docsum/LLaMA-Factory/data \
    --template qwen3 \
    --finetuning_type lora \
    --lora_target $lora_target \
    --output_dir ${output_dir} \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 4500 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 20 \
    --eval_steps 20 \
    --eval_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 10.0 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss True \
    --fp16 True \
    --lora_rank $lora_rank \
    --lora_alpha $lora_alpha \
    --lora_dropout 0.1 \
    --enable_thinking False \
    --early_stopping_steps 5 \
    --trust_remote_code True \
    --moe_aux_loss_coef 1e-3 \
    --gradient_checkpointing True \
    --flash_attn auto \
    > /opt/nas/p/learning_platform/zouyapeng/docsum/LLaMA-Factory/saves/dialogue.log
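If it helps anyone compare, here is roughly the same run expressed as a LLaMA-Factory YAML config for `llamafactory-cli train`. This is my best guess at the mapping; the model path, dataset name, output dir, and the LoRA rank/alpha/target are placeholders for the shell variables above, so adjust them to your setup and to the argument names your LLaMA-Factory version supports.

```yaml
### model
model_name_or_path: /path/to/Qwen3-30B-A3B   # placeholder for ${model_path}
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 16        # placeholder for $lora_rank
lora_alpha: 32       # placeholder for $lora_alpha
lora_dropout: 0.1
lora_target: all     # placeholder for $lora_target
moe_aux_loss_coef: 1.0e-3

### dataset
dataset: my_dataset  # placeholder for $train_ds
dataset_dir: /opt/nas/p/learning_platform/zouyapeng/docsum/LLaMA-Factory/data
template: qwen3
cutoff_len: 4500
preprocessing_num_workers: 16
val_size: 0.1
overwrite_cache: true

### output
output_dir: saves/qwen3-30b-a3b-lora   # placeholder for ${output_dir}
overwrite_output_dir: true
logging_steps: 10
save_steps: 20
plot_loss: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 5.0e-5
num_train_epochs: 10.0
lr_scheduler_type: cosine
warmup_steps: 20
fp16: true
gradient_checkpointing: true
flash_attn: auto
ddp_timeout: 180000000
enable_thinking: false

### eval
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 20
load_best_model_at_end: true
```

Launched with something like `llamafactory-cli train qwen3_moe_lora_sft.yaml` (file name is just an example), with multi-GPU settings coming from your accelerate/DeepSpeed config as before.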