Qwen3-30B-A3B SFT with LoRA rank 16 is extremely slow
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
Qwen3-30B-A3B SFT with LoRA rank 16 is extremely slow: even a single LoRA SFT training step takes more than 5 minutes on this project's example data.
Reproduction
No response
Others
No response
Same issue here; GPU utilization stays below 20%.
Same problem here.
See https://github.com/QwenLM/Qwen3/issues/736#issuecomment-2207996348
Same issue; has anyone solved it?
I've made a fused Qwen3 MoE layer for faster fine-tuning; see the discussion in Unsloth: https://github.com/unslothai/unsloth/discussions/2890 . I think it could also be used in LLaMA-Factory.
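Roughly, the speedup comes from replacing the per-expert Python loop in the stock Transformers MoE block with a grouped/batched computation. Below is only a toy sketch of that difference, not the actual Unsloth kernel: it uses top-1 routing and a single up-projection for brevity, whereas Qwen3 MoE uses gated SwiGLU with weighted top-8 routing.

```python
# Illustrative sketch only -- NOT the Unsloth implementation.
# The slow path launches one tiny GEMM per expert; the batched path does the
# same math in a couple of large batched matmuls, which keeps the GPU busy.
import torch
import torch.nn.functional as F

def moe_ffn_looped(x, w_up, w_down, expert_idx):
    """x: (T, H); w_up: (E, H, D); w_down: (E, D, H); expert_idx: (T,) top-1 routing."""
    out = torch.zeros_like(x)
    for e in range(w_up.shape[0]):          # E small GEMMs -> poor GPU utilization
        sel = expert_idx == e
        if sel.any():
            h = F.silu(x[sel] @ w_up[e])
            out[sel] = h @ w_down[e]
    return out

def moe_ffn_batched(x, w_up, w_down, expert_idx):
    """Same math, but per-token expert weights are gathered and run as batched matmuls."""
    up = w_up[expert_idx]                   # (T, H, D), gathered per token
    down = w_down[expert_idx]               # (T, D, H)
    h = F.silu(torch.bmm(x.unsqueeze(1), up))   # (T, 1, D)
    return torch.bmm(h, down).squeeze(1)        # (T, H)

# quick check that the two paths agree
T, H, D, E = 8, 16, 32, 4
x = torch.randn(T, H)
w_up, w_down = torch.randn(E, H, D), torch.randn(E, D, H)
idx = torch.randint(0, E, (T,))
assert torch.allclose(moe_ffn_looped(x, w_up, w_down, idx),
                      moe_ffn_batched(x, w_up, w_down, idx), atol=1e-4)
```

A real fused/grouped-GEMM kernel avoids materializing the per-token weight copies that `w_up[expert_idx]` creates here; the point is only that a few large batched ops utilize the GPU far better than many tiny per-expert GEMMs.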
@woct0rdho Looking forward to its integration in LLaMA-Factory!
Can anyone share their training scripts? Mine won't run once I switch to the A3B model, and this is on 8x A800 GPUs with 80 GB of VRAM each. I'm at a loss.
accelerate launch \
    --main_process_port 25515 \
    --config_file ./scripts/config.yaml \
    ./src/train.py \
    --stage sft \
    --do_train True \
    --model_name_or_path ${model_path} \
    --dataset $train_ds \
    --dataset_dir /opt/nas/p/learning_platform/zouyapeng/docsum/LLaMA-Factory/data \
    --template qwen3 \
    --finetuning_type lora \
    --lora_target $lora_target \
    --output_dir ${output_dir} \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 4500 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 20 \
    --eval_steps 20 \
    --eval_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 10.0 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss True \
    --fp16 True \
    --lora_rank $lora_rank \
    --lora_alpha $lora_alpha \
    --lora_dropout 0.1 \
    --enable_thinking False \
    --early_stopping_steps 5 \
    --trust_remote_code True \
    --moe_aux_loss_coef 1e-3 \
    --gradient_checkpointing True \
    --flash_attn auto \
    > /opt/nas/p/learning_platform/zouyapeng/docsum/LLaMA-Factory/saves/dialogue.log
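If it helps anyone compare, here is roughly the same run expressed as a LLaMA-Factory YAML config for `llamafactory-cli train`. This is my best guess at the mapping; the model path, dataset name, output dir, and the LoRA rank/alpha/target are placeholders for the shell variables above, so adjust them to your setup and to the argument names your LLaMA-Factory version supports.

```yaml
### model
model_name_or_path: /path/to/Qwen3-30B-A3B   # placeholder for ${model_path}
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 16        # placeholder for $lora_rank
lora_alpha: 32       # placeholder for $lora_alpha
lora_dropout: 0.1
lora_target: all     # placeholder for $lora_target
moe_aux_loss_coef: 1.0e-3

### dataset
dataset: my_dataset  # placeholder for $train_ds
dataset_dir: /opt/nas/p/learning_platform/zouyapeng/docsum/LLaMA-Factory/data
template: qwen3
cutoff_len: 4500
preprocessing_num_workers: 16
val_size: 0.1
overwrite_cache: true

### output
output_dir: saves/qwen3-30b-a3b-lora   # placeholder for ${output_dir}
overwrite_output_dir: true
logging_steps: 10
save_steps: 20
plot_loss: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 5.0e-5
num_train_epochs: 10.0
lr_scheduler_type: cosine
warmup_steps: 20
fp16: true
gradient_checkpointing: true
flash_attn: auto
ddp_timeout: 180000000
enable_thinking: false

### eval
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 20
load_best_model_at_end: true
```

Launched with something like `llamafactory-cli train qwen3_moe_lora_sft.yaml` (file name is just an example), with multi-GPU settings coming from your accelerate/DeepSpeed config as before.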