Slow training of Qwen3-VL-30B-A3B with DeepSpeed Stage 3
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
`llamafactory-cli env` output:
- `llamafactory` version: 0.9.4.dev0
- Platform: Linux-5.14.0-362.24.1.el9_3.0.1.x86_64-x86_64-with-glibc2.34
- Python version: 3.11.13
- PyTorch version: 2.9.0+cu128 (GPU)
- Transformers version: 4.57.1
- Datasets version: 4.0.0
- Accelerate version: 1.11.0
- PEFT version: 0.17.1
- GPU type: NVIDIA H100 80GB HBM3
- GPU number: 8
- GPU memory: 79.10GB
- TRL version: 0.9.6
- DeepSpeed version: 0.16.9
- Liger-kernel version: 0.6.3
- Git commit: 3057db15c33cc2b3b13e36712c5663e250a73136
- Default data directory: detected
Reproduction
During my experiments with Qwen3-VL-30B-A3B I was running SFT on [text+image] and [text-only] data, with DeepSpeed Stage 3, without CPU offload, batch size 1, and gradient accumulation 4.
On average, one step (bs=1, grad. accum=4) takes about 166 seconds on 8 × H100 GPUs.
I'm curious what the reason for the slow training might be and what I can try. 🤔
Question: Is it because of how MoE works, so that sharding it with DeepSpeed Stage 3 causes significant training speed degradation?
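For reference, at bs 1 × grad. accum 4 × 8 GPUs with cutoff_len 8196 and packing, 166 s/step works out to roughly (8 × 4 × 8196) / 166 ≈ 1.6k tokens/s aggregate (assuming packed sequences fill the cutoff), which does seem low for H100s. Below is a minimal profiling sketch I could use to check whether the time goes to ZeRO-3 all-gathers rather than MoE compute; it assumes a small custom script wrapping a few training steps, not something llamafactory-cli exposes directly, and `dataloader`/`model_engine` are placeholders:

```python
# Minimal sketch (assumption: a small custom script wrapping a few training
# steps, not something llamafactory-cli exposes) to see whether step time is
# dominated by NCCL/ZeRO-3 all-gathers or by actual compute.
import torch
from torch.profiler import ProfilerActivity, profile, schedule

prof = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
)

prof.start()
for step, batch in enumerate(dataloader):        # hypothetical dataloader
    loss = model_engine(**batch).loss            # hypothetical DeepSpeed engine
    model_engine.backward(loss)
    model_engine.step()
    prof.step()
    if step >= 5:
        break
prof.stop()

# If communication kernels (all-gather / reduce-scatter) dominate this table,
# that points at ZeRO-3 parameter gathering rather than the MoE compute itself.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=30))
```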
What I tried:
- FP8 training, but it didn't work out.
- neat_packing: true also didn't work out, if I understand correctly because of a transformers version mismatch, but it's Qwen3-VL so I need a nearly latest transformers version.
- Larger batch size --> OOM
- DeepSpeed Stage 2 --> OOM
What I think might work:
- CPU offload + a larger batch size (see the config sketch after the training config below)
- FA2 instead of SDPA (I specified FA2 but it falls back to SDPA; see the check after this list)
- DeepSpeed Stage 2 with LoRA vs. DeepSpeed Stage 3 with LoRA, to compare how sharding the model across GPUs affects speed
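To rule out a silent fallback on the FA2 point, here is a quick environment check (plain torch/transformers calls, nothing LLaMA-Factory-specific):

```python
# Quick check that flash-attn 2 is installed and that transformers considers it
# usable; if this prints False, flash_attn: fa2 may fall back to SDPA.
import importlib.util

import torch
from transformers.utils import is_flash_attn_2_available

print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("flash_attn package installed:", importlib.util.find_spec("flash_attn") is not None)
print("transformers reports FA2 usable:", is_flash_attn_2_available())
```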
Qwen3-VL-30B-A3B training config:
### model
model_name_or_path: models/Qwen3-VL-30B-A3B-Thinking
image_max_pixels: 50176
image_min_pixels: 784
video_max_pixels: 16384
trust_remote_code: true
flash_attn: fa2
### method
stage: sft
do_train: true
finetuning_type: full
freeze_vision_tower: true
freeze_multi_modal_projector: false
freeze_language_model: false
deepspeed: examples/deepspeed/ds_z3_config.json
### dataset
## test
dataset: seq_len_test
template: qwen3_vl # qwen3_vl (thinking by default) or qwen3_vl_nothink; see src/llamafactory/data/template.py (around line 1933)
cutoff_len: 8196 # Cutoff length for the dataset
max_samples: 100000000
overwrite_cache: false
preprocessing_num_workers: 16
dataloader_num_workers: 4
data_seed: 42
group_by_length: true
packing: true
# neat_packing: true  # TODO: maybe neat_packing will allow faster training?
### output
output_dir: ./qwen3vl30BA3B_mlp_llm/train_v1 # @CHANGABLE
logging_steps: 1
save_steps: 500 # @CHANGABLE
save_strategy: steps
save_total_limit: 10
overwrite_output_dir: true
report_to: wandb # or tensorboard
run_name: qwen3vl-30b-moe-full-finetune-mlp_llm
### train
per_device_train_batch_size: 1 # @CHANGABLE
gradient_accumulation_steps: 4 # @CHANGABLE
learning_rate: 8.0e-5 # @CHANGABLE
num_train_epochs: 1.0
lr_scheduler_type: cosine
enable_liger_kernel: true
gradient_checkpointing: true
# 111 h (160000 samples)
### FP8 config
# fp8: true
# fp8_backend: torchao # Use TorchAO backend for FP8
# fp8_enable_fsdp_float8_all_gather: false # Not used with DeepSpeed
resume_from_checkpoint: null
max_grad_norm: 1.0
bf16: true
weight_decay: 0.0
warmup_ratio: 0.03
ddp_timeout: 180000000
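For the "CPU offload + larger bs" idea above: LLaMA-Factory seems to ship an offload variant of the Stage 3 config (examples/deepspeed/ds_z3_offload_config.json, if it exists in your checkout). Alternatively, here is a sketch of what I would write by hand; the keys are my assumption of a reasonable ZeRO-3 + CPU offload setup, not the shipped file, and offloading usually trades step time for memory, so it only helps if the larger batch recovers the throughput:

```python
# Sketch of a ZeRO-3 + CPU offload DeepSpeed config (my assumption, not the
# file shipped with LLaMA-Factory) for the "CPU offload + larger bs" experiment.
import json

ds_z3_offload = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

with open("ds_z3_cpu_offload.json", "w") as f:
    json.dump(ds_z3_offload, f, indent=2)
# Then point the training config at it: deepspeed: ds_z3_cpu_offload.json
```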
I also noticed that GPU power usage is quite low; for H100s under full load it should be at least around 600 W.
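To quantify that, here is a small sampler I could run alongside training (assumes the nvidia-ml-py / pynvml package is installed); low SM utilization together with low power draw would again point at the GPUs waiting on communication or data loading:

```python
# Small GPU sampler (assumes nvidia-ml-py / pynvml is installed) that logs power
# draw and SM utilization for all visible GPUs once per second.
import time

import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(pynvml.nvmlDeviceGetCount())]

for _ in range(30):  # sample for ~30 seconds
    readings = []
    for h in handles:
        power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0  # reported in milliwatts
        util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu    # SM utilization in %
        readings.append(f"{power_w:.0f}W/{util}%")
    print(" ".join(readings))
    time.sleep(1)

pynvml.nvmlShutdown()
```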
Others
Issue: slow training of Qwen3-VL-30B-A3B using DeepSpeed Stage 3
Same here, looking for a solution to make full use of the GPUs.
Ok