
Slow training of Qwen3-VL-30B-A3B with DeepSpeed Stage 3

vladimiralbrekhtccr opened this issue 2 months ago

Reminder

  • [x] I have read the above rules and searched the existing issues.

System Info

llamafactory-cli env output:

  • llamafactory version: 0.9.4.dev0
  • Platform: Linux-5.14.0-362.24.1.el9_3.0.1.x86_64-x86_64-with-glibc2.34
  • Python version: 3.11.13
  • PyTorch version: 2.9.0+cu128 (GPU)
  • Transformers version: 4.57.1
  • Datasets version: 4.0.0
  • Accelerate version: 1.11.0
  • PEFT version: 0.17.1
  • GPU type: NVIDIA H100 80GB HBM3
  • GPU number: 8
  • GPU memory: 79.10GB
  • TRL version: 0.9.6
  • DeepSpeed version: 0.16.9
  • Liger-kernel version: 0.6.3
  • Git commit: 3057db15c33cc2b3b13e36712c5663e250a73136
  • Default data directory: detected

Reproduction

During my experiments with Qwen3-VL-30B-A3B I was trying to run SFT on [text+image] and [text-only] data with DeepSpeed Stage 3, without CPU offload, with batch size 1 and gradient accumulation 4.

On average it takes about 166 seconds per optimizer step (bs=1, grad. accum=4) on 8 × H100 GPUs.

I'm curious what the reason for the slow training might be and what I can try. 🤔

Question: Is it because of how MoE works, i.e. does sharding the model with DeepSpeed Stage 3 cause significant training speed degradation?
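
One way I could check this directly would be to profile a few steps and see whether GPU time is dominated by NCCL collectives (the per-layer parameter all-gathers that ZeRO-3 adds) rather than matmuls. A minimal sketch of the profiling pattern, using a toy linear layer rather than Qwen3-VL (the same wrapper would go around a real training step):

# Toy stand-in for a training step; with the real model the printed table would
# show how much CUDA time goes to nccl all-gather / reduce-scatter kernels
# versus GEMMs.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(3):
        loss = model(x).float().pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))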

What I tried:

  1. FP8 training, but it didn't work out.
  2. neat_packing: true also didn't work out (if I understand correctly) because of a transformers version mismatch, but since this is Qwen3-VL I need an almost-latest version of transformers.
  3. Larger batch size --> OOM
  4. DeepSpeed Stage 2 --> OOM (see the rough memory estimate below this list)
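
A rough back-of-the-envelope estimate of why Stage 2 runs out of memory while Stage 3 barely fits (assuming roughly 30B trainable parameters in bf16 with fp32 Adam states, ignoring activations and the frozen vision tower): ZeRO-2 shards gradients and optimizer states but keeps a full bf16 copy of the weights on every GPU, whereas ZeRO-3 also shards the weights.

# Back-of-the-envelope memory estimate (assumed numbers, activations ignored).
params = 30e9                            # ~30B parameters
gpus = 8

bf16_weights = params * 2                # bf16 weights
bf16_grads   = params * 2                # bf16 gradients
adam_states  = params * (4 + 4 + 4)      # fp32 master weights + two Adam moments

zero2_per_gpu = bf16_weights + (bf16_grads + adam_states) / gpus   # weights replicated
zero3_per_gpu = (bf16_weights + bf16_grads + adam_states) / gpus   # everything sharded

print(f"ZeRO-2 per GPU: {zero2_per_gpu / 2**30:.0f} GiB")  # ~105 GiB, over 80 GB before activations
print(f"ZeRO-3 per GPU: {zero3_per_gpu / 2**30:.0f} GiB")  # ~56 GiB, little headroom for activations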

What I think might work:

  1. CPU offload + larger batch size
  2. FA2 instead of SDPA (I specified FA2 but it falls back to SDPA; see the check sketched after this list)
  3. DeepSpeed Stage 2 with LoRA vs. DeepSpeed Stage 3 with LoRA, to compare how sharding the model across GPUs affects speed.
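
On point 2: a quick sanity check (assuming the flash-attn package is what gets picked up) to see whether FlashAttention-2 is importable at all in this environment; if it is missing, the flash_attn: fa2 setting cannot take effect and SDPA is used instead.

# Checks whether FlashAttention-2 is installed and whether the GPU supports it.
# If the flash_attn package is missing, FA2 cannot be used even when requested.
import importlib.util
import torch

print("flash_attn importable:", importlib.util.find_spec("flash_attn") is not None)
print("GPU:", torch.cuda.get_device_name(0))
print("Compute capability:", torch.cuda.get_device_capability(0))  # (9, 0) on H100, FA2-capable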

Qwen3VL_30B_A3B training_config:

### model
model_name_or_path: models/Qwen3-VL-30B-A3B-Thinking
image_max_pixels: 50176
image_min_pixels: 784
video_max_pixels: 16384
trust_remote_code: true
flash_attn: fa2


### method
stage: sft
do_train: true
finetuning_type: full
freeze_vision_tower: true
freeze_multi_modal_projector: false
freeze_language_model: false
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
## test
dataset: seq_len_test

template: qwen3_vl # qwen3_vl (thinking by default) or qwen3_vl_nothink --> see ./LLaMA-Factory/src/llamafactory/data/template.py line 1933
cutoff_len: 8196 # Cutoff length for the dataset
max_samples: 100000000
overwrite_cache: false
preprocessing_num_workers: 16
dataloader_num_workers: 4
data_seed: 42

group_by_length: true
packing: true

# neat_packing: true  # TODO: maybe neat_packing would allow faster training?

### output
output_dir: ./qwen3vl30BA3B_mlp_llm/train_v1 # @CHANGABLE
logging_steps: 1
save_steps: 500 # @CHANGABLE
save_strategy: steps
save_total_limit: 10
overwrite_output_dir: true
report_to: wandb  # or tensorboard
run_name: qwen3vl-30b-moe-full-finetune-mlp_llm


### train
per_device_train_batch_size: 1 # @CHANGABLE
gradient_accumulation_steps: 4 # @CHANGABLE
learning_rate: 8.0e-5 # @CHANGABLE
num_train_epochs: 1.0
lr_scheduler_type: cosine
enable_liger_kernel: true
gradient_checkpointing: true

# estimated ~111 h for 160000 samples

### FP8 config

# fp8: true
# fp8_backend: torchao  # Use TorchAO backend for FP8
# fp8_enable_fsdp_float8_all_gather: false  # Not used with DeepSpeed

resume_from_checkpoint: null

max_grad_norm: 1.0
bf16: true
weight_decay: 0.0
warmup_ratio: 0.03
ddp_timeout: 180000000
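
To quantify "slow", a rough throughput estimate from the numbers above (assuming packing fills every sequence to cutoff_len, which is optimistic, so the real token throughput is at most this):

# Rough throughput from the observed ~166 s/step and the config values above.
gpus, micro_bs, grad_accum, cutoff_len = 8, 1, 4, 8196
step_time_s = 166.0

tokens_per_step = gpus * micro_bs * grad_accum * cutoff_len
print(f"tokens per optimizer step: {tokens_per_step:,}")                          # 262,272
print(f"tokens/s across all GPUs:  {tokens_per_step / step_time_s:,.0f}")         # ~1,580
print(f"tokens/s per GPU:          {tokens_per_step / step_time_s / gpus:,.0f}")  # ~197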

I also noticed that GPU power usage is quite low; it should be at least around 600 W.

[Image: screenshot of GPU power usage]
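
A small sketch (assuming the pynvml / nvidia-ml-py package is available) to log power draw and SM utilization during training; consistently low utilization together with low power would point to a communication or data-loading bottleneck rather than a compute one.

# Polls power draw and SM utilization of every GPU once per second.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(pynvml.nvmlDeviceGetCount())]

for _ in range(10):                                              # sample for ~10 seconds
    stats = []
    for h in handles:
        power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0     # reported in milliwatts
        util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu       # SM utilization in %
        stats.append(f"{power_w:.0f}W/{util}%")
    print(" | ".join(stats))
    time.sleep(1)

pynvml.nvmlShutdown()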

Others

Issue: slow training of Qwen3-VL-30B-A3B using DeepSpeed Stage 3

vladimiralbrekhtccr commented on Oct 30, 2025

Same here, looking for a solution to make full use of the GPUs.

dragonlzm commented on Nov 1, 2025

Ok

keita78 commented on Nov 19, 2025