Support for Qwen3-VL models

Open zhaohan-alan opened this issue 2 months ago • 15 comments

🚀 The feature, motivation and pitch

I’d like to use Liger-Kernel with Qwen3-VL models (e.g. Qwen3-VL-30B-A3B-Instruct). Currently, Qwen3-VL cannot initialize with Liger-Kernel.

It would be greatly appreciated if Qwen3-VL could be supported, as it could significantly reduce GPU memory usage!

Alternatives

No response

Additional context

No response

zhaohan-alan avatar Oct 07 '25 13:10 zhaohan-alan

@Tcc0403 I would like to give it a shot!

mayankagarwals avatar Oct 17 '25 17:10 mayankagarwals

Any update, @mayankagarwals @Tcc0403? It looks like it works, but it hasn't been merged into the main branch yet. I tested it, but it was about 3× slower than with the flag off :(

John1231983 avatar Oct 23 '25 05:10 John1231983

I tested it, but it was about 3× slower than with the flag off :(

Hi @John1231983 , I've started efforts here. Will keep the PR updated on progress.

For Qwen3, FLCE is integrated, all convergence tests are passing, and it's a patch of the existing kernel. Since FLCE trades some speed for memory savings, this may be expected, but I'll need to confirm.

mayankagarwals avatar Oct 23 '25 08:10 mayankagarwals

@John1231983 would you like to share your training recipe? Or maybe a screenshot of a profiling trace? With the Liger kernel, you should be able to train with a larger batch size or longer sequence length.

Tcc0403 avatar Oct 23 '25 13:10 Tcc0403

@Tcc0403 Sure. First, add apply_liger_kernel_to_qwen3_vl to the line at https://github.com/2U1/Qwen-VL-Series-Finetune/blob/master/src/train/train_sft.py#L19

Then, change the line at https://github.com/2U1/Qwen-VL-Series-Finetune/blob/master/src/train/train_sft.py#L110 to apply_liger_kernel_to_qwen3_vl()

Finally, set the flag to True at https://github.com/2U1/Qwen-VL-Series-Finetune/blob/master/scripts/finetune.sh#L20

You can then use a larger batch size. Good luck!
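
For anyone following along, the changes amount to roughly the following sketch (the liger_kernel.transformers import path is an assumption based on the library's usual layout; the surrounding trainer code comes from the linked repo):

# src/train/train_sft.py (sketch of the two code changes above)
from liger_kernel.transformers import apply_liger_kernel_to_qwen3_vl  # assumed import path

def train():
    # The patch must run before the Qwen3-VL model is instantiated so the
    # monkey-patched modules are picked up when the model classes are built.
    apply_liger_kernel_to_qwen3_vl()
    ...

# scripts/finetune.sh: the third change is just passing --use_liger_kernel True
# so the training script actually calls the patch.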

John1231983 avatar Oct 23 '25 13:10 John1231983

@John1231983 quick question: what's your environment and config?

Something like the following:

MODEL_NAME="Qwen/Qwen3-VL-4B-Instruct"
GLOBAL_BATCH_SIZE=128
BATCH_PER_DEVICE=4
NUM_DEVICES=8 

with the same config in zero3_offload.json?

BTW, have you tried training without offload? With Liger it might be possible.

Tcc0403 avatar Oct 23 '25 18:10 Tcc0403

@mayankagarwals Feel free to ping me if you run into any problems with the PR.

Tcc0403 avatar Oct 23 '25 18:10 Tcc0403

@Tcc0403 @mayankagarwals any update on Qwen3-VL support? Thanks

John1231983 avatar Oct 29 '25 14:10 John1231983

Hi @John1231983, Qwen3-VL is supported (for FLCE) through this branch. I'm working on mRoPE, which might take some time, but you can use it in its current state and should see benefits. CC @Tcc0403

mayankagarwals avatar Oct 29 '25 14:10 mayankagarwals

When I fine-tune Qwen3-VL with the latest ms-swift framework and transformers==4.57.0, turning on the Liger kernel makes the training process freeze indefinitely. Can you check it?

VietDunghacker avatar Nov 26 '25 16:11 VietDunghacker

@VietDunghacker Is there any trace or repro you can provide?

Tcc0403 avatar Nov 28 '25 20:11 Tcc0403

@Tcc0403 Thank you for providing support for Qwen3-VL. I have a question specifically regarding the interaction between Liger Kernel and DeepSpeed ZeRO.

After running several experiments, I noticed that:

  • DeepSpeed ZeRO-3 significantly reduces memory usage, but the training speed becomes extremely slow.
  • Even ZeRO-2, while better than ZeRO-3, is still noticeably slower compared to running without DeepSpeed.
  • When not using liger-kernel, it works very fast.
  • Using LoRA with liger-kernel is very fast.

I’m the maintainer of this repository:

https://github.com/2U1/Qwen-VL-Series-Finetune

And I’m using the following configuration to run experiments on 4×A100 GPUs:

#!/bin/bash

# MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
# MODEL_NAME="Qwen/Qwen2-VL-2B-Instruct"
# MODEL_NAME="Qwen/Qwen2.5-VL-3B-Instruct"
# MODEL_NAME="Qwen/Qwen2.5-VL-7B-Instruct"

MODEL_NAME="Qwen/Qwen3-VL-8B-Instruct"

GLOBAL_BATCH_SIZE=8
BATCH_PER_DEVICE=2
NUM_DEVICES=4
GRAD_ACCUM_STEPS=$((GLOBAL_BATCH_SIZE / (BATCH_PER_DEVICE * NUM_DEVICES)))

export PYTHONPATH=src:$PYTHONPATH

deepspeed src/train/train_sft.py \
    --use_liger_kernel True \
    --deepspeed scripts/zero3.json \
    --model_id $MODEL_NAME \
    --data_path /home/workspace/Qwen-VL-Series-Finetune/vlm20/conversations.json \
    --image_folder /home/workspace/Qwen-VL-Series-Finetune/vlm20/images \
    --remove_unused_columns False \
    --freeze_vision_tower False \
    --freeze_llm False \
    --freeze_merger False \
    --bf16 True \
    --fp16 False \
    --disable_flash_attn2 False \
    --output_dir output/test_fft \
    --num_train_epochs 1 \
    --per_device_train_batch_size $BATCH_PER_DEVICE \
    --gradient_accumulation_steps $GRAD_ACCUM_STEPS \
    --image_min_pixels $((512 * 32 * 32)) \
    --image_max_pixels $((1280 * 32 * 32)) \
    --learning_rate 1e-5 \
    --merger_lr 1e-5 \
    --vision_lr 2e-6 \
    --weight_decay 0.1 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --gradient_checkpointing True \
    --report_to tensorboard \
    --lazy_preprocess True \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 10 \
    --dataloader_num_workers 4

With this setup, I tested training on 4×A100 GPUs. Is there any recommended internal setting or best-practice configuration you would suggest for DeepSpeed or Qwen3-VL to avoid such severe slowdowns?

2U1 avatar Nov 29 '25 22:11 2U1

Liger kernel and ZeRO should be orthogonal, since we don't modify any logic related to communication. Does it only happen with Qwen3-VL?

Slowdowns are expected with ZeRO-2/3 since you incur extra communication costs; the question is how slow?

With ZeRO optimization strategies, lower memory usage is achieved by sharding optimizer states, gradients, and parameters (incrementally across ZeRO-1/2/3). You have to pay the communication cost for these tensors in every training step to ensure the model is updated correctly (identically) on each GPU. A minimal config sketch showing how these stages are selected follows the list below.

  • ZeRO-1: reduce-scatter in backward + all-gather after the optimizer step
  • ZeRO-2: reduce-scatter in backward + all-gather after the optimizer step (same as ZeRO-1, but unneeded gradients are released on the fly)
  • ZeRO-3 (FSDP): all-gather parameters in forward & backward for each layer + reduce-scatter in the backward pass + all-gather after the optimizer step
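
To make the stage selection concrete, here is a minimal sketch (not a tuned recipe) of a ZeRO config written as a Python dict, which deepspeed.initialize accepts via its config argument; the keys are standard DeepSpeed options and the values are illustrative assumptions:

import deepspeed  # only needed for the commented initialize call below

ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                 # 1/2/3 picks which states are sharded, per the list above
        "overlap_comm": True,       # overlap reduce-scatter with backward compute
        "contiguous_gradients": True,
    },
}

# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)  # model assumed to exist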

My suggestions would be:

  1. Profile and see the actual timeline/statistics (a minimal profiling sketch follows this list). Ideally, training frameworks have several techniques to hide most of these costs by overlapping compute and communication, minimizing GPU idle time. But if communication happens over low-bandwidth channels (inter-node, PCIe, ...), your training might spend most of its time on communication instead of actual computation. Profiling is the only way to figure it out.

  2. Only pay what you need. All these memory-reduction strategies have their own costs: ZeRO-1/2/3 for optimizer-state/gradient/parameter memory, or Liger's FLCE for activation memory. If plain DDP meets your memory requirements, i.e. everything fits on each GPU, there's no need for further memory-reduction strategies to get your training done. But once your training requires more memory, e.g. a larger batch size, longer context, or even dropping offloading or gradient checkpointing to speed up training, you can try different combinations of these strategies and hyperparameters to find the most efficient configuration for your needs in your environment.
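
As a starting point for suggestion 1, here is a minimal torch.profiler sketch (model, dataloader, and optimizer are assumed to already exist in your training loop; the resulting trace can be viewed in TensorBoard):

import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

prof = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),      # record a few steps after warmup
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
)

prof.start()
for step, batch in enumerate(dataloader):               # dataloader/model/optimizer assumed
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    prof.step()                                          # advance the profiler schedule
    if step >= 5:                                        # a handful of steps is enough for a timeline
        break
prof.stop()

# NCCL all-gather/reduce-scatter kernels show up on the CUDA timeline, so you can
# see whether communication overlaps with compute or leaves the GPU idle.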

Tcc0403 avatar Nov 30 '25 01:11 Tcc0403

Thanks for the clarification. Sorry for the confusion — I was referring specifically to Qwen3-VL.

Assuming that DeepSpeed is always enabled in my setup, the issue I’m seeing is that enabling the liger-kernel actually makes training significantly slower compared to disabling it, even under the exact same DeepSpeed configuration.

As you mentioned, the general slowdown from DeepSpeed compared to vanilla DDP is expected due to communication overhead, but the unexpected part is that with the same conditions, turning on liger-kernel results in worse performance than turning it off. That discrepancy was what I wanted to highlight.

2U1 avatar Nov 30 '25 02:11 2U1

Oh sorry, I misunderstood your question. So training other models doesn't have similar issues with liger-kernel enabled and the exact same configs? Only Qwen3-VL (Qwen3VLForConditionalGeneration)?

Currently the Qwen3-VL liger kernel patch is incomplete (no swiglu/geglu/layernorm), but that will be addressed in #957. However, the current rmsnorm and FLCE patches should work fine, since I didn't find anything special about them; FLCE and rmsnorm are patched in exactly the same way as in other models.

To check whether FLCE is the root cause, you can disable it by passing liger_kernel_config={'fused_linear_cross_entropy': False} in TrainingArguments. Alternatively, you can use apply_liger_kernel_to_qwen3_vl from our library if the training-args config doesn't work.
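
A minimal sketch of both options (the keyword name for the manual patch is an assumption based on liger-kernel's usual per-kernel flags):

from transformers import TrainingArguments

# Option 1: keep use_liger_kernel=True but turn off only FLCE via the HF config hook
args = TrainingArguments(
    output_dir="output/test_fft",
    use_liger_kernel=True,
    liger_kernel_config={"fused_linear_cross_entropy": False},
)

# Option 2: patch manually before the model is created, disabling FLCE explicitly
from liger_kernel.transformers import apply_liger_kernel_to_qwen3_vl
apply_liger_kernel_to_qwen3_vl(fused_linear_cross_entropy=False)  # kwarg name assumed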

Or try running with #957 and a larger batch size with the Liger kernel enabled. FLCE needs a larger batch_size x seq_len to saturate the GPU. Generally, you should be able to run with a much larger batch size that would normally OOM without Liger FLCE.

Tcc0403 avatar Nov 30 '25 03:11 Tcc0403

I appreciate your work on #957. I've tested it, and it seems to work quite well through Axolotl, thank you. Regarding 2U1's comment, I see a performance regression with liger_fused_linear_cross_entropy: true, but it's fairly minuscule, changing my step time from 11.5s to 12.5s with FSDP2.

thad0ctor avatar Dec 02 '25 04:12 thad0ctor