[Feature]: Qwen 3 MoE Lora adapter support.

Open bi1101 opened this issue 6 months ago • 13 comments

🚀 The feature, motivation and pitch

Feature Proposal:

Support for Qwen 3 MoE LoRA (Low-Rank Adaptation) adapter in vLLM to enable efficient fine-tuning and inference.

Motivation:

The Qwen 3 MoE models offer very good capabilities and performance. However, vLLM currently does not support loading LoRA adapters for them, which blocks efficient fine-tuning workflows and serving multiple fine-tuned variants from one base model.
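
For reference, once the MoE path supports it, serving such adapters would presumably follow vLLM's existing multi-LoRA flow. A minimal sketch using the offline API (the adapter path is a placeholder):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Standard vLLM multi-LoRA flow; Qwen/Qwen3-30B-A3B is the MoE base this issue targets.
llm = LLM(model="Qwen/Qwen3-30B-A3B", enable_lora=True, max_lora_rank=64)

# Placeholder path -- point this at a real fine-tuned adapter directory.
adapter = LoRARequest("qwen3-moe-adapter", 1, "/path/to/adapter")

outputs = llm.generate(
    ["Summarize mixture-of-experts routing in one sentence."],
    SamplingParams(max_tokens=64),
    lora_request=adapter,
)
print(outputs[0].outputs[0].text)
```

The same adapter could be exposed through the OpenAI-compatible server with `vllm serve Qwen/Qwen3-30B-A3B --enable-lora --lora-modules qwen3-moe-adapter=/path/to/adapter`.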

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

bi1101 avatar May 14 '25 06:05 bi1101

The main issue is that FusedMoE doesn't support LoRA, which is blocking this feature.

jeejeelee avatar May 14 '25 08:05 jeejeelee

Thanks for the info! Are there any plans to support that anytime soon?

bi1101 avatar May 14 '25 08:05 bi1101

The main issue is that FusedMoE doesn't support LoRA, which is blocking this feature.

Can LoRA weights be loaded into FusedMoE to enable LoRA inference?
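
If runtime loading is not feasible, one workaround (assuming a standard PEFT adapter) is to merge the LoRA weights into the base checkpoint offline and serve the merged model as a plain Qwen3 MoE. A sketch:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen3-30B-A3B"
adapter_dir = "/path/to/adapter"       # placeholder: a standard PEFT LoRA adapter
merged_dir = "/path/to/merged-model"   # output directory for the merged checkpoint

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, adapter_dir)

# Fold W + (alpha / r) * B @ A into every targeted weight, then drop the LoRA modules.
merged = model.merge_and_unload()
merged.save_pretrained(merged_dir)
AutoTokenizer.from_pretrained(base_id).save_pretrained(merged_dir)
# The merged directory can then be served by vLLM as an ordinary Qwen3 MoE checkpoint.
```

The trade-off is that merging gives up multi-adapter serving: each merged variant is a full copy of the model.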

chenchen0611 avatar May 20 '25 06:05 chenchen0611

Not sure what's the extent of work required to implement LoRA for MoE, but from what I can see, even popular MoE models like DeepSeek, Llama 4, etc. do not support LoRA.

This makes me think it will not be supported for a while... A bit disappointing, since MoE models are very efficient.

bi1101 avatar May 21 '25 15:05 bi1101

The latest docs seem to suggest that Qwen 3 MoE LoRA is supported. Is this a bug?

https://docs.vllm.ai/en/latest/models/supported_models.html#feature-status-legend_1

bi1101 avatar May 29 '25 05:05 bi1101

Just a bug in the docs😞, no model implementation.

https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/qwen3_moe.py

bi1101 avatar May 29 '25 05:05 bi1101

Yes, it's a bug, I am working on fixing it

jeejeelee avatar May 29 '25 05:05 jeejeelee

@bi1101 can you share your lora config?

jeejeelee avatar May 29 '25 06:05 jeejeelee

{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "Qwen/Qwen3-30B-A3B",
  "bias": "none",
  "corda_config": null,
  "eva_config": null,
  "exclude_modules": null,
  "fan_in_fan_out": null,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 64,
  "lora_bias": false,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 64,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "v_proj",
    "up_proj",
    "down_proj",
    "k_proj",
    "q_proj",
    "gate_proj",
    "o_proj"
  ],
  "task_type": "CAUSAL_LM",
  "trainable_token_indices": null,
  "use_dora": false,
  "use_rslora": false

bi1101 avatar May 29 '25 06:05 bi1101

It seems that the expert layers have been fine-tuned, which indeed makes it difficult to support LoRA in the short term.
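
A quick way to confirm that is to list the tensor names saved in the adapter; a sketch assuming the default PEFT adapter_model.safetensors layout and Hugging-Face-style module names (e.g. ...mlp.experts.<i>.gate_proj):

```python
from collections import Counter
from safetensors import safe_open

# Default file name PEFT uses when saving an adapter; adjust if yours differs.
path = "/path/to/adapter/adapter_model.safetensors"

buckets = Counter()
with safe_open(path, framework="pt") as f:
    for key in f.keys():
        if ".experts." in key:        # per-expert gate/up/down projections
            buckets["expert"] += 1
        elif ".self_attn." in key:    # q/k/v/o projections
            buckets["attention"] += 1
        else:
            buckets["other"] += 1

print(buckets)
```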

jeejeelee avatar May 29 '25 06:05 jeejeelee

Thanks for the info. It's a bit disappointing, but at least it gives us a clear signal to switch away from MoEs for fine-tuning.

bi1101 avatar May 29 '25 07:05 bi1101

How big is the performance difference between FusedMoE and building a custom MoE torch module yourself using ColumnParallelLinear and RowParallelLinear? With the custom-module approach, adapter support should come automatically, since those linear layers already support LoRA adapters.

Or maybe there are other considerations that I missed?
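
For context, my rough understanding is that the fused path groups tokens by expert and runs a few large grouped GEMMs over stacked expert weights, whereas an unfused module loops over experts and launches many small matmuls. A plain-PyTorch illustration of the unfused shape (not vLLM code; all names and shapes are made up for the example):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveMoE(nn.Module):
    """Unfused top-k MoE: one small Linear pair per expert, applied in a Python loop."""
    def __init__(self, hidden: int, inter: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.up = nn.ModuleList(nn.Linear(hidden, inter, bias=False) for _ in range(n_experts))
        self.down = nn.ModuleList(nn.Linear(inter, hidden, bias=False) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, hidden)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e, (up, down) in enumerate(zip(self.up, self.down)):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if token_ids.numel() == 0:
                continue
            # Many tiny GEMMs: this per-expert launch overhead is what fused/grouped kernels avoid.
            h = down(F.silu(up(x[token_ids])))
            out.index_add_(0, token_ids, h * weights[token_ids, slot].unsqueeze(-1))
        return out

x = torch.randn(8, 64)
print(NaiveMoE(hidden=64, inter=128, n_experts=4)(x).shape)  # torch.Size([8, 64])
```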

yx222 avatar Jun 19 '25 12:06 yx222

I think it's a bad idea to add LoRA to all experts; it would just slow down the model a lot. I did a fine-tune by only applying LoRA to the attention layers, plus a 2x-size fixed expert that I fully trained. Maybe supporting something similar would be better. I will try to get my setup working if I can figure out how to do it :)

nepeee avatar Jun 19 '25 20:06 nepeee

Strongly support this proposal. From an engineering perspective, prioritizing LoRA support for only the attention layers ('q_proj', 'k_proj', 'v_proj', 'o_proj') in the initial Qwen 3 MoE integration would be highly beneficial.

This focused approach offers significant memory/compute efficiency, aligns with common LoRA practices, and could accelerate the feature's rollout.
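
On the training side, such an adapter just needs target_modules restricted to the attention projections; a minimal PEFT sketch (rank and alpha are placeholders):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-30B-A3B", torch_dtype=torch.bfloat16)

# Attention-only LoRA: no expert MLPs touched, so the adapter maps onto the
# per-linear-layer LoRA support vLLM already has for dense models.
config = LoraConfig(
    r=64,             # placeholder rank
    lora_alpha=64,    # placeholder scaling
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```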

aweffr avatar Jul 01 '25 03:07 aweffr

I added LoRA support based on the dense Qwen3 model, and it seems to be working even with my hybrid implementation.

nepeee avatar Jul 01 '25 09:07 nepeee

Strongly support this proposal. From an engineering perspective, prioritizing LoRA support for only the attention layers ('q_proj', 'k_proj', 'v_proj', 'o_proj') in the initial Qwen 3 MoE integration would be highly beneficial.

This focused approach offers significant memory/compute efficiency, aligns with common LoRA practices, and could accelerate the feature's rollout.

Sounds good, I will try to support this ASAP. Contributions are also welcome.

jeejeelee avatar Jul 01 '25 15:07 jeejeelee

waiting

ehsanahmadkhan525 avatar Nov 04 '25 07:11 ehsanahmadkhan525