[Feature]: Qwen 3 MoE LoRA adapter support.
🚀 The feature, motivation and pitch
Feature Proposal:
Support LoRA (Low-Rank Adaptation) adapters for Qwen 3 MoE in vLLM to enable efficient fine-tuning and inference.
Motivation:
The Qwen 3 MoE models offer very good capabilities and performance. However, vLLM currently does not support loading LoRA adapters for them, which blocks fine-tuning workflows and serving multiple fine-tuned models.
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
The main issue is that FusedMoE doesn't support LoRA, which is blocking this feature.
Thanks for the info! Are there any plans to support that anytime soon?
> The main issue is that FusedMoE doesn't support LoRA, which is blocking this feature.
Can LoRA weights be loaded into FusedMoE to enable LoRA inference?
Not sure what extent of work is required to implement LoRA for MoE, but from what I see, even popular MoE models like DeepSeek, Llama 4, etc. do not support LoRA.
This makes me think this will not be supported for a while... A bit disappointing, since MoE models are very efficient.
The latest docs seem to suggest that Qwen 3 MoE LoRA is supported. Is this a bug?
https://docs.vllm.ai/en/latest/models/supported_models.html#feature-status-legend_1
Just a bug in the docs😞, no model implementation.
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/qwen3_moe.py
Yes, it's a bug, I am working on fixing it
@bi1101 can you share your LoRA config?
```json
{
"alpha_pattern": {},
"auto_mapping": null,
"base_model_name_or_path": "Qwen/Qwen3-30B-A3B",
"bias": "none",
"corda_config": null,
"eva_config": null,
"exclude_modules": null,
"fan_in_fan_out": null,
"inference_mode": true,
"init_lora_weights": true,
"layer_replication": null,
"layers_pattern": null,
"layers_to_transform": null,
"loftq_config": {},
"lora_alpha": 64,
"lora_bias": false,
"lora_dropout": 0.05,
"megatron_config": null,
"megatron_core": "megatron.core",
"modules_to_save": null,
"peft_type": "LORA",
"r": 64,
"rank_pattern": {},
"revision": null,
"target_modules": [
"v_proj",
"up_proj",
"down_proj",
"k_proj",
"q_proj",
"gate_proj",
"o_proj"
],
"task_type": "CAUSAL_LM",
"trainable_token_indices": null,
"use_dora": false,
"use_rslora": false
It seems that the expert layers have been fine-tuned, which indeed makes it difficult to support LoRA in the short term.
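For illustration, here is a minimal sketch (the adapter path and variable names are hypothetical) that splits the adapter's `target_modules` into attention projections, which map onto plain linear layers, and expert MLP projections, which live inside FusedMoE in vLLM and are the part that currently blocks LoRA:

```python
import json

# Attention projections map to regular (parallel) linear layers;
# the expert MLP projections are fused into FusedMoE in vLLM.
ATTENTION_PROJS = {"q_proj", "k_proj", "v_proj", "o_proj"}
EXPERT_MLP_PROJS = {"gate_proj", "up_proj", "down_proj"}

# Hypothetical adapter path, for illustration only.
with open("my-adapter/adapter_config.json") as f:
    cfg = json.load(f)

targets = set(cfg["target_modules"])
print("attention targets:", sorted(targets & ATTENTION_PROJS))
print("expert targets (inside FusedMoE):", sorted(targets & EXPERT_MLP_PROJS))
```

With the config posted above, gate_proj, up_proj, and down_proj all fall into the second bucket, which is why this adapter cannot be applied today.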
Thanks for the info. It's a bit disappointing, but it at least gives us some insight and a reason to switch away from MoE models for fine-tuning.
How big is the performance difference between FusedMoE and building a custom MoE torch module yourself from ColumnParallelLinear and RowParallelLinear? With the custom-module approach, adapter support should come automatically, since those linear layers already support LoRA adapters.
Or maybe there are other considerations that I missed?
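For reference, a rough sketch of what such an unfused MoE block could look like, using plain `nn.Linear` as a stand-in for vLLM's ColumnParallelLinear / RowParallelLinear (the class and parameter names here are made up for illustration). Because each selected expert runs its own small GEMMs in a Python loop instead of the single grouped kernel that FusedMoE uses, the throughput hit would likely be substantial:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnfusedMoE(nn.Module):
    """Sketch of an MoE block built from per-expert linear layers.

    In vLLM, the nn.Linear layers below would be ColumnParallelLinear /
    RowParallelLinear, which already have LoRA hooks, so adapter support
    would come "for free" at the cost of losing the fused expert kernel.
    """

    def __init__(self, hidden_size: int, intermediate_size: int,
                 num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            nn.ModuleDict({
                "gate_proj": nn.Linear(hidden_size, intermediate_size, bias=False),
                "up_proj": nn.Linear(hidden_size, intermediate_size, bias=False),
                "down_proj": nn.Linear(intermediate_size, hidden_size, bias=False),
            })
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, hidden_size]
        routing = F.softmax(self.gate(x), dim=-1)
        weights, expert_ids = torch.topk(routing, self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        # One small GEMM sequence per expert -- this loop is exactly what
        # FusedMoE avoids, and why the fused path is so much faster.
        for e, expert in enumerate(self.experts):
            token_idx, slot = (expert_ids == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            xe = x[token_idx]
            h = F.silu(expert["gate_proj"](xe)) * expert["up_proj"](xe)
            h = expert["down_proj"](h)
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * h
        return out
```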
I think it's a bad idea to add LoRA to all experts; it would just slow down the model a lot. I did a fine-tune by applying LoRA only to the attention layers and adding a 2x-size fixed expert that I fully trained. Maybe supporting something similar would be better. I will try to get my implementation working if I can figure out how to do it :)
Strongly support this proposal. From an engineering perspective, prioritizing LoRA support for only the attention layers ('q_proj', 'k_proj', 'v_proj', 'o_proj') in the initial Qwen 3 MoE integration would be highly beneficial.
This focused approach offers significant memory/compute efficiency, aligns with common LoRA practices, and could accelerate the feature's rollout.
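As a concrete example of that restriction, an attention-only adapter config in PEFT would look roughly like the sketch below (hyperparameters copied from the config posted above, minus the MLP/expert modules). This is illustrative, not an endorsed recipe:

```python
from peft import LoraConfig

# Attention-only LoRA: no gate_proj/up_proj/down_proj targets, so no adapter
# weights would need to be injected into the fused expert layers.
attention_only_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```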
I added LoRA support based on the dense Qwen3 model, and it seems to be working even with my hybrid implementation.
> Strongly support this proposal. From an engineering perspective, prioritizing LoRA support for only the attention layers ('q_proj', 'k_proj', 'v_proj', 'o_proj') in the initial Qwen 3 MoE integration would be highly beneficial.
> This focused approach offers significant memory/compute efficiency, aligns with common LoRA practices, and could accelerate the feature's rollout.
Sounds good, I will try to support this ASAP. Contributions are also welcome.
waiting