[Feature]: Qwen 3 MoE LoRA adapter support.
🚀 The feature, motivation and pitch
Feature Proposal:
Support LoRA (Low-Rank Adaptation) adapters for Qwen 3 MoE in vLLM to enable efficient fine-tuning and inference.
Motivation:
The Qwen 3 MoE models offer very good capabilities and performance. However, vLLM currently does not support loading LoRA adapters for them, which blocks fine-tuning workflows and serving multiple fine-tuned models.
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
The main issue is that FusedMoE doesn't support LoRA, which is blocking this feature.
Thanks for the info! Are there any plans to support that anytime soon?
> The main issue is that FusedMoE doesn't support LoRA, which is blocking this feature.
Can LoRA weights be loaded into FusedMoE to enable LoRA inference?
Not sure what extent of work is required to implement LoRA for MoE, but from what I see, even popular MoE models like DeepSeek, Llama 4, etc. do not support LoRA.
This makes me think this will not be supported for a while... A bit disappointing, since MoE models are very efficient.
The latest docs seem to suggest that Qwen 3 MoE LoRA is supported. Is this a bug?
https://docs.vllm.ai/en/latest/models/supported_models.html#feature-status-legend_1
Just a bug in the docs😞, no model implementation.
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/qwen3_moe.py
Yes, it's a bug, I am working on fixing it
@bi1101 can you share your LoRA config?
```json
{
"alpha_pattern": {},
"auto_mapping": null,
"base_model_name_or_path": "Qwen/Qwen3-30B-A3B",
"bias": "none",
"corda_config": null,
"eva_config": null,
"exclude_modules": null,
"fan_in_fan_out": null,
"inference_mode": true,
"init_lora_weights": true,
"layer_replication": null,
"layers_pattern": null,
"layers_to_transform": null,
"loftq_config": {},
"lora_alpha": 64,
"lora_bias": false,
"lora_dropout": 0.05,
"megatron_config": null,
"megatron_core": "megatron.core",
"modules_to_save": null,
"peft_type": "LORA",
"r": 64,
"rank_pattern": {},
"revision": null,
"target_modules": [
"v_proj",
"up_proj",
"down_proj",
"k_proj",
"q_proj",
"gate_proj",
"o_proj"
],
"task_type": "CAUSAL_LM",
"trainable_token_indices": null,
"use_dora": false,
"use_rslora": false
It seems that the expert layers have been fine-tuned, which indeed makes it difficult to support LoRA in the short term.
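For illustration, here is a minimal sketch (the adapter path and variable names are hypothetical) that splits the adapter's `target_modules` into attention projections, which map onto plain linear layers, and expert MLP projections, which live inside FusedMoE in vLLM and are the part that currently blocks LoRA:

```python
import json

# Attention projections map to regular (parallel) linear layers;
# the expert MLP projections are fused into FusedMoE in vLLM.
ATTENTION_PROJS = {"q_proj", "k_proj", "v_proj", "o_proj"}
EXPERT_MLP_PROJS = {"gate_proj", "up_proj", "down_proj"}

# Hypothetical adapter path, for illustration only.
with open("my-adapter/adapter_config.json") as f:
    cfg = json.load(f)

targets = set(cfg["target_modules"])
print("attention targets:", sorted(targets & ATTENTION_PROJS))
print("expert targets (inside FusedMoE):", sorted(targets & EXPERT_MLP_PROJS))
```

With the config posted above, gate_proj, up_proj, and down_proj all fall into the second bucket, which is why this adapter cannot be applied today.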
Thanks for the info. It's a bit disappointing, but it at least gives us some insight and a reason to switch away from MoE models for fine-tuning.
How big is the performance difference between FusedMoE and building a custom MoE torch module yourself from ColumnParallelLinear and RowParallelLinear? With the custom-module approach, adapter support should come automatically, since those linear layers already support LoRA adapters.
Or maybe there are other considerations that I missed?
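For reference, a rough sketch of what such an unfused MoE block could look like, using plain `nn.Linear` as a stand-in for vLLM's ColumnParallelLinear / RowParallelLinear (the class and parameter names here are made up for illustration). Because each selected expert runs its own small GEMMs in a Python loop instead of the single grouped kernel that FusedMoE uses, the throughput hit would likely be substantial:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnfusedMoE(nn.Module):
    """Sketch of an MoE block built from per-expert linear layers.

    In vLLM, the nn.Linear layers below would be ColumnParallelLinear /
    RowParallelLinear, which already have LoRA hooks, so adapter support
    would come "for free" at the cost of losing the fused expert kernel.
    """

    def __init__(self, hidden_size: int, intermediate_size: int,
                 num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            nn.ModuleDict({
                "gate_proj": nn.Linear(hidden_size, intermediate_size, bias=False),
                "up_proj": nn.Linear(hidden_size, intermediate_size, bias=False),
                "down_proj": nn.Linear(intermediate_size, hidden_size, bias=False),
            })
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, hidden_size]
        routing = F.softmax(self.gate(x), dim=-1)
        weights, expert_ids = torch.topk(routing, self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        # One small GEMM sequence per expert -- this loop is exactly what
        # FusedMoE avoids, and why the fused path is so much faster.
        for e, expert in enumerate(self.experts):
            token_idx, slot = (expert_ids == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            xe = x[token_idx]
            h = F.silu(expert["gate_proj"](xe)) * expert["up_proj"](xe)
            h = expert["down_proj"](h)
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * h
        return out
```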
I think it's a bad idea to add LoRA to all experts; it would just slow down the model a lot. I did a fine-tune by applying LoRA only to the attention layers and adding a 2x-size fixed expert that I fully trained. Maybe supporting something similar would be better. I will try to get my implementation working if I can figure out how to do it :)
Strongly support this proposal. From an engineering perspective, prioritizing LoRA support for only the attention layers ('q_proj', 'k_proj', 'v_proj', 'o_proj') in the initial Qwen 3 MoE integration would be highly beneficial.
This focused approach offers significant memory/compute efficiency, aligns with common LoRA practices, and could accelerate the feature's rollout.
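As a concrete example of that restriction, an attention-only adapter config in PEFT would look roughly like the sketch below (hyperparameters copied from the config posted above, minus the MLP/expert modules). This is illustrative, not an endorsed recipe:

```python
from peft import LoraConfig

# Attention-only LoRA: no gate_proj/up_proj/down_proj targets, so no adapter
# weights would need to be injected into the fused expert layers.
attention_only_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```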
I added LoRA support based on the dense Qwen3 model, and it seems to be working even with my hybrid implementation.
> Strongly support this proposal. From an engineering perspective, prioritizing LoRA support for only the attention layers ('q_proj', 'k_proj', 'v_proj', 'o_proj') in the initial Qwen 3 MoE integration would be highly beneficial.
> This focused approach offers significant memory/compute efficiency, aligns with common LoRA practices, and could accelerate the feature's rollout.
Sounds good, I will try to support this ASAP. Contributions are also welcome.
waiting