nano-vllm
Add FuseMoeLinear and support Qwen3-Moe
PR Summary
This PR adds Mixture-of-Experts (MoE) support to nano-vllm, addressing the current lack of MoE-related operators.
The Qwen3 MoE model (like other MoE models) in Hugging Face Transformers is known to be extremely slow, mainly due to Python-level for loops over experts during the forward pass.
This PR introduces a fused Triton-based kernel and corresponding layer implementations to enable efficient MoE inference.
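For context, the pattern being replaced looks roughly like the plain-PyTorch sketch below (a simplified illustration with made-up shapes, not the actual Transformers code): every expert is visited in a Python loop and runs a small, separate GEMM on only the tokens routed to it.

```python
import torch

num_experts, top_k, hidden = 8, 2, 64
tokens = torch.randn(32, hidden)
experts = [torch.nn.Linear(hidden, hidden) for _ in range(num_experts)]

router_logits = torch.randn(32, num_experts)
topk_weight, topk_idx = router_logits.softmax(-1).topk(top_k, dim=-1)

out = torch.zeros(32, hidden)
for e in range(num_experts):                       # Python loop over every expert
    rows, slots = (topk_idx == e).nonzero(as_tuple=True)
    if rows.numel() == 0:
        continue
    # each expert launches its own tiny GEMM on just the tokens routed to it
    out[rows] += topk_weight[rows, slots].unsqueeze(-1) * experts[e](tokens[rows])
```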
Key Changes
- Added operator `fused_moe_linear`
  - Location: `nanovllm/layers/fuse_moe`
  - Implementation is largely based on the MoE kernel in Unsloth, which in turn uses Triton's grouped GEMM (see the sketch after this list).
- Added Linear layer wrappers built on top of `fused_moe_linear`:
  - `ReplicatedFusedMoeLinear`
  - `ColumnParallelFusedMoeLinear`
  - `RowParallelFusedMoeLinear`
  - `MergedColumnParallelFusedMoeLinear`

  These enable seamless MoE integration in distributed model architectures.
- Introduced `Qwen3MoeForCausalLM`
  - Enables running Qwen3-MoE models efficiently with nano-vllm
  - Designed to be extensible to other MoE-based models in the future
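The grouped-GEMM idea behind `fused_moe_linear` can be sketched in plain PyTorch as below (this is not the PR's Triton kernel; names, shapes, and top-1 routing are illustrative only): tokens are gathered so each expert's rows are contiguous, each group gets one matmul, and results are scattered back. The Triton kernel additionally fuses the per-group matmuls into a single launch instead of the Python loop shown here.

```python
import torch

num_experts, hidden = 8, 64
tokens = torch.randn(32, hidden)
expert_weights = torch.randn(num_experts, hidden, hidden)   # all expert weights stacked
expert_ids = torch.randint(0, num_experts, (32,))           # top-1 routing for brevity

# Gather tokens so each expert's rows are contiguous (the kernel achieves
# this via sorted indices / group offsets).
order = torch.argsort(expert_ids)
sorted_tokens = tokens[order]
counts = torch.bincount(expert_ids, minlength=num_experts)
ends = torch.cumsum(counts, dim=0).tolist()

out_sorted = torch.empty_like(sorted_tokens)
start = 0
for e, end in enumerate(ends):              # one matmul per contiguous group;
    if end > start:                         # the Triton kernel fuses these into one launch
        out_sorted[start:end] = sorted_tokens[start:end] @ expert_weights[e]
    start = end

out = torch.empty_like(tokens)
out[order] = out_sorted                     # scatter results back to the original token order
```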
Test Configuration
| Component | Specification |
|---|---|
| Model | Qwen3-30B-A3B |
| Hardware | A100-SXM4-80GB × 1 |
| Total Requests | 8 / 256 sequences |
| Input Length | Randomly sampled between 100–1024 tokens |
| Output Length | Randomly sampled between 100–1024 tokens |
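The setup above can be driven roughly as in the sketch below. It assumes nano-vllm's vLLM-style `LLM` / `SamplingParams` interface; the model path, engine arguments, and whether `generate` accepts token-id prompts with per-request `SamplingParams` are assumptions for illustration, not details taken from this PR.

```python
from random import randint, seed

from nanovllm import LLM, SamplingParams

seed(0)
num_seqs = 256                                    # or 8 for the small run
model_path = "Qwen/Qwen3-30B-A3B"                 # placeholder path

llm = LLM(model_path)                             # extra engine args omitted

# random prompt lengths, 100-1024 tokens each
prompt_token_ids = [
    [randint(0, 10000) for _ in range(randint(100, 1024))] for _ in range(num_seqs)
]
# random output lengths, 100-1024 tokens each
sampling_params = [
    SamplingParams(temperature=0.6, ignore_eos=True, max_tokens=randint(100, 1024))
    for _ in range(num_seqs)
]

outputs = llm.generate(prompt_token_ids, sampling_params)
```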
Inference Results
| Model | Output Tokens | Time (s) | Throughput (tokens/s) |
|---|---|---|---|
| Nano-vLLM | 133,966 | 105.49 | 1269.94 |
| Native HF Impl | 4,186 | 480.59 | 8.71 |
In class `Qwen3MoeFusedSparseMoeBlock`, the parameter `gate` is defined using `RowParallelLinear`. I think this is incorrect: if `tp_size > 1`, the weight is partitioned across different cards, so the gate's `input_size` no longer matches the size of `hidden_states`. I think `ReplicatedLinear` is the right choice, since it does not partition the weight across cards.
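A minimal illustration of the mismatch in plain PyTorch (not nano-vllm's actual layer classes; all sizes are made up): a row-parallel linear shards its input dimension across `tp_size` ranks, while the router gate always sees full-width `hidden_states`, so a replicated gate weight avoids the mismatch.

```python
import torch

hidden_size, num_experts, tp_size = 2048, 128, 2
hidden_states = torch.randn(4, hidden_size)              # full hidden dim on every rank

# RowParallelLinear-style shard: the input dimension is split across ranks,
# so each rank only holds hidden_size // tp_size input columns of the gate.
row_parallel_shard = torch.randn(num_experts, hidden_size // tp_size)
# hidden_states @ row_parallel_shard.t() would fail: 2048 vs 1024

# ReplicatedLinear-style weight: every rank keeps the full gate, so router
# logits can be computed locally without any communication.
replicated_gate = torch.randn(num_experts, hidden_size)
router_logits = hidden_states @ replicated_gate.t()      # [4, num_experts]
print(router_logits.shape)
```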