nano-vllm
Add FuseMoeLinear and support Qwen3-Moe
PR Summary
This PR adds Mixture-of-Experts (MoE) support to nano-vllm, addressing the current lack of MoE-related operators.
The Qwen3 MoE model (like other MoE models) in Hugging Face Transformers is known to be extremely slow, mainly due to Python-level for loops over experts during the forward pass.
This PR introduces a fused Triton-based kernel and corresponding layer implementations to enable efficient MoE inference.
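For context, the pattern being replaced looks roughly like the plain-PyTorch sketch below (a simplified illustration with made-up shapes, not the actual Transformers code): every expert is visited in a Python loop and runs a small, separate GEMM on only the tokens routed to it.

```python
import torch

num_experts, top_k, hidden = 8, 2, 64
tokens = torch.randn(32, hidden)
experts = [torch.nn.Linear(hidden, hidden) for _ in range(num_experts)]

router_logits = torch.randn(32, num_experts)
topk_weight, topk_idx = router_logits.softmax(-1).topk(top_k, dim=-1)

out = torch.zeros(32, hidden)
for e in range(num_experts):                       # Python loop over every expert
    rows, slots = (topk_idx == e).nonzero(as_tuple=True)
    if rows.numel() == 0:
        continue
    # each expert launches its own tiny GEMM on just the tokens routed to it
    out[rows] += topk_weight[rows, slots].unsqueeze(-1) * experts[e](tokens[rows])
```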
Key Changes
- Added operator `fused_moe_linear`
  - Location: `nanovllm/layers/fuse_moe`
  - Implementation is largely based on the MoE kernel in Unsloth, which in turn uses Triton's grouped GEMM (see the sketch after this list).
- Added Linear layer wrappers built on top of `fused_moe_linear`:
  - `ReplicatedFusedMoeLinear`
  - `ColumnParallelFusedMoeLinear`
  - `RowParallelFusedMoeLinear`
  - `MergedColumnParallelFusedMoeLinear`

  These enable seamless MoE integration in distributed model architectures.
- Introduced `Qwen3MoeForCausalLM`
  - Enables running Qwen3-MoE models efficiently with nano-vllm
  - Designed to be extensible to other MoE-based models in the future
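The grouped-GEMM idea behind `fused_moe_linear` can be sketched in plain PyTorch as below (this is not the PR's Triton kernel; names, shapes, and top-1 routing are illustrative only): tokens are gathered so each expert's rows are contiguous, each group gets one matmul, and results are scattered back. The Triton kernel additionally fuses the per-group matmuls into a single launch instead of the Python loop shown here.

```python
import torch

num_experts, hidden = 8, 64
tokens = torch.randn(32, hidden)
expert_weights = torch.randn(num_experts, hidden, hidden)   # all expert weights stacked
expert_ids = torch.randint(0, num_experts, (32,))           # top-1 routing for brevity

# Gather tokens so each expert's rows are contiguous (the kernel achieves
# this via sorted indices / group offsets).
order = torch.argsort(expert_ids)
sorted_tokens = tokens[order]
counts = torch.bincount(expert_ids, minlength=num_experts)
ends = torch.cumsum(counts, dim=0).tolist()

out_sorted = torch.empty_like(sorted_tokens)
start = 0
for e, end in enumerate(ends):              # one matmul per contiguous group;
    if end > start:                         # the Triton kernel fuses these into one launch
        out_sorted[start:end] = sorted_tokens[start:end] @ expert_weights[e]
    start = end

out = torch.empty_like(tokens)
out[order] = out_sorted                     # scatter results back to the original token order
```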
Test Configuration
| Component | Specification |
|---|---|
| Model | Qwen3-30B-A3B |
| Hardware | A100-SXM4-80GB × 1 |
| Total Requests | 8 / 256 sequences |
| Input Length | Randomly sampled between 100–1024 tokens |
| Output Length | Randomly sampled between 100–1024 tokens |
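The setup above can be driven roughly as in the sketch below. It assumes nano-vllm's vLLM-style `LLM` / `SamplingParams` interface; the model path, engine arguments, and whether `generate` accepts token-id prompts with per-request `SamplingParams` are assumptions for illustration, not details taken from this PR.

```python
from random import randint, seed

from nanovllm import LLM, SamplingParams

seed(0)
num_seqs = 256                                    # or 8 for the small run
model_path = "Qwen/Qwen3-30B-A3B"                 # placeholder path

llm = LLM(model_path)                             # extra engine args omitted

# random prompt lengths, 100-1024 tokens each
prompt_token_ids = [
    [randint(0, 10000) for _ in range(randint(100, 1024))] for _ in range(num_seqs)
]
# random output lengths, 100-1024 tokens each
sampling_params = [
    SamplingParams(temperature=0.6, ignore_eos=True, max_tokens=randint(100, 1024))
    for _ in range(num_seqs)
]

outputs = llm.generate(prompt_token_ids, sampling_params)
```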
Inference Results
| Model | Output Tokens | Time (s) | Throughput (tokens/s) |
|---|---|---|---|
| Nano-vLLM | 133,966 | 105.49 | 1269.94 |
| Native HF Impl | 4,186 | 480.59 | 8.71 |
In class `Qwen3MoeFusedSparseMoeBlock`, the parameter `gate` is defined using `RowParallelLinear`. I think this is incorrect: if `tp_size > 1`, the weight is partitioned across different cards, so the gate's `input_size` no longer matches the size of `hidden_states`. I think `ReplicatedLinear` is the right choice, since it does not partition the weight across cards.
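A minimal illustration of the mismatch in plain PyTorch (not nano-vllm's actual layer classes; all sizes are made up): a row-parallel linear shards its input dimension across `tp_size` ranks, while the router gate always sees full-width `hidden_states`, so a replicated gate weight avoids the mismatch.

```python
import torch

hidden_size, num_experts, tp_size = 2048, 128, 2
hidden_states = torch.randn(4, hidden_size)              # full hidden dim on every rank

# RowParallelLinear-style shard: the input dimension is split across ranks,
# so each rank only holds hidden_size // tp_size input columns of the gate.
row_parallel_shard = torch.randn(num_experts, hidden_size // tp_size)
# hidden_states @ row_parallel_shard.t() would fail: 2048 vs 1024

# ReplicatedLinear-style weight: every rank keeps the full gate, so router
# logits can be computed locally without any communication.
replicated_gate = torch.randn(num_experts, hidden_size)
router_logits = hidden_states @ replicated_gate.t()      # [4, num_experts]
print(router_logits.shape)
```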