
Add FuseMoeLinear and support Qwen3-Moe

Open · Tokisakix opened this issue 1 month ago · 1 comment

PR Summary

This PR adds Mixture-of-Experts (MoE) support to nano-vllm, which currently lacks MoE-related operators. MoE models such as Qwen3-MoE are notoriously slow under Hugging Face Transformers, mainly because the forward pass iterates over experts in a Python-level for loop. This PR introduces a fused Triton-based kernel and corresponding layer implementations to enable efficient MoE inference.
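For context, this is the kind of per-expert Python loop the fused kernel replaces (a simplified sketch, not the exact Transformers code; `gate`, `experts`, and `top_k` are illustrative names):

```python
import torch

def naive_moe_forward(hidden_states, gate, experts, top_k):
    # Route each token: pick its top-k experts and their normalized weights.
    router_logits = gate(hidden_states)                  # [tokens, num_experts]
    probs = torch.softmax(router_logits, dim=-1)
    topk_weights, topk_ids = torch.topk(probs, top_k, dim=-1)
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)

    out = torch.zeros_like(hidden_states)
    # The bottleneck: a Python-level loop that launches one small,
    # poorly utilized GEMM per expert instead of a single fused kernel.
    for expert_id, expert in enumerate(experts):
        token_idx, slot = (topk_ids == expert_id).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        expert_out = expert(hidden_states[token_idx])
        out[token_idx] += topk_weights[token_idx, slot].unsqueeze(-1) * expert_out
    return out
```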


Key Changes

  1. Added operator fused_moe_linear

    • Location: nanovllm/layers/fuse_moe
    • The implementation is largely based on the MoE kernel in Unsloth, which in turn builds on Triton's grouped GEMM (see the reference sketch after this list).
  2. Added Linear layer wrappers built on top of fused_moe_linear:

    • ReplicatedFusedMoeLinear
    • ColumnParallelFusedMoeLinear
    • RowParallelFusedMoeLinear
    • MergedColumnParallelFusedMoeLinear

    These enable seamless MoE integration in distributed model architectures.
  3. Introduced Qwen3MoeForCausalLM

    • Enables running Qwen3-MoE models efficiently with nano-vllm
    • Designed to be extensible for other MoE-based models in the future
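The core idea behind fused_moe_linear, sketched as reference PyTorch (the actual operator is a single Triton grouped-GEMM launch; all names and shapes here are illustrative, not the PR's API):

```python
import torch

def grouped_moe_linear_reference(hidden_states, expert_weights, topk_ids):
    """Reference semantics of a fused MoE linear.

    hidden_states:  [tokens, hidden]
    expert_weights: [num_experts, hidden, out_dim]
    topk_ids:       [tokens, top_k]

    Tokens are sorted so each expert's rows are contiguous; the Triton
    kernel then runs all per-expert GEMMs as one grouped GEMM instead
    of this Python loop.
    """
    tokens, top_k = topk_ids.shape
    num_experts = expert_weights.shape[0]

    flat_ids = topk_ids.reshape(-1)           # [tokens * top_k]
    order = torch.argsort(flat_ids)           # group (token, k) pairs by expert
    gathered = hidden_states[order // top_k]  # replicate each token per pick

    counts = torch.bincount(flat_ids, minlength=num_experts).tolist()
    out = gathered.new_empty(gathered.shape[0], expert_weights.shape[2])
    start = 0
    for e in range(num_experts):              # one contiguous GEMM per expert
        end = start + counts[e]
        if end > start:
            out[start:end] = gathered[start:end] @ expert_weights[e]
        start = end

    unsorted = torch.empty_like(out)
    unsorted[order] = out                     # undo the expert sort
    return unsorted.view(tokens, top_k, -1)
```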

Test Configuration

| Component      | Specification                                |
|----------------|----------------------------------------------|
| Model          | Qwen3-30B-A3B                                |
| Hardware       | A100-SXM4-80GB × 1                           |
| Total Requests | 8 (Native HF) / 256 (Nano-vLLM) sequences    |
| Input Length   | Randomly sampled from 100–1024 tokens        |
| Output Length  | Randomly sampled from 100–1024 tokens        |
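A minimal sketch of how such a benchmark could be driven through nano-vllm's vLLM-style API (a hypothetical driver, not a script from the PR; the model path and sampling settings are assumptions):

```python
import random
from random import randint
from nanovllm import LLM, SamplingParams

random.seed(0)
llm = LLM("Qwen/Qwen3-30B-A3B", tensor_parallel_size=1)

num_seqs = 256  # 8 for the Transformers baseline run
# Random prompt/output lengths in [100, 1024], matching the table above.
prompt_token_ids = [
    [randint(0, 10000) for _ in range(randint(100, 1024))]
    for _ in range(num_seqs)
]
sampling_params = [
    SamplingParams(temperature=0.6, ignore_eos=True,
                   max_tokens=randint(100, 1024))
    for _ in range(num_seqs)
]
outputs = llm.generate(prompt_token_ids, sampling_params)
```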

Inference Results

| Model          | Output Tokens | Time (s) | Throughput (tokens/s) |
|----------------|---------------|----------|-----------------------|
| Nano-vLLM      | 133,966       | 105.49   | 1269.94               |
| Native HF Impl | 4,186         | 480.59   | 8.71                  |

Tokisakix · Oct 16 '25 08:10

In class Qwen3MoeFusedSparseMoeBlock, the gate parameter is defined using RowParallelLinear. I think this is incorrect: when tp_size > 1, RowParallelLinear partitions the weight across cards along the input dimension, so its input_size no longer matches the hidden_states dimension. ReplicatedLinear, which keeps the full weight on every card rather than partitioning it, seems to be the right choice.
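To illustrate the shape mismatch (a minimal sketch in plain PyTorch; the sizes are made up, and the sharding semantics assumed here follow vLLM-style parallel linear layers):

```python
import torch

hidden_size, num_experts, tp_size = 2048, 128, 2
hidden_states = torch.randn(4, hidden_size)  # full, unsplit activations

# RowParallelLinear shards the weight along the *input* dimension:
# each rank holds only hidden_size // tp_size input features and
# expects an input that is already split across ranks.
row_parallel_shard = torch.randn(hidden_size // tp_size, num_experts)
# hidden_states @ row_parallel_shard  ->  shape error: 2048 vs 1024

# ReplicatedLinear keeps the full weight on every rank, so the router
# gate can consume the unsplit hidden_states directly.
replicated_weight = torch.randn(hidden_size, num_experts)
router_logits = hidden_states @ replicated_weight  # [4, num_experts]
```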

ygch · Nov 07 '25 06:11