[Feature] Qwen3-Next-80B-A3B Support
Motivation
Hoping 80B-A3B support lands soon so that older cards can run it too.
Related resources
No response
Additional context
No response
Yes please - for v100s!
This feels difficult. Adapting to a new architecture is always a lot of work, and PTQ quantization won't be easy to get right either.
GPT5 analysis of changes required - correct @maxin9966 ?
What would have to change in LMDeploy TurboMind (for Volta)
- Model registration / config parsing
  - Add a `qwen3_next` architecture to TurboMind: parse the HF config keys (hybrid layout, GA/DeltaNet head counts, RoPE dim=64, KV heads=2, GA head_dim=256; DeltaNet head_dim=128; 48 layers; 512 experts, Top-10 activated, 1 shared expert) and map them to TurboMind layer objects. (Hugging Face) See the config-parsing sketch after this list.
- New attention kernels
  - Gated Attention (GA): implement/fuse the Q/K/V projections, gating, RoPE(64), and attention with Q heads=16, KV heads=2, head_dim=256. A reference-semantics sketch follows this list.
  - Gated DeltaNet: implement the linear-attention-style "DeltaNet" path with its gating; separate kernels for QK (16 heads) and V (32 heads), head_dim=128.
  - Integrate both into TurboMind's decode/KV-management and paged-KV flow (Flash-/Paged-decoding analogues). These blocks don't exist in TurboMind today. (Hugging Face)
- MoE routing at Top-10-of-512 (+1 shared)
  - Implement token router → top-k expert selection, capacity & dispatch, with expert parallelism (EP) or equivalent sharding; otherwise all 512 experts must be resident per GPU, which is impractical. TurboMind has MoE support, but scaling to 512 experts with Top-10 dispatch requires efficient all-to-all and placement policies across TP×EP on NVLink. (lmdeploy.readthedocs.io) See the routing sketch after this list.
- Volta dtype path (no BF16/FP8 on V100)
  - Add a robust BF16→FP16 weight cast on load (and ensure the numerics of the zero-centered/weight-decayed layernorm remain stable in FP16); a cast sketch follows this list.
  - Keep compute in FP16; FP8 kernels are not an option on V100. (V100 Tensor Cores accelerate FP16 only; BF16 arrived with Ampere; FP8 with Hopper.) (NVIDIA Images)
- KV-cache quantization reality
  - TurboMind does not support 4/8-bit KV when head_dim≠128. GA uses head_dim=256, so you must run FP16 KV (or extend TurboMind to support KV-INTx for non-128 dims). Expect materially higher KV memory at long context; see the back-of-the-envelope estimate after this list. (lmdeploy.readthedocs.io)
- Speculative/MTP path (optional, for speed)
  - Qwen3-Next includes MTP; to match SGLang/vLLM performance, you'd add a speculative-decoding plugin (e.g., NEXTN/Eagle variants). If omitted, it still runs, just slower. (Hugging Face)
- Weight loader & tensor mapping
  - Add safetensor→TurboMind weight remapping for `qwen3_next` (distinct parameter names for GA/DeltaNet/MoE; shared expert). Provide a converter that flattens the experts for EP sharding. A remapping sketch follows this list.
- Distributed runtime
  - Enable TP=8 + EP≥8 to fit the experts and reduce the per-GPU footprint; add all-to-all routing (NCCL) optimizations and NVLink-aware placement. Provide launch recipes (`--tp 8 --ep 8`, gradients off, paged KV on); see the engine-config sketch after this list.
- Context scaling & RoPE
  - Native context is 262k (extensible with YaRN). On TurboMind/Volta you'll likely ship defaults at 32–64k first (given FP16 KV and no KV-INTx for head_dim=256). Later, add YaRN hooks in the loader to reach ≥256k; see the YaRN config sketch below. (Hugging Face)
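A few hedged Python sketches of the items above follow, roughly in the order of the list. First, config parsing: a minimal sketch of mapping the HF config into a TurboMind-side struct. The dataclass fields and the HF key names are assumptions for illustration; only the fallback values come from the figures quoted above.

```python
from dataclasses import dataclass
import json

@dataclass
class Qwen3NextTurboMindCfg:
    """Hypothetical TurboMind-side config; field names are illustrative."""
    num_layers: int          # 48
    ga_q_heads: int          # 16
    ga_kv_heads: int         # 2
    ga_head_dim: int         # 256
    rope_dim: int            # 64 (partial rotary)
    deltanet_qk_heads: int   # 16
    deltanet_v_heads: int    # 32
    deltanet_head_dim: int   # 128
    num_experts: int         # 512
    experts_per_token: int   # 10
    num_shared_experts: int  # 1

def parse_hf_config(path: str) -> Qwen3NextTurboMindCfg:
    # The HF key names below are assumptions; check the released config.json
    # and adjust before wiring this into a real loader.
    with open(path) as f:
        cfg = json.load(f)
    return Qwen3NextTurboMindCfg(
        num_layers=cfg.get("num_hidden_layers", 48),
        ga_q_heads=cfg.get("num_attention_heads", 16),
        ga_kv_heads=cfg.get("num_key_value_heads", 2),
        ga_head_dim=cfg.get("head_dim", 256),
        rope_dim=cfg.get("rotary_dim", 64),
        deltanet_qk_heads=cfg.get("linear_num_key_heads", 16),
        deltanet_v_heads=cfg.get("linear_num_value_heads", 32),
        deltanet_head_dim=cfg.get("linear_value_head_dim", 128),
        num_experts=cfg.get("num_experts", 512),
        experts_per_token=cfg.get("num_experts_per_tok", 10),
        num_shared_experts=cfg.get("num_shared_experts", 1),
    )
```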
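Next, the Gated Attention block as reference semantics (not a kernel): GQA with 16 query heads over 2 KV heads at head_dim=256, partial RoPE on the first 64 dims, and a sigmoid output gate. The gate placement and the RoPE convention are assumptions; a real TurboMind kernel would fuse these steps and read K/V from the paged cache.

```python
import torch
import torch.nn.functional as F

def apply_partial_rope(x, cos, sin, rope_dim=64):
    # x: [B, T, H, 256]; cos/sin: [T, 1, 32] so they broadcast over batch and heads.
    # Rotate only the first `rope_dim` channels of each head; pass the rest through.
    x_rot, x_pass = x[..., :rope_dim], x[..., rope_dim:]
    x1, x2 = x_rot.chunk(2, dim=-1)
    rotated = torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return torch.cat((rotated, x_pass), dim=-1)

def gated_attention_reference(q, k, v, gate, cos, sin):
    # q: [B, T, 16, 256], k/v: [B, T, 2, 256], gate: [B, T, 16*256]
    q = apply_partial_rope(q, cos, sin)
    k = apply_partial_rope(k, cos, sin)
    # Expand the 2 KV heads to the 16 query heads (GQA, group size 8).
    k = k.repeat_interleave(8, dim=2)
    v = v.repeat_interleave(8, dim=2)
    # scaled_dot_product_attention expects [B, heads, T, head_dim].
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    out = out.transpose(1, 2).flatten(-2)       # back to [B, T, 16*256]
    return out * torch.sigmoid(gate)            # output gating (assumed placement)
```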
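The Top-10-of-512 router, again as reference semantics only. The softmax-then-renormalize weighting and the ungated shared expert are assumptions; capacity limits and the EP all-to-all dispatch, which are the hard parts, are left out.

```python
import torch

def route_top10_of_512(hidden, router_weight, experts, shared_expert, top_k=10):
    """hidden: [tokens, d]; router_weight: [512, d]; experts / shared_expert: callables."""
    logits = hidden @ router_weight.t()                    # [tokens, 512]
    probs = torch.softmax(logits, dim=-1, dtype=torch.float32)
    topk_p, topk_idx = probs.topk(top_k, dim=-1)           # [tokens, 10]
    topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)     # renormalize (assumed)

    out = shared_expert(hidden)                            # shared expert, always active
    # Naive dispatch: group tokens by expert id. A real EP implementation would
    # pack the tokens and all-to-all them to the GPUs owning each expert instead.
    for e in topk_idx.unique().tolist():
        token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
        w = topk_p[token_ids, slot].unsqueeze(-1).to(hidden.dtype)
        out[token_ids] += w * experts[e](hidden[token_ids])
    return out
```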
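The BF16→FP16 cast on load, per safetensors shard. Clamping to the FP16 range before the cast guards against BF16 values that would otherwise overflow to inf; file paths are whatever the converter hands you.

```python
import torch
from safetensors.torch import load_file, save_file

FP16_MAX = torch.finfo(torch.float16).max

def cast_shard_bf16_to_fp16(src_path: str, dst_path: str) -> None:
    tensors = load_file(src_path)               # dict[str, torch.Tensor], on CPU
    out = {}
    for name, t in tensors.items():
        if t.dtype == torch.bfloat16:
            # BF16 has a much wider exponent range than FP16; clamp to avoid inf.
            t = t.float().clamp_(-FP16_MAX, FP16_MAX).half()
        out[name] = t
    save_file(out, dst_path)
```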
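A back-of-the-envelope FP16 KV-cache estimate. It assumes a 3:1 DeltaNet-to-GA hybrid layout, i.e. 12 of the 48 layers keep a KV cache; treat that count as an assumption and plug in the real layout once confirmed.

```python
def fp16_kv_bytes_per_token(kv_heads=2, head_dim=256, full_attn_layers=12):
    # K and V, 2 bytes per FP16 element, per full-attention (GA) layer.
    return 2 * kv_heads * head_dim * 2 * full_attn_layers

per_token = fp16_kv_bytes_per_token()            # 24576 bytes ≈ 24 KiB per token
for ctx in (32_768, 65_536, 262_144):
    print(f"{ctx:>7} tokens -> {ctx * per_token / 2**30:.2f} GiB FP16 KV cache")
# ~0.75 GiB at 32k, ~1.5 GiB at 64k, ~6 GiB at 256k -- per sequence, before any
# batching, and with no KV-INTx available for head_dim=256.
```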
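A sketch of the safetensor→TurboMind name remapping. Both the HF-side patterns and the TurboMind-side names here are hypothetical placeholders; the real table has to be read off the released checkpoint and TurboMind's weight classes.

```python
import re

# Hypothetical (source pattern, target template) pairs -- illustration only.
REMAP_RULES = [
    (r"model\.layers\.(\d+)\.self_attn\.(q|k|v|o)_proj\.weight",
     r"layers.\1.ga.\2_proj.weight"),
    (r"model\.layers\.(\d+)\.linear_attn\.(.+)",
     r"layers.\1.deltanet.\2"),
    (r"model\.layers\.(\d+)\.mlp\.experts\.(\d+)\.(.+)",
     r"layers.\1.moe.expert.\2.\3"),
    (r"model\.layers\.(\d+)\.mlp\.shared_expert\.(.+)",
     r"layers.\1.moe.shared.\2"),
]

def remap_name(hf_name: str) -> str:
    for pattern, template in REMAP_RULES:
        if re.fullmatch(pattern, hf_name):
            return re.sub(pattern, template, hf_name)
    return hf_name   # pass through anything not recognized
```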
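What the launch recipe could look like through LMDeploy's Python API once (if) support lands. `tp`, `session_len` and `cache_max_entry_count` are existing `TurbomindEngineConfig` fields; the expert-parallel degree is shown only as a comment because no such option exists today, and this obviously does not run against current releases.

```python
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    tp=8,                         # tensor parallelism across the 8 V100s
    session_len=65_536,           # conservative default given FP16 KV (see estimate above)
    cache_max_entry_count=0.7,    # fraction of free memory reserved for the paged KV cache
    # ep=8,                       # hypothetical expert-parallel degree; does not exist yet
)

pipe = pipeline("Qwen/Qwen3-Next-80B-A3B-Instruct", backend_config=engine_cfg)
print(pipe(["Would Qwen3-Next fit on 8 x V100?"]))
```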
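Finally, the YaRN hook would amount to honoring a `rope_scaling` block like the ones Qwen ships for its other models; the 4x factor and the key names here are illustrative, not taken from the Qwen3-Next card.

```python
# Illustrative rope_scaling block a YaRN-aware loader would consume. The native
# 262144-token window is from the model card; the 4x factor is just an example.
yarn_rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262_144,
}

def scaled_max_context(rope_scaling: dict) -> int:
    return int(rope_scaling["original_max_position_embeddings"] * rope_scaling["factor"])

print(scaled_max_context(yarn_rope_scaling))     # 1048576
```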
Old V100 cards depend entirely on lmdeploy.
Really? I am running lmdeploy with the TurboMind backend on 8 x V100s. PyTorch no longer supports Volta.
Running it with vLLM just OOMs and there's nothing I can do about it; the KV cache is quite large. Hoping this gets supported, I only run quantized models. Only the latest flash_attn (2.8.3) supports torch 2.8, and the latest vLLM only supports torch 2.8, but KV-cache quantization requires flash_attn<=2.8.2.
lmdeploy is the shining light of technology for old cards.
This feels very hard; there's no sign of it being adapted right now. llama.cpp hasn't adapted it either.
This probably won't happen that quickly.
Is there any chance for turbomind support with this model?
Currently the architecture of TurboMind struggles to adapt to new model architectures. We are doing multiple rounds of refactoring that aim to make TurboMind friendly to new model arches and fancy generation schemes like MTP / dLLM.
Once the aforementioned refactoring is complete (targeted for the end of 2025), and if the Qwen team continues to ship models consistent with the Qwen3-Next architecture, we will try to support them all at once.