[Feature] Qwen3-Next-80B-A3B Support
Motivation
Hoping 80B-A3B support lands soon so that older cards can run it too.
Related resources
No response
Additional context
No response
Yes please - for v100s!
This feels difficult. Adapting to a new architecture is always a lot of work, and PTQ quantization won't be easy to get right either.
GPT5 analysis of changes required - correct @maxin9966 ?
What would have to change in LMDeploy TurboMind (for Volta)
- Model registration / config parsing
  - Add a `qwen3_next` architecture to TurboMind: parse the HF config keys (hybrid layout, GA/DeltaNet head counts, RoPE dim=64, KV heads=2, GA head_dim=256; DeltaNet head_dim=128; 48 layers; 512 experts, Top-10 activated, 1 shared expert) and map them to TurboMind layer objects. (Hugging Face) See the config-parsing sketch after this list.
- New attention kernels
  - Gated Attention (GA): implement/fuse the Q/K/V projections, gating, RoPE(64), and attention with Q heads=16, KV heads=2, head_dim=256. A reference-semantics sketch follows this list.
  - Gated DeltaNet: implement the linear-attention-style "DeltaNet" path with its gating; separate kernels for QK (16 heads) and V (32 heads), head_dim=128.
  - Integrate both into TurboMind's decode/KV-management and paged-KV flow (Flash-/Paged-decoding analogues). These blocks don't exist in TurboMind today. (Hugging Face)
- MoE routing at Top-10-of-512 (+1 shared)
  - Implement token router → top-k expert selection, capacity & dispatch, with expert parallelism (EP) or equivalent sharding; otherwise all 512 experts must be resident per GPU, which is impractical. TurboMind has MoE support, but scaling to 512 experts with Top-10 dispatch requires efficient all-to-all and placement policies across TP×EP on NVLink. (lmdeploy.readthedocs.io) See the routing sketch after this list.
- Volta dtype path (no BF16/FP8 on V100)
  - Add a robust BF16→FP16 weight cast on load (and ensure the numerics of the zero-centered/weight-decayed layernorm remain stable in FP16); a cast sketch follows this list.
  - Keep compute in FP16; FP8 kernels are not an option on V100. (V100 Tensor Cores accelerate FP16 only; BF16 arrived with Ampere; FP8 with Hopper.) (NVIDIA Images)
- KV-cache quantization reality
  - TurboMind does not support 4/8-bit KV when head_dim≠128. GA uses head_dim=256, so you must run FP16 KV (or extend TurboMind to support KV-INTx for non-128 dims). Expect materially higher KV memory at long context; see the back-of-the-envelope estimate after this list. (lmdeploy.readthedocs.io)
- Speculative/MTP path (optional, for speed)
  - Qwen3-Next includes MTP; to match SGLang/vLLM performance, you'd add a speculative-decoding plugin (e.g., NEXTN/Eagle variants). If omitted, it still runs, just slower. (Hugging Face)
- Weight loader & tensor mapping
  - Add safetensor→TurboMind weight remapping for `qwen3_next` (distinct parameter names for GA/DeltaNet/MoE; shared expert). Provide a converter that flattens the experts for EP sharding. A remapping sketch follows this list.
- Distributed runtime
  - Enable TP=8 + EP≥8 to fit the experts and reduce the per-GPU footprint; add all-to-all routing (NCCL) optimizations and NVLink-aware placement. Provide launch recipes (`--tp 8 --ep 8`, gradients off, paged KV on); see the engine-config sketch after this list.
- Context scaling & RoPE
  - Native context is 262k (extensible with YaRN). On TurboMind/Volta you'll likely ship defaults at 32–64k first (given FP16 KV and no KV-INTx for head_dim=256). Later, add YaRN hooks in the loader to reach ≥256k; see the YaRN config sketch below. (Hugging Face)
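A few hedged Python sketches of the items above follow, roughly in the order of the list. First, config parsing: a minimal sketch of mapping the HF config into a TurboMind-side struct. The dataclass fields and the HF key names are assumptions for illustration; only the fallback values come from the figures quoted above.

```python
from dataclasses import dataclass
import json

@dataclass
class Qwen3NextTurboMindCfg:
    """Hypothetical TurboMind-side config; field names are illustrative."""
    num_layers: int          # 48
    ga_q_heads: int          # 16
    ga_kv_heads: int         # 2
    ga_head_dim: int         # 256
    rope_dim: int            # 64 (partial rotary)
    deltanet_qk_heads: int   # 16
    deltanet_v_heads: int    # 32
    deltanet_head_dim: int   # 128
    num_experts: int         # 512
    experts_per_token: int   # 10
    num_shared_experts: int  # 1

def parse_hf_config(path: str) -> Qwen3NextTurboMindCfg:
    # The HF key names below are assumptions; check the released config.json
    # and adjust before wiring this into a real loader.
    with open(path) as f:
        cfg = json.load(f)
    return Qwen3NextTurboMindCfg(
        num_layers=cfg.get("num_hidden_layers", 48),
        ga_q_heads=cfg.get("num_attention_heads", 16),
        ga_kv_heads=cfg.get("num_key_value_heads", 2),
        ga_head_dim=cfg.get("head_dim", 256),
        rope_dim=cfg.get("rotary_dim", 64),
        deltanet_qk_heads=cfg.get("linear_num_key_heads", 16),
        deltanet_v_heads=cfg.get("linear_num_value_heads", 32),
        deltanet_head_dim=cfg.get("linear_value_head_dim", 128),
        num_experts=cfg.get("num_experts", 512),
        experts_per_token=cfg.get("num_experts_per_tok", 10),
        num_shared_experts=cfg.get("num_shared_experts", 1),
    )
```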
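Next, the Gated Attention block as reference semantics (not a kernel): GQA with 16 query heads over 2 KV heads at head_dim=256, partial RoPE on the first 64 dims, and a sigmoid output gate. The gate placement and the RoPE convention are assumptions; a real TurboMind kernel would fuse these steps and read K/V from the paged cache.

```python
import torch
import torch.nn.functional as F

def apply_partial_rope(x, cos, sin, rope_dim=64):
    # x: [B, T, H, 256]; cos/sin: [T, 1, 32] so they broadcast over batch and heads.
    # Rotate only the first `rope_dim` channels of each head; pass the rest through.
    x_rot, x_pass = x[..., :rope_dim], x[..., rope_dim:]
    x1, x2 = x_rot.chunk(2, dim=-1)
    rotated = torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return torch.cat((rotated, x_pass), dim=-1)

def gated_attention_reference(q, k, v, gate, cos, sin):
    # q: [B, T, 16, 256], k/v: [B, T, 2, 256], gate: [B, T, 16*256]
    q = apply_partial_rope(q, cos, sin)
    k = apply_partial_rope(k, cos, sin)
    # Expand the 2 KV heads to the 16 query heads (GQA, group size 8).
    k = k.repeat_interleave(8, dim=2)
    v = v.repeat_interleave(8, dim=2)
    # scaled_dot_product_attention expects [B, heads, T, head_dim].
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    out = out.transpose(1, 2).flatten(-2)       # back to [B, T, 16*256]
    return out * torch.sigmoid(gate)            # output gating (assumed placement)
```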
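The Top-10-of-512 router, again as reference semantics only. The softmax-then-renormalize weighting and the ungated shared expert are assumptions; capacity limits and the EP all-to-all dispatch, which are the hard parts, are left out.

```python
import torch

def route_top10_of_512(hidden, router_weight, experts, shared_expert, top_k=10):
    """hidden: [tokens, d]; router_weight: [512, d]; experts / shared_expert: callables."""
    logits = hidden @ router_weight.t()                    # [tokens, 512]
    probs = torch.softmax(logits, dim=-1, dtype=torch.float32)
    topk_p, topk_idx = probs.topk(top_k, dim=-1)           # [tokens, 10]
    topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)     # renormalize (assumed)

    out = shared_expert(hidden)                            # shared expert, always active
    # Naive dispatch: group tokens by expert id. A real EP implementation would
    # pack the tokens and all-to-all them to the GPUs owning each expert instead.
    for e in topk_idx.unique().tolist():
        token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
        w = topk_p[token_ids, slot].unsqueeze(-1).to(hidden.dtype)
        out[token_ids] += w * experts[e](hidden[token_ids])
    return out
```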
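The BF16→FP16 cast on load, per safetensors shard. Clamping to the FP16 range before the cast guards against BF16 values that would otherwise overflow to inf; file paths are whatever the converter hands you.

```python
import torch
from safetensors.torch import load_file, save_file

FP16_MAX = torch.finfo(torch.float16).max

def cast_shard_bf16_to_fp16(src_path: str, dst_path: str) -> None:
    tensors = load_file(src_path)               # dict[str, torch.Tensor], on CPU
    out = {}
    for name, t in tensors.items():
        if t.dtype == torch.bfloat16:
            # BF16 has a much wider exponent range than FP16; clamp to avoid inf.
            t = t.float().clamp_(-FP16_MAX, FP16_MAX).half()
        out[name] = t
    save_file(out, dst_path)
```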
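A back-of-the-envelope FP16 KV-cache estimate. It assumes a 3:1 DeltaNet-to-GA hybrid layout, i.e. 12 of the 48 layers keep a KV cache; treat that count as an assumption and plug in the real layout once confirmed.

```python
def fp16_kv_bytes_per_token(kv_heads=2, head_dim=256, full_attn_layers=12):
    # K and V, 2 bytes per FP16 element, per full-attention (GA) layer.
    return 2 * kv_heads * head_dim * 2 * full_attn_layers

per_token = fp16_kv_bytes_per_token()            # 24576 bytes ≈ 24 KiB per token
for ctx in (32_768, 65_536, 262_144):
    print(f"{ctx:>7} tokens -> {ctx * per_token / 2**30:.2f} GiB FP16 KV cache")
# ~0.75 GiB at 32k, ~1.5 GiB at 64k, ~6 GiB at 256k -- per sequence, before any
# batching, and with no KV-INTx available for head_dim=256.
```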
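A sketch of the safetensor→TurboMind name remapping. Both the HF-side patterns and the TurboMind-side names here are hypothetical placeholders; the real table has to be read off the released checkpoint and TurboMind's weight classes.

```python
import re

# Hypothetical (source pattern, target template) pairs -- illustration only.
REMAP_RULES = [
    (r"model\.layers\.(\d+)\.self_attn\.(q|k|v|o)_proj\.weight",
     r"layers.\1.ga.\2_proj.weight"),
    (r"model\.layers\.(\d+)\.linear_attn\.(.+)",
     r"layers.\1.deltanet.\2"),
    (r"model\.layers\.(\d+)\.mlp\.experts\.(\d+)\.(.+)",
     r"layers.\1.moe.expert.\2.\3"),
    (r"model\.layers\.(\d+)\.mlp\.shared_expert\.(.+)",
     r"layers.\1.moe.shared.\2"),
]

def remap_name(hf_name: str) -> str:
    for pattern, template in REMAP_RULES:
        if re.fullmatch(pattern, hf_name):
            return re.sub(pattern, template, hf_name)
    return hf_name   # pass through anything not recognized
```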
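What the launch recipe could look like through LMDeploy's Python API once (if) support lands. `tp`, `session_len` and `cache_max_entry_count` are existing `TurbomindEngineConfig` fields; the expert-parallel degree is shown only as a comment because no such option exists today, and this obviously does not run against current releases.

```python
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    tp=8,                         # tensor parallelism across the 8 V100s
    session_len=65_536,           # conservative default given FP16 KV (see estimate above)
    cache_max_entry_count=0.7,    # fraction of free memory reserved for the paged KV cache
    # ep=8,                       # hypothetical expert-parallel degree; does not exist yet
)

pipe = pipeline("Qwen/Qwen3-Next-80B-A3B-Instruct", backend_config=engine_cfg)
print(pipe(["Would Qwen3-Next fit on 8 x V100?"]))
```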
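Finally, the YaRN hook would amount to honoring a `rope_scaling` block like the ones Qwen ships for its other models; the 4x factor and the key names here are illustrative, not taken from the Qwen3-Next card.

```python
# Illustrative rope_scaling block a YaRN-aware loader would consume. The native
# 262144-token window is from the model card; the 4x factor is just an example.
yarn_rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262_144,
}

def scaled_max_context(rope_scaling: dict) -> int:
    return int(rope_scaling["original_max_position_embeddings"] * rope_scaling["factor"])

print(scaled_max_context(yarn_rope_scaling))     # 1048576
```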
Old V100 cards depend entirely on lmdeploy.
Really? I am running lmdeploy with the TurboMind backend on 8 x V100s. PyTorch no longer supports Volta.
Running it with vLLM just OOMs and there's nothing I can do about it; the KV cache is quite large. Hoping this gets supported, I only run quantized models. Only the latest flash_attn (2.8.3) supports torch 2.8, and the latest vLLM only supports torch 2.8, but KV-cache quantization requires flash_attn<=2.8.2.
lmdeploy is the shining light of technology for old cards.
This feels very hard; there's no sign of it being adapted right now. llama.cpp hasn't adapted it either.
This probably won't happen that quickly.
Is there any chance for turbomind support with this model?
Currently the architecture of TurboMind struggles to adapt to new model architectures. We are doing multiple rounds of refactoring that aim to make TurboMind friendly to new model arches and fancy generation schemes like MTP / dLLM.
Once the aforementioned refactoring is complete (targeted for the end of 2025), and if the Qwen team continues to ship models consistent with the Qwen3-Next architecture, we will try to support them all at once.