[Feature] Request support for the quantization formats of compressed-tensors
Motivation
compressed-tensors has become the more mainstream model quantization library; its AWQ and other quantized models are widely supported by inference engines such as vLLM. I have noticed that lmdeploy's update pace has slowed considerably recently, and support for new models and new quantization libraries seems to have stalled. Is there still a plan to keep up with the mainstream community ecosystem?
Related resources
No response
Additional context
No response
Hi, thanks for the support and the feedback.
1. "compressed-tensors has become the more mainstream model quantization library; its AWQ and other quantized models are widely supported by inference engines such as vLLM."
We have indeed noticed the compressed-tensors quantization library (for example, the GLM4.5 quantized weights are released in this format). However, several of the most widely used open-source models (Qwen3, Qwen2.5-VL, InternLM, InternVL, InternS1, GPT-OSS) do not appear to have official weights in this format, so we have not supported it yet.
2. "Support for new models and new quantization libraries seems to have stalled. Is there still a plan to keep up with the mainstream community ecosystem?"
LMDeploy has recently added support for a series of popular models, including:
- Support GLM-4-0414 and GLM-4.1V: https://github.com/InternLM/lmdeploy/pull/3846
- Support GLM-4.5: https://github.com/InternLM/lmdeploy/pull/3863
- Support InternVL3.5: https://github.com/InternLM/lmdeploy/pull/3886
- PytorchEngine support for gpt-oss bf16: https://github.com/InternLM/lmdeploy/pull/3820
Do these cover your use case? If a model you want to use is not yet supported, please open an issue so we know which model it is and can schedule the adaptation work.
Thanks for the reply. lmdeploy's TurboMind has long been the AWQ inference backend I use the most. However, since the Qwen3 2507 release, AWQ models quantized with AutoAWQ (gemm) have not performed well and show significant accuracy loss, so the community is gradually abandoning the old AWQ quantization workflow. Download counts on Hugging Face reflect this trend: the most downloaded Qwen AWQ models are now all quantized with compressed-tensors. For reference, here is the quantization_config of one such model:
"quantization_config": {
--
"config_groups": {
"group_0": {
"format": "pack-quantized",
"input_activations": null,
"output_activations": null,
"targets": [
"Linear"
],
"weights": {
"actorder": null,
"block_structure": null,
"dynamic": false,
"group_size": 32,
"num_bits": 4,
"observer": "mse",
"observer_kwargs": {},
"strategy": "group",
"symmetric": true,
"type": "int"
}
}
},
"format": "pack-quantized",
"global_compression_ratio": null,
"ignore": [
"model.layers.0.mlp.gate",
"model.layers.1.mlp.gate",
"model.layers.2.mlp.gate",
"model.layers.3.mlp.gate",
"model.layers.4.mlp.gate",
"model.layers.5.mlp.gate",
"model.layers.6.mlp.gate",
"model.layers.7.mlp.gate",
"model.layers.8.mlp.gate",
"model.layers.9.mlp.gate",
"model.layers.10.mlp.gate",
"model.layers.11.mlp.gate",
"model.layers.12.mlp.gate",
"model.layers.13.mlp.gate",
"model.layers.14.mlp.gate",
"model.layers.15.mlp.gate",
"model.layers.16.mlp.gate",
"model.layers.17.mlp.gate",
"model.layers.18.mlp.gate",
"model.layers.19.mlp.gate",
"model.layers.20.mlp.gate",
"model.layers.21.mlp.gate",
"model.layers.22.mlp.gate",
"model.layers.23.mlp.gate",
"model.layers.24.mlp.gate",
"model.layers.25.mlp.gate",
"model.layers.26.mlp.gate",
"model.layers.27.mlp.gate",
"model.layers.28.mlp.gate",
"model.layers.29.mlp.gate",
"model.layers.30.mlp.gate",
"model.layers.31.mlp.gate",
"model.layers.32.mlp.gate",
"model.layers.33.mlp.gate",
"model.layers.34.mlp.gate",
"model.layers.35.mlp.gate",
"model.layers.36.mlp.gate",
"model.layers.37.mlp.gate",
"model.layers.38.mlp.gate",
"model.layers.39.mlp.gate",
"model.layers.40.mlp.gate",
"model.layers.41.mlp.gate",
"model.layers.42.mlp.gate",
"model.layers.43.mlp.gate",
"model.layers.44.mlp.gate",
"model.layers.45.mlp.gate",
"model.layers.46.mlp.gate",
"model.layers.47.mlp.gate",
"lm_head"
],
"kv_cache_scheme": null,
"quant_method": "compressed-tensors",
"quantization_status": "compressed",
"sparsity_config": {},
"transform_config": {},
"version": "0.10.3.dev47+ge463fe6"
}
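To make the fields concrete: the config above describes group-wise symmetric int4 weight quantization (num_bits=4, group_size=32, no zero point). Below is a minimal sketch of the corresponding dequantization math, not lmdeploy or compressed-tensors code; the tensor shapes are my assumption, and the actual pack-quantized checkpoint additionally stores the int4 values bit-packed into int32, which is omitted here.

```python
import torch

def dequantize_group_symmetric_int4(q_int: torch.Tensor,
                                     scales: torch.Tensor,
                                     group_size: int = 32) -> torch.Tensor:
    """Dequantize group-wise symmetric int4 weights.

    q_int:  [out_features, in_features] signed int values in [-8, 7]
    scales: [out_features, in_features // group_size] per-group fp scales
    """
    out_f, in_f = q_int.shape
    assert in_f % group_size == 0
    q = q_int.to(torch.float32).view(out_f, in_f // group_size, group_size)
    # symmetric quantization has no zero point: w ~= q * scale per group
    w = q * scales.to(torch.float32).unsqueeze(-1)
    return w.view(out_f, in_f)

# toy usage with random data
q = torch.randint(-8, 8, (4, 64), dtype=torch.int8)
s = torch.rand(4, 64 // 32, dtype=torch.float16)
w = dequantize_group_symmetric_int4(q, s)
```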
As I said before, lmdeploy's TurboMind has been an excellent inference backend for AWQ-quantized models in terms of inference speed, memory usage, and more. Supporting AWQ models quantized with compressed-tensors is therefore something many lmdeploy users are hoping for, and several issues have already raised this request. I hope support can be added. Thanks.
Sure. We'll come back to this feature after v0.10.0. That version is likely to be released this week.
Woo, amazing, I'm waiting for release!
+1
+1
Looking forward to using this feature as soon as possible.
any progress?
Sorry, folks. The team's bandwidth is currently full, so we cannot support this requirement at the moment.
lmdeploy lite auto_awq can no longer quantize new models such as GLM-4.6, and AutoAWQ/GPTQ quantizations of new models can no longer be found on Hugging Face either.
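As a quick way to verify which quantization format a published checkpoint actually uses, one can read its quantization_config directly; a small sketch, with the repo id as a placeholder:

```python
import json
from huggingface_hub import hf_hub_download

# repo id is illustrative; substitute the quantized checkpoint you want to inspect
path = hf_hub_download("some-org/some-quantized-model", "config.json")
with open(path) as f:
    qcfg = json.load(f).get("quantization_config", {})

# e.g. "awq", "gptq", or "compressed-tensors"
print(qcfg.get("quant_method"))
```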