[Feature] Request support for the quantization formats of compressed-tensors
Motivation
compressed-tensors has become the more mainstream model quantization library; its AWQ and other quantized models are widely supported by inference engines such as vLLM. I have noticed that lmdeploy's update pace has slowed considerably recently, and support for new models and new quantization libraries seems to have stalled. Is there still a plan to keep up with the mainstream community ecosystem?
Related resources
No response
Additional context
No response
Hi, thanks for the support and the feedback.
1. "compressed-tensors has become the more mainstream model quantization library; its AWQ and other quantized models are widely supported by inference engines such as vLLM."
We have indeed noticed the compressed-tensors quantization library (for example, the GLM4.5 quantized weights are released in this format). However, several of the most widely used open-source models (Qwen3, Qwen2.5-VL, InternLM, InternVL, InternS1, GPT-OSS) do not appear to have official weights in this format, so we have not supported it yet.
2. "Support for new models and new quantization libraries seems to have stalled. Is there still a plan to keep up with the mainstream community ecosystem?"
LMDeploy has recently added support for a series of popular models, including:
- Support GLM-4-0414 and GLM-4.1V: https://github.com/InternLM/lmdeploy/pull/3846
- Support GLM-4.5: https://github.com/InternLM/lmdeploy/pull/3863
- Support InternVL3.5: https://github.com/InternLM/lmdeploy/pull/3886
- PytorchEngine support for gpt-oss bf16: https://github.com/InternLM/lmdeploy/pull/3820
Do these cover your use case? If a model you want to use is not yet supported, please open an issue so we know which model it is and can schedule the adaptation work.
Thanks for the reply. lmdeploy's TurboMind has long been the AWQ inference backend I use the most. However, since the Qwen3 2507 release, AWQ models quantized with AutoAWQ (gemm) have not performed well and show significant accuracy loss, so the community is gradually abandoning the old AWQ quantization workflow. Download counts on Hugging Face reflect this trend: the most downloaded Qwen AWQ models are now all quantized with compressed-tensors. For reference, here is the quantization_config of one such model:
"quantization_config": {
--
"config_groups": {
"group_0": {
"format": "pack-quantized",
"input_activations": null,
"output_activations": null,
"targets": [
"Linear"
],
"weights": {
"actorder": null,
"block_structure": null,
"dynamic": false,
"group_size": 32,
"num_bits": 4,
"observer": "mse",
"observer_kwargs": {},
"strategy": "group",
"symmetric": true,
"type": "int"
}
}
},
"format": "pack-quantized",
"global_compression_ratio": null,
"ignore": [
"model.layers.0.mlp.gate",
"model.layers.1.mlp.gate",
"model.layers.2.mlp.gate",
"model.layers.3.mlp.gate",
"model.layers.4.mlp.gate",
"model.layers.5.mlp.gate",
"model.layers.6.mlp.gate",
"model.layers.7.mlp.gate",
"model.layers.8.mlp.gate",
"model.layers.9.mlp.gate",
"model.layers.10.mlp.gate",
"model.layers.11.mlp.gate",
"model.layers.12.mlp.gate",
"model.layers.13.mlp.gate",
"model.layers.14.mlp.gate",
"model.layers.15.mlp.gate",
"model.layers.16.mlp.gate",
"model.layers.17.mlp.gate",
"model.layers.18.mlp.gate",
"model.layers.19.mlp.gate",
"model.layers.20.mlp.gate",
"model.layers.21.mlp.gate",
"model.layers.22.mlp.gate",
"model.layers.23.mlp.gate",
"model.layers.24.mlp.gate",
"model.layers.25.mlp.gate",
"model.layers.26.mlp.gate",
"model.layers.27.mlp.gate",
"model.layers.28.mlp.gate",
"model.layers.29.mlp.gate",
"model.layers.30.mlp.gate",
"model.layers.31.mlp.gate",
"model.layers.32.mlp.gate",
"model.layers.33.mlp.gate",
"model.layers.34.mlp.gate",
"model.layers.35.mlp.gate",
"model.layers.36.mlp.gate",
"model.layers.37.mlp.gate",
"model.layers.38.mlp.gate",
"model.layers.39.mlp.gate",
"model.layers.40.mlp.gate",
"model.layers.41.mlp.gate",
"model.layers.42.mlp.gate",
"model.layers.43.mlp.gate",
"model.layers.44.mlp.gate",
"model.layers.45.mlp.gate",
"model.layers.46.mlp.gate",
"model.layers.47.mlp.gate",
"lm_head"
],
"kv_cache_scheme": null,
"quant_method": "compressed-tensors",
"quantization_status": "compressed",
"sparsity_config": {},
"transform_config": {},
"version": "0.10.3.dev47+ge463fe6"
}
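To make the fields concrete: the config above describes group-wise symmetric int4 weight quantization (num_bits=4, group_size=32, no zero point). Below is a minimal sketch of the corresponding dequantization math, not lmdeploy or compressed-tensors code; the tensor shapes are my assumption, and the actual pack-quantized checkpoint additionally stores the int4 values bit-packed into int32, which is omitted here.

```python
import torch

def dequantize_group_symmetric_int4(q_int: torch.Tensor,
                                     scales: torch.Tensor,
                                     group_size: int = 32) -> torch.Tensor:
    """Dequantize group-wise symmetric int4 weights.

    q_int:  [out_features, in_features] signed int values in [-8, 7]
    scales: [out_features, in_features // group_size] per-group fp scales
    """
    out_f, in_f = q_int.shape
    assert in_f % group_size == 0
    q = q_int.to(torch.float32).view(out_f, in_f // group_size, group_size)
    # symmetric quantization has no zero point: w ~= q * scale per group
    w = q * scales.to(torch.float32).unsqueeze(-1)
    return w.view(out_f, in_f)

# toy usage with random data
q = torch.randint(-8, 8, (4, 64), dtype=torch.int8)
s = torch.rand(4, 64 // 32, dtype=torch.float16)
w = dequantize_group_symmetric_int4(q, s)
```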
As I said before, lmdeploy's TurboMind has been an excellent inference backend for AWQ-quantized models in terms of inference speed, memory usage, and more. Supporting AWQ models quantized with compressed-tensors is therefore something many lmdeploy users are hoping for, and several issues have already raised this request. I hope support can be added. Thanks.
Sure. We'll come back to this feature after v0.10.0. That version is likely to be released this week.
Woo, amazing, I'm waiting for release!
+1
+1
Looking forward to using this feature as soon as possible.
any progress?
Sorry, folks. The team's bandwidth is currently full, so we cannot support this requirement at the moment.
lmdeploy lite auto_awq can no longer quantize new models such as GLM-4.6, and AutoAWQ/GPTQ quantizations of new models can no longer be found on Hugging Face either.
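As a quick way to verify which quantization format a published checkpoint actually uses, one can read its quantization_config directly; a small sketch, with the repo id as a placeholder:

```python
import json
from huggingface_hub import hf_hub_download

# repo id is illustrative; substitute the quantized checkpoint you want to inspect
path = hf_hub_download("some-org/some-quantized-model", "config.json")
with open(path) as f:
    qcfg = json.load(f).get("quantization_config", {})

# e.g. "awq", "gptq", or "compressed-tensors"
print(qcfg.get("quant_method"))
```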