
[Feature] Request support for the various quantization formats of compressed-tensors

Open BUJIDAOVS opened this issue 3 months ago • 10 comments

Motivation

compressed-tensors has become the more mainstream model quantization library, and its AWQ and other quantized models are widely supported by inference engines such as vLLM. I have noticed that lmdeploy's update cadence has slowed considerably lately, and support for new models and new quantization libraries seems to have stalled. Is there still a plan to keep up with the mainstream community ecosystem?

Related resources

No response

Additional context

No response

BUJIDAOVS avatar Sep 01 '25 06:09 BUJIDAOVS

Hi, thanks for the support and feedback.

1. "compressed-tensors has become the more mainstream quantization library, and its AWQ and other quantized models are widely supported by inference engines such as vLLM." We have indeed noticed the compressed-tensors library (for example, the quantized weights of GLM4.5 are stored in this format). However, since some of the most widely used open-source models (Qwen3, Qwen2.5-VL, InternLM, InternVL, InternS1, GPT-OSS) do not appear to have official weights in this format, we have not supported it yet.

2. "Support for new models and new quantization libraries seems to have stalled; is there still a plan to keep up with the mainstream community ecosystem?" LMDeploy has recently added support for a number of popular models, including:

  • https://github.com/InternLM/lmdeploy/pull/3846
  • https://github.com/InternLM/lmdeploy/pull/3863
  • https://github.com/InternLM/lmdeploy/pull/3886
  • https://github.com/InternLM/lmdeploy/pull/3820

among other trending models. Do these cover your use case? If a model you want to use is not supported yet, please open an issue so that we know the model type and can schedule the adaptation work.

CUHKSZzxy avatar Sep 01 '25 10:09 CUHKSZzxy

Thanks for the reply. lmdeploy's turbomind has long been the AWQ inference backend I use most. However, since the Qwen3 2507 update, AWQ models quantized with AutoAWQ (GEMM) have performed poorly, showing significant accuracy loss, so the community is gradually moving away from the old AWQ quantization workflow. Download counts on Hugging Face reflect this trend: the most downloaded Qwen AWQ models are now all quantized with compressed-tensors. For reference, here is the quantization_config from one such model:

"quantization_config": {
--
"config_groups": {
"group_0": {
"format": "pack-quantized",
"input_activations": null,
"output_activations": null,
"targets": [
"Linear"
],
"weights": {
"actorder": null,
"block_structure": null,
"dynamic": false,
"group_size": 32,
"num_bits": 4,
"observer": "mse",
"observer_kwargs": {},
"strategy": "group",
"symmetric": true,
"type": "int"
}
}
},
"format": "pack-quantized",
"global_compression_ratio": null,
"ignore": [
"model.layers.0.mlp.gate",
"model.layers.1.mlp.gate",
"model.layers.2.mlp.gate",
"model.layers.3.mlp.gate",
"model.layers.4.mlp.gate",
"model.layers.5.mlp.gate",
"model.layers.6.mlp.gate",
"model.layers.7.mlp.gate",
"model.layers.8.mlp.gate",
"model.layers.9.mlp.gate",
"model.layers.10.mlp.gate",
"model.layers.11.mlp.gate",
"model.layers.12.mlp.gate",
"model.layers.13.mlp.gate",
"model.layers.14.mlp.gate",
"model.layers.15.mlp.gate",
"model.layers.16.mlp.gate",
"model.layers.17.mlp.gate",
"model.layers.18.mlp.gate",
"model.layers.19.mlp.gate",
"model.layers.20.mlp.gate",
"model.layers.21.mlp.gate",
"model.layers.22.mlp.gate",
"model.layers.23.mlp.gate",
"model.layers.24.mlp.gate",
"model.layers.25.mlp.gate",
"model.layers.26.mlp.gate",
"model.layers.27.mlp.gate",
"model.layers.28.mlp.gate",
"model.layers.29.mlp.gate",
"model.layers.30.mlp.gate",
"model.layers.31.mlp.gate",
"model.layers.32.mlp.gate",
"model.layers.33.mlp.gate",
"model.layers.34.mlp.gate",
"model.layers.35.mlp.gate",
"model.layers.36.mlp.gate",
"model.layers.37.mlp.gate",
"model.layers.38.mlp.gate",
"model.layers.39.mlp.gate",
"model.layers.40.mlp.gate",
"model.layers.41.mlp.gate",
"model.layers.42.mlp.gate",
"model.layers.43.mlp.gate",
"model.layers.44.mlp.gate",
"model.layers.45.mlp.gate",
"model.layers.46.mlp.gate",
"model.layers.47.mlp.gate",
"lm_head"
],
"kv_cache_scheme": null,
"quant_method": "compressed-tensors",
"quantization_status": "compressed",
"sparsity_config": {},
"transform_config": {},
"version": "0.10.3.dev47+ge463fe6"
}
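
For illustration, here is a minimal sketch (standard library only) of how an engine could recognize such a config when loading a checkpoint. It reads only the fields visible in the excerpt above; the full compressed-tensors schema has more options (sparsity_config, transform_config, kv_cache_scheme, ...).

```python
import json

def summarize_quant_config(config_path: str) -> None:
    """Read config.json and summarize a compressed-tensors quantization config."""
    with open(config_path) as f:
        qcfg = json.load(f).get("quantization_config", {})

    if qcfg.get("quant_method") != "compressed-tensors":
        print("not a compressed-tensors checkpoint")
        return

    print(f"format: {qcfg.get('format')}")  # e.g. "pack-quantized"
    for name, group in qcfg.get("config_groups", {}).items():
        w = group["weights"]
        print(f"{name}: targets={group['targets']}, "
              f"{w['type']}{w['num_bits']} ({w['strategy']}, "
              f"group_size={w['group_size']}, symmetric={w['symmetric']})")
    print(f"ignored modules: {len(qcfg.get('ignore', []))}")

summarize_quant_config("config.json")
```

For the config above, this would report an int4, group-size-32, symmetric, pack-quantized scheme targeting Linear layers, with 49 modules (the MoE gates and lm_head) left unquantized.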

As I said before, in terms of inference speed, memory footprint, and every other respect, lmdeploy's turbomind has been an excellent inference backend for AWQ-quantized models. Supporting AWQ models quantized with compressed-tensors is therefore something many lmdeploy users are hoping for, and quite a few issues have already raised this request. I hope support can be provided. Thanks.
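
For a sense of what such support involves: a pack-quantized int4 checkpoint like the one above stores eight 4-bit values per 32-bit word, which must be unpacked and dequantized with per-group scales. The sketch below illustrates this; the nibble order, sign handling, and tensor layout are assumptions for illustration, not the authoritative layout, which a real loader should take from the compressed-tensors pack_quantized compressor.

```python
import torch

def unpack_int4(packed: torch.Tensor, out_features: int, in_features: int) -> torch.Tensor:
    """Unpack [out, in // 8] integer words into [out, in] int4 values (stored as int8)."""
    shifts = torch.arange(0, 32, 4, device=packed.device)   # 8 nibbles per 32-bit word
    nibbles = (packed.unsqueeze(-1) >> shifts) & 0xF        # [out, in // 8, 8]
    q = nibbles.reshape(out_features, in_features).to(torch.int8)
    return torch.where(q >= 8, q - 16, q)                   # assumed: sign-extend int4

def dequantize(q: torch.Tensor, scale: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Symmetric group dequant: one scale per group_size input channels, no zero point."""
    out_features, in_features = q.shape
    w = q.reshape(out_features, in_features // group_size, group_size).float()
    return (w * scale.unsqueeze(-1).float()).reshape(out_features, in_features)

# Quick self-consistency check with random int4 values.
O, I = 4, 64
q_ref = torch.randint(-8, 8, (O, I), dtype=torch.int8)
packed = ((q_ref.reshape(O, I // 8, 8).long() & 0xF) << torch.arange(0, 32, 4)).sum(-1)
assert torch.equal(unpack_int4(packed, O, I), q_ref)
```

An actual turbomind integration would more likely repack these weights into the layout its existing AWQ GEMM kernels expect, rather than dequantizing to floating point.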

BUJIDAOVS avatar Sep 01 '25 10:09 BUJIDAOVS

Sure. We'll come back to this feature after v0.10.0. That version is likely going to be released this week.

lvhan028 avatar Sep 01 '25 11:09 lvhan028

Woo, amazing! I'm waiting for the release!

kukukalaz avatar Sep 04 '25 13:09 kukukalaz

+1

oldnetdog avatar Sep 12 '25 10:09 oldnetdog

+1

warlockedward avatar Sep 24 '25 05:09 warlockedward

Looking forward to using this feature as soon as possible.

Huarong avatar Oct 12 '25 06:10 Huarong

Any progress?

zzc98 avatar Oct 16 '25 01:10 zzc98

Sorry, folks. The team's bandwidth is fully committed, and we cannot support this requirement at the moment.

lvhan028 avatar Oct 16 '25 11:10 lvhan028

lmdeploy lite auto_awq can no longer quantize newer models such as GLM4.6, and AutoAWQ/GPTQ quantizations of new models can no longer be found on HF either.

zh-nj avatar Nov 24 '25 00:11 zh-nj