Li Zhang
#2090 adds support for both AWQ and GPTQ models on V100.
@eigen2017 Currently GPTQ is only supported with group_size=128 and desc_act=False (which covers most of the GPTQ models released for the Qwen series). Editing the quantization config alone does not change the nature of the weights themselves. A group_size=-1 model can be converted to group_size=128 by repeating its scales and qzeros ceil_div(input_dims, 128) times. desc_act requires a few additional reordering steps, which are not implemented yet.
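For anyone who wants to try the repetition trick, here is a minimal sketch. It assumes the common AutoGPTQ checkpoint layout, where a group_size=-1 layer stores a single group, i.e. `scales` of shape `[1, out_features]` and packed `qzeros` of shape `[1, out_features // 8]`; the function name and standalone form are mine, not part of lmdeploy's converter.

```python
import math

import torch


def expand_to_group128(scales: torch.Tensor, qzeros: torch.Tensor,
                       in_features: int):
    """Expand single-group (group_size=-1) GPTQ metadata to group_size=128.

    scales: [1, out_features] fp16 tensor
    qzeros: [1, out_features // 8] packed int32 tensor
    """
    num_groups = math.ceil(in_features / 128)  # ceil_div(input_dims, 128)
    # Every 128-wide group reuses the same scale / zero point, so
    # duplicating the single row is numerically equivalent.
    scales_128 = scales.repeat(num_groups, 1)   # [num_groups, out_features]
    qzeros_128 = qzeros.repeat(num_groups, 1)   # [num_groups, out_features // 8]
    return scales_128, qzeros_128
```

The packed qweight itself stays untouched; only the per-group metadata is duplicated, and group_size in the quantization config then needs to be set to 128 so the checkpoint loads as a regular group_size=128 GPTQ model.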
Almost there. The W4A16 kernel for V100 has already been verified. I still need some time to put everything together; it's a big update.
This issue will be fixed by #2090.
@josephrocca @fanghostt Can you reproduce it with other models? I can't reproduce it with Qwen2-7B-AWQ or Llama3-70B-AWQ with v0.6.0 on 2 RTX 4090 GPUs.
@josephrocca Sorry for the confusion. Internet access is quite limited on our 4090 environment so I started with what I already have on the machine.
@josephrocca In my test with Llama3 70B AWQ on 2x4090, `--cache-max-entry-count 0.5` is needed to avoid OOM.
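If you are using the Python API rather than `lmdeploy serve`, the same knobs are exposed on the engine config; a rough sketch (the model path is a placeholder):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# tp=2 shards the 70B model across both 4090s; cache_max_entry_count=0.5
# lowers the fraction of GPU memory reserved for the k/v cache, which is
# what avoided the OOM in my test.
backend_config = TurbomindEngineConfig(tp=2, cache_max_entry_count=0.5)
pipe = pipeline('path/to/Llama3-70B-AWQ', backend_config=backend_config)

print(pipe(['Hello, who are you?']))
```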
> CUDA_VISIBLE_DEVICES=3,4 lmdeploy serve api_server /app123/model/DeepSeek-R1-Distill-Llama-70B --backend turbomind --server-port 8000 --device cuda --chat-template deepseek

You need to add `--tp 2`.
@chestnut111 Please paste the output of `python3 -m lmdeploy check_env`.
I'd suggest trying `NCCL_P2P_DISABLE=1`.
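With the CLI, an `export NCCL_P2P_DISABLE=1` before `lmdeploy serve api_server ...` is enough. If you launch through the Python API instead, the variable just has to be set before the engine (and therefore NCCL) initializes; a quick sketch with a placeholder model path:

```python
import os

# Must be set before the engine initializes NCCL.
os.environ['NCCL_P2P_DISABLE'] = '1'

from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline('path/to/model', backend_config=TurbomindEngineConfig(tp=2))
```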