Li Zhang
#2090 adds support for both AWQ and GPTQ models on V100.
@eigen2017 Currently GPTQ is only supported with group_size=128 and desc_act=False (which covers most of the GPTQ models released for the Qwen series). Editing the quantization config alone does not change the nature of the weights themselves. A group_size=-1 model can be converted to group_size=128 by repeating its scales and qzeros ceil_div(input_dims, 128) times. desc_act requires a few additional reordering steps, which are not implemented yet.
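For anyone who wants to try the repetition trick, here is a minimal sketch. It assumes the common AutoGPTQ checkpoint layout, where a group_size=-1 layer stores a single group, i.e. `scales` of shape `[1, out_features]` and packed `qzeros` of shape `[1, out_features // 8]`; the function name and standalone form are mine, not part of lmdeploy's converter.

```python
import math

import torch


def expand_to_group128(scales: torch.Tensor, qzeros: torch.Tensor,
                       in_features: int):
    """Expand single-group (group_size=-1) GPTQ metadata to group_size=128.

    scales: [1, out_features] fp16 tensor
    qzeros: [1, out_features // 8] packed int32 tensor
    """
    num_groups = math.ceil(in_features / 128)  # ceil_div(input_dims, 128)
    # Every 128-wide group reuses the same scale / zero point, so
    # duplicating the single row is numerically equivalent.
    scales_128 = scales.repeat(num_groups, 1)   # [num_groups, out_features]
    qzeros_128 = qzeros.repeat(num_groups, 1)   # [num_groups, out_features // 8]
    return scales_128, qzeros_128
```

The packed qweight itself stays untouched; only the per-group metadata is duplicated, and group_size in the quantization config then needs to be set to 128 so the checkpoint loads as a regular group_size=128 GPTQ model.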
Almost there. The W4A16 kernel for V100 has already been verified. I still need some time to put everything together; it's a big update.
This issue will be fixed by #2090.
@josephrocca @fanghostt Can you reproduce it with other models? I can't reproduce it with Qwen2-7B-AWQ or Llama3-70B-AWQ with v0.6.0 on 2 RTX 4090 GPUs.
@josephrocca Sorry for the confusion. Internet access is quite limited on our 4090 environment so I started with what I already have on the machine.
@josephrocca In my test with Llama3 70B AWQ on 2x4090, `--cache-max-entry-count 0.5` is needed to avoid OOM.
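If you are using the Python API rather than `lmdeploy serve`, the same knobs are exposed on the engine config; a rough sketch (the model path is a placeholder):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# tp=2 shards the 70B model across both 4090s; cache_max_entry_count=0.5
# lowers the fraction of GPU memory reserved for the k/v cache, which is
# what avoided the OOM in my test.
backend_config = TurbomindEngineConfig(tp=2, cache_max_entry_count=0.5)
pipe = pipeline('path/to/Llama3-70B-AWQ', backend_config=backend_config)

print(pipe(['Hello, who are you?']))
```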
> CUDA_VISIBLE_DEVICES=3,4 lmdeploy serve api_server /app123/model/DeepSeek-R1-Distill-Llama-70B --backend turbomind --server-port 8000 --device cuda --chat-template deepseek

You need to add `--tp 2`.
@chestnut111 Please paste the output of `python3 -m lmdeploy check_env`.
I'd suggest trying `NCCL_P2P_DISABLE=1`.
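With the CLI, an `export NCCL_P2P_DISABLE=1` before `lmdeploy serve api_server ...` is enough. If you launch through the Python API instead, the variable just has to be set before the engine (and therefore NCCL) initializes; a quick sketch with a placeholder model path:

```python
import os

# Must be set before the engine initializes NCCL.
os.environ['NCCL_P2P_DISABLE'] = '1'

from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline('path/to/model', backend_config=TurbomindEngineConfig(tp=2))
```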