
[Feature] quantization of internvl-chat-v1.5

Open xiangqi1997 opened this issue 9 months ago • 5 comments

Motivation

Very impressive work. I noticed that a recent PR added support for internvl-chat-v1.5, but it is not practical to run it on two 24G GPU cards. May I ask when quantization will be supported? Also, if only the LLM part is quantized, e.g. to 4 bits, would that be enough to fit?

Related resources

No response

Additional context

No response

xiangqi1997 avatar Apr 29 '24 12:04 xiangqi1997

@irexyc may evaluate the memory footprint

lvhan028 avatar Apr 29 '24 12:04 lvhan028

@irexyc Hi. An 8-bit version of internvl-v1.5 has been released at https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5-Int8/discussions . I am trying to load it on two 24G GPUs, but it fails (OOM). Which parameter can I adjust for that? '--quant-policy 8' does not seem to help in my case. Or is it simply impossible?

xiangqi1997 avatar Apr 30 '24 09:04 xiangqi1997

I also tried four 3090 cards: card 0 ends up with 12G free and the other cards stay idle, but the 8-bit model still fails to load (OOM). The command is: lmdeploy serve api_server ~/.cache/huggingface/hub/models--OpenGVLab--InternVL-Chat-V1-5-Int8/snapshots/872c99216b9dd5f69ea610e160dcc8692f1ab214/ --backend turbomind --server-port 1234 --tp 4 --cache-max-entry-count 0.01

xiangqi1997 avatar Apr 30 '24 09:04 xiangqi1997

They use bitsandbytes to do dynamic quantization, which is not efficient for inference. It is better to use AWQ or GPTQ for quantization. Also, we don't support loading models in this bitsandbytes format.
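For reference, a minimal sketch of what the AWQ route looks like with lmdeploy's lite tool on a plain LLM such as internlm2-chat-20b (the flags shown are an example and may differ across versions):

    lmdeploy lite auto_awq internlm/internlm2-chat-20b \
        --w-bits 4 --w-group-size 128 \
        --work-dir ./internlm2-chat-20b-4bit

The resulting work-dir can then be served with the turbomind backend as an AWQ model.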

The vision part takes up 13G and the LLM part takes about 37G, which could be reduced to about 13G after quantization. Currently the vision model sits on cuda:0, which limits the LLM weights that can be loaded there and the kv cache that can be used.

Next month, we will balance the vision model across multiple GPUs and provide better support for VLM models.

If you use two 24G cards, there will be about 10G left for the kv cache. For the v1.5 model, that gives a maximum concurrency of about 13 (if I calculated right).
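(My reading of the arithmetic, assuming tp=2 and roughly even weight sharding: each card holds about half of the ~13G quantized LLM, card 0 additionally holds the ~13G vision model, and since the kv cache is allocated symmetrically across the two tp ranks, card 0's remaining budget of roughly 24 - 13 - 6.5 ≈ 4.5-5G per rank caps the total kv cache at about 10G across both cards.)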

Your two cards should be able to run it, but two pieces of work are needed: distributing the vision model's memory across GPUs, and quantizing the LLM part of the VLM. For the memory distribution I already have llava working, but I still need to adapt the other vision models before opening a PR. As for quantizing the LLM part, v1-5 uses internlm2-20b as its LLM, which does support quantization, but it has to be extracted from the VLM and some weight-name mappings need to be written. In addition, our AWQ quantization algorithm was recently updated and needs to be tested on more models. Overall, this will probably not be supported until the end of next month.

irexyc avatar Apr 30 '24 10:04 irexyc

Thanks for the reply.

xiangqi1997 avatar Apr 30 '24 11:04 xiangqi1997

Supported in the latest main.
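For anyone landing here later, a minimal sketch of how the quantize-and-serve flow looks with that support (my example; check the current docs for the exact flags and defaults):

    lmdeploy lite auto_awq OpenGVLab/InternVL-Chat-V1-5 --work-dir ./InternVL-Chat-V1-5-4bit
    lmdeploy serve api_server ./InternVL-Chat-V1-5-4bit --backend turbomind --model-format awq --tp 2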

AllentDan avatar May 24 '24 08:05 AllentDan