
AWQ uses more GPU memory than GPTQ

Open lyg95 opened this issue 2 years ago • 2 comments

We tested the llama model with both AWQ and GPTQ. AWQ does achieve higher accuracy than GPTQ.

But we found that when running inference on the llama model with the AWQ code, it uses more GPU memory than GPTQ.

The relevant test results are as follows.

For llama-7b with w4 and group_size=128, the quantized model size is 3.7 GB.
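As a sanity check on the 3.7 GB figure, the size of a 4-bit, group-128 checkpoint can be estimated from parameter counts. The numbers below (6.74B total parameters, 32000×4096 embedding and lm_head matrices kept in fp16, one fp16 scale and one 4-bit zero-point per group) are assumptions based on the public llama-7b configuration, so this is only a rough sketch:

```python
# Rough size estimate for llama-7b quantized to w4 with group_size=128.
# Assumed layout: packed 4-bit weights, one fp16 scale and one 4-bit
# zero-point per group of 128, embeddings and lm_head kept in fp16.
total_params = 6.74e9
embed_params = 2 * 32000 * 4096              # input embedding + lm_head (fp16)
quant_params = total_params - embed_params   # weights that get quantized

group_size = 128
weights = quant_params * 4 / 8                   # packed 4-bit weights
scales  = quant_params / group_size * 2          # fp16 scale per group
zeros   = quant_params / group_size * 4 / 8      # 4-bit zero-point per group
embeds  = embed_params * 2                       # fp16 embeddings

total_gb = (weights + scales + zeros + embeds) / 1024**3
print(f"estimated checkpoint size: {total_gb:.2f} GB")  # ~3.6 GB
```

This lands close to the reported 3.7 GB, so the checkpoint itself is consistent and the extra runtime memory must come from elsewhere.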

We used an A100 40GB GPU and tested on HumanEval.

GPTQ

  • use_cache=True — maximum memory used: 9.505859375 GB
  • use_cache=False — maximum memory used: 9.115234375 GB

AWQ

  • use_cache=True — maximum memory used: 26.47265625 GB
  • use_cache=False — maximum memory used: 36.96484375 GB

Two points in these results deserve attention:

  1. During inference, GPTQ uses less memory than AWQ.
  2. For AWQ, use_cache=False uses more memory than use_cache=True (usually use_cache=True is the one that requires more memory).
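On point 2: with use_cache=True, the extra memory would normally be the KV cache, whose size is easy to bound. A back-of-the-envelope sketch for llama-7b (32 layers, 32 heads, head_dim 128 are the public config values; an fp16 cache and batch size 1 are assumptions):

```python
# KV cache size for llama-7b: two tensors (K and V) per layer, each of
# shape (batch, n_heads, seq_len, head_dim), stored in fp16 (2 bytes each).
layers, heads, head_dim = 32, 32, 128
batch, seq_len, fp16_bytes = 1, 2048, 2

kv_bytes = 2 * layers * batch * heads * seq_len * head_dim * fp16_bytes
print(f"KV cache at seq_len={seq_len}: {kv_bytes / 1024**3:.2f} GB")  # 1.00 GB
```

At batch size 1 the cache tops out around 1 GB even at a 2048-token context, so a use_cache=False run consuming ~10 GB more than use_cache=True cannot be explained by the cache; the gap has to come from recomputation buffers or the kernels themselves.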

With use_cache=False, we used the GPTQ script to run inference on 4-bit llama-65b, which fits on a single GPU. When using AWQ, an OOM occurs.
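For scale: the packed 4-bit weights of llama-65b alone come to roughly 30 GB (65.2B parameters assumed; group scales, zero-points, and activations ignored), which is why the model can fit on a single 40 GB GPU with some headroom:

```python
# Packed 4-bit weight size for llama-65b (scales/zeros/activations ignored).
params = 65.2e9
weight_gb = params * 4 / 8 / 1024**3
print(f"~{weight_gb:.1f} GB of packed weights")  # ~30.4 GB
```

An OOM under AWQ in that setting therefore points at large runtime allocations rather than the weights themselves.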

I would like to ask whether you encountered any of the above problems in your tests. Could you share your thoughts on these issues? Thank you very much.

I noticed that in the forward pass, the main difference between GPTQ and AWQ is that AWQ uses Tensor Cores (I am not familiar with the details of Tensor Cores). Could Tensor Cores cause higher memory usage?

lyg95 avatar Aug 01 '23 03:08 lyg95

Hi, great to hear that you observed better accuracy!

The memory usage part is strange. When we test AWQ, it uses a similar amount of memory to GPTQ. Could you provide the script you used to run inference with AWQ? Thanks!

tonylins avatar Aug 01 '23 19:08 tonylins

Thanks for your reply. After a more detailed analysis of the quantized model, I found that the problem above is not caused by the AWQ algorithm itself; it appears to be related to the test tasks and the model configuration. As for the use_cache issue, I will analyze the code myself later.

lyg95 avatar Aug 03 '23 03:08 lyg95