
INT4 quantization only delivers 20%~35% faster inference than FP16 for LLaMA-13B on A100

Open · wanzhenchn opened this issue 10 months ago · 3 comments

INT4 quantization delivers only 20%~35% faster inference than FP16 for LLaMA-13B on a single A100 80GB PCIe, measured with batch sizes 1, 2, 4, 8, and 16, and prefill/decode lengths of 32, 64, 128, 256, and 512.

The detailed data is as follows:

(image: benchmark results table)

So, what's the problem?
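One way to sanity-check these numbers is a rough memory-bandwidth model of the decode phase. W4 quantization shrinks weight traffic ~4x, but per-step costs that do not shrink (KV-cache/activation traffic, dequantization and kernel overhead) cap the end-to-end speedup well below 4x; prefill is largely compute-bound and gains even less. The sketch below is a back-of-the-envelope estimate, not a measurement: the activation-traffic, bandwidth, and overhead numbers are all illustrative assumptions.

```python
# Rough, bandwidth-bound estimate of decode-step speedup from INT4 weights.
# All constants below are illustrative assumptions, not measured values.

def decode_step_time(weight_bytes, other_bytes, bandwidth, overhead_s=0.0):
    """Per-token decode time if bound by reading weights + other memory traffic."""
    return (weight_bytes + other_bytes) / bandwidth + overhead_s

params = 13e9                 # LLaMA-13B parameter count
fp16_weights = params * 2.0   # 2 bytes per weight
int4_weights = params * 0.5   # 0.5 bytes per weight (scales/zeros ignored)
other_bytes = 2e9             # assumed KV-cache + activation traffic per step
bandwidth = 2.0e12            # ~2 TB/s HBM bandwidth on A100 80GB (nominal)
dequant_overhead = 5e-3       # assumed per-step dequant/kernel overhead (5 ms)

t_fp16 = decode_step_time(fp16_weights, other_bytes, bandwidth)
t_int4 = decode_step_time(int4_weights, other_bytes, bandwidth, dequant_overhead)
print(f"estimated speedup: {t_fp16 / t_int4:.2f}x")
```

Under these assumptions the model predicts a speedup around 1.5x rather than the ~3.5x that weight-size reduction alone would suggest; shrinking the overhead term (faster dequant kernels) or growing the weight term (longer batches amortize it) moves the estimate up, which is one reason reported AWQ speedups vary so much across setups.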

wanzhenchn · Aug 23 '23 05:08