llm-awq
INT4 quantization only delivers 20%~35% faster inference than FP16 for LLaMA-13B on A100
INT4 quantization only delivers 20%~35% faster inference than FP16 for LLaMA-13B on a single A100 80GB PCIe, swept over batch sizes 1, 2, 4, 8, and 16 with `prefill_length` and decode length of 32, 64, 128, 256, and 512.
The detailed data is as follows:
So, what's the problem?
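For anyone trying to reproduce the grid above, a minimal timing harness along these lines should work. This is a sketch, not llm-awq's own benchmark script; the checkpoint name and the `generate()`-based loop are my assumptions, and the INT4 model would be loaded through the repo's own tooling and passed through the same function:

```python
import time
import torch
from transformers import AutoModelForCausalLM

def benchmark(model, batch_size, prefill_len, decode_len, vocab_size=32000):
    """Time one generate() call: prefill over `prefill_len` tokens plus
    `decode_len` greedy decode steps. Returns wall-clock seconds."""
    device = next(model.parameters()).device
    # Random token IDs stand in for a real prompt; only the shapes matter.
    input_ids = torch.randint(0, vocab_size, (batch_size, prefill_len), device=device)
    attention_mask = torch.ones_like(input_ids)
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.inference_mode():
        model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=decode_len,
            min_new_tokens=decode_len,  # force the full decode length
            do_sample=False,
        )
    torch.cuda.synchronize()
    return time.perf_counter() - start

# FP16 baseline; an INT4 model produced by llm-awq would be loaded through
# the repo's own scripts and timed with the same loop for comparison.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-13b",  # illustrative checkpoint name, an assumption
    torch_dtype=torch.float16,
    device_map="cuda",
)
for bs in (1, 2, 4, 8, 16):
    for seq in (32, 64, 128, 256, 512):
        secs = benchmark(model, bs, prefill_len=seq, decode_len=seq)
        print(f"batch={bs} prefill={seq} decode={seq}: {secs:.3f}s")
```

Note that at larger batch sizes and sequence lengths the workload becomes increasingly compute-bound rather than memory-bound, which is one plausible reason the INT4 speedup over FP16 shrinks in this grid: weight-only quantization mainly reduces memory traffic, not FLOPs.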