llm-awq
INT4 quantization only delivers 20%~35% faster inference than FP16 for LLaMA-13B on A100
INT4 quantization only delivers 20%~35% faster inference than FP16 for LLaMA-13B on a single A100 80GB PCIe, swept over batch sizes 1, 2, 4, 8, and 16 with `prefill_length` and decode length of 32, 64, 128, 256, and 512.
The detailed data is as follows:
So, what's the problem?
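For anyone trying to reproduce the grid above, a minimal timing harness along these lines should work. This is a sketch, not llm-awq's own benchmark script; the checkpoint name and the `generate()`-based loop are my assumptions, and the INT4 model would be loaded through the repo's own tooling and passed through the same function:

```python
import time
import torch
from transformers import AutoModelForCausalLM

def benchmark(model, batch_size, prefill_len, decode_len, vocab_size=32000):
    """Time one generate() call: prefill over `prefill_len` tokens plus
    `decode_len` greedy decode steps. Returns wall-clock seconds."""
    device = next(model.parameters()).device
    # Random token IDs stand in for a real prompt; only the shapes matter.
    input_ids = torch.randint(0, vocab_size, (batch_size, prefill_len), device=device)
    attention_mask = torch.ones_like(input_ids)
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.inference_mode():
        model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=decode_len,
            min_new_tokens=decode_len,  # force the full decode length
            do_sample=False,
        )
    torch.cuda.synchronize()
    return time.perf_counter() - start

# FP16 baseline; an INT4 model produced by llm-awq would be loaded through
# the repo's own scripts and timed with the same loop for comparison.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-13b",  # illustrative checkpoint name, an assumption
    torch_dtype=torch.float16,
    device_map="cuda",
)
for bs in (1, 2, 4, 8, 16):
    for seq in (32, 64, 128, 256, 512):
        secs = benchmark(model, bs, prefill_len=seq, decode_len=seq)
        print(f"batch={bs} prefill={seq} decode={seq}: {secs:.3f}s")
```

Note that at larger batch sizes and sequence lengths the workload becomes increasingly compute-bound rather than memory-bound, which is one plausible reason the INT4 speedup over FP16 shrinks in this grid: weight-only quantization mainly reduces memory traffic, not FLOPs.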