Request to recommend a related project LMDeploy
We have integrated this incredible work into our project, LMDeploy, which provides a complete LLM deployment toolkit covering compression, inference, and serving.
Additionally, by extensively optimizing the W4A16 kernel, LMDeploy significantly improves inference performance, as detailed in our benchmark results below.
We would be grateful if you would consider referencing our project in your README, as we believe it could be useful for your users. Please let me know if you have any questions.
We benchmarked the Llama-2-7B-chat and Llama-2-13B-chat models with 4-bit quantization on an NVIDIA GeForce RTX 4090. We measured the token generation throughput (tokens/s) using a single prompt token and 512 generated tokens. All results are for single-batch inference.
| model | mlc-llm | turbomind |
|---|---|---|
| Llama-2-7B-chat | 159.4 | 206.4 |
| Llama-2-13B-chat | 90.7 | 115.8 |
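For reference, here is a minimal sketch of how such a single-batch decode-throughput measurement can be set up. The `generate` callable is a hypothetical stand-in for whichever engine is being benchmarked; it is not a specific LMDeploy or mlc-llm API.

```python
import time

def measure_decode_throughput(generate, prompt_token_id=1, completion_tokens=512, warmup=1):
    """Measure single-batch token generation throughput in tokens/s.

    `generate` is a hypothetical callable taking a list of prompt token ids and
    the number of new tokens to produce; it blocks until generation finishes.
    """
    # Warm-up runs to exclude one-time costs such as kernel compilation
    # and cache allocation.
    for _ in range(warmup):
        generate([prompt_token_id], completion_tokens)

    start = time.perf_counter()
    generate([prompt_token_id], completion_tokens)
    elapsed = time.perf_counter() - start
    return completion_tokens / elapsed
```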
Memory (GB) comparison between the 4-bit and 16-bit models, with context sizes of 2048 and 4096 respectively:
| model | 16-bit (2048) | 4-bit (2048) | 16-bit (4096) | 4-bit (4096) |
|---|---|---|---|---|
| Llama-2-7B-chat | 15.1 | 6.3 | 16.2 | 7.5 |
| Llama-2-13B-chat | OOM | 10.3 | OOM | 12.0 |
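As a rough sanity check on these numbers, the weight memory alone can be estimated from the parameter count and the bits per weight. The sketch below ignores quantization scales/zeros, the KV cache, and activations, which is why the measured totals above are somewhat higher.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight-only memory footprint in GB."""
    return n_params * bits_per_weight / 8 / 1e9

for name, n_params in [("Llama-2-7B", 7e9), ("Llama-2-13B", 13e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    w4 = weight_memory_gb(n_params, 4)
    print(f"{name}: ~{fp16:.1f} GB (FP16 weights) vs ~{w4:.1f} GB (4-bit weights)")

# Llama-2-7B:  ~14.0 GB FP16 vs ~3.5 GB 4-bit
# Llama-2-13B: ~26.0 GB FP16 vs ~6.5 GB 4-bit -> the FP16 13B weights alone exceed
#              the RTX 4090's 24 GB, which matches the OOM entries above.
```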
Could you share some more performance data for FP16 precision? How does the performance of INT4 compare to FP16?
Sure. The following is LMDeploy's INT4 performance compared to FP16 on an A100-80G at different batch sizes. The test model is Llama-2-7B.
| batch | prompt tokens | completion tokens | throughput (W4) tokens/s | throughput (FP16) tokens/s |
|---|---|---|---|---|
| 1 | 1 | 512 | 236.6 | 96.95 |
| 4 | 1 | 512 | 893.79 | 376.63 |
| 8 | 1 | 512 | 1648.91 | 729.38 |
| 16 | 1 | 512 | 2763.53 | 1360.61 |
| 32 | 1 | 512 | 3696.35 | 2386.44 |
| 64 | 1 | 512 | 4708.82 | 3799.91 |
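Computing the W4/FP16 speedup from the table above makes the trend explicit: the advantage shrinks as the batch size grows (this is the point discussed further down in the thread).

```python
# Throughput numbers copied from the table above (A100-80G, Llama-2-7B).
batches = [1, 4, 8, 16, 32, 64]
w4 = [236.6, 893.79, 1648.91, 2763.53, 3696.35, 4708.82]
fp16 = [96.95, 376.63, 729.38, 1360.61, 2386.44, 3799.91]

for b, t_w4, t_fp16 in zip(batches, w4, fp16):
    print(f"batch {b:>2}: W4/FP16 speedup = {t_w4 / t_fp16:.2f}x")

# batch  1: 2.44x
# batch  4: 2.37x
# batch  8: 2.26x
# batch 16: 2.03x
# batch 32: 1.55x
# batch 64: 1.24x
```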
Many thanks for sharing.
Have you tested with different prompt and completion token counts? For example:
| batch | prompt tokens | completion tokens | throughput (W4) tokens/s | throughput (FP16) tokens/s |
|---|---|---|---|---|
| 1 | 64 | 64 | | |
| 4 | 128 | 128 | | |
| 8 | 256 | 512 | | |
| ... | ... | ... | ... | ... |
Not yet. We'll expand the benchmark matrix as soon as possible.
As the batch size increases, the acceleration effect of INT4 compared to FP16 gradually diminishes. Why?
When the batch size is small, inference is memory-bound (every step sweeps all layer weights). So, given constant memory bandwidth, smaller weights mean a shorter time per step.
When the batch size is large, computation cost outweighs memory operations. Since W4A16 and W16A16 both use FP16 for computation, they have the same MMA cost. The W4 version has the additional cost of dequantizing the weights on the fly; we hide that dequantization cost through careful software pipelining.
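A simple roofline-style model illustrates the effect. The bandwidth and compute numbers below are rough assumptions for an A100-class GPU, not measurements, and the model ignores the KV cache, attention, and dequantization overhead, so real speedups are smaller, but the shape of the curve is the same.

```python
def decode_step_time(batch, n_params, bytes_per_weight,
                     bandwidth_gbs=1500.0, compute_tflops=150.0):
    """Per-decode-step time as the max of weight-read time and FP16 MMA time.

    Illustrative assumptions: each step reads every weight once, and each
    token costs ~2 FLOPs per parameter. KV cache and dequantization ignored.
    """
    mem_time = n_params * bytes_per_weight / (bandwidth_gbs * 1e9)
    compute_time = 2 * n_params * batch / (compute_tflops * 1e12)
    return max(mem_time, compute_time)

n = 7e9  # Llama-2-7B parameter count
for batch in [1, 8, 64, 256]:
    t_fp16 = decode_step_time(batch, n, bytes_per_weight=2.0)  # W16A16
    t_w4 = decode_step_time(batch, n, bytes_per_weight=0.5)    # W4A16
    print(f"batch {batch:>3}: modeled W4 speedup {t_fp16 / t_w4:.2f}x")
```

In this toy model the speedup is capped at the 4x bytes-per-weight ratio while both versions are memory-bound, and it decays toward 1x once the compute term dominates, which matches the trend in the measured table.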
Your results look good. We would appreciate it if you could look into more models, such as MPT, for LMDeploy with W4A16, since AWQ supports many models.