Request to recommend a related project LMDeploy
We have integrated this incredible work into our project, LMDeploy, which provides a complete LLM deployment toolkit covering compression, inference, and serving.
Additionally, by extensively optimizing the W4A16 kernel, LMDeploy significantly improves inference performance, as detailed in our benchmark results below.
We would be grateful if you would consider referencing our project in your README, as we believe it could be useful for your users. Please let me know if you have any questions.
We benchmarked the Llama-2-7B-chat and Llama-2-13B-chat models with 4-bit quantization on an NVIDIA GeForce RTX 4090. We measured the token generation throughput (tokens/s) using a single prompt token and 512 generated tokens. All results are for single-batch inference.
| model | mlc-llm | turbomind |
|---|---|---|
| Llama-2-7B-chat | 159.4 | 206.4 |
| Llama-2-13B-chat | 90.7 | 115.8 |
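For reference, here is a minimal sketch of how such a single-batch decode-throughput measurement can be set up. The `generate` callable is a hypothetical stand-in for whichever engine is being benchmarked; it is not a specific LMDeploy or mlc-llm API.

```python
import time

def measure_decode_throughput(generate, prompt_token_id=1, completion_tokens=512, warmup=1):
    """Measure single-batch token generation throughput in tokens/s.

    `generate` is a hypothetical callable taking a list of prompt token ids and
    the number of new tokens to produce; it blocks until generation finishes.
    """
    # Warm-up runs to exclude one-time costs such as kernel compilation
    # and cache allocation.
    for _ in range(warmup):
        generate([prompt_token_id], completion_tokens)

    start = time.perf_counter()
    generate([prompt_token_id], completion_tokens)
    elapsed = time.perf_counter() - start
    return completion_tokens / elapsed
```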
Memory (GB) comparison between the 4-bit and 16-bit models, with context sizes of 2048 and 4096 respectively:
| model | 16-bit (2048) | 4-bit (2048) | 16-bit (4096) | 4-bit (4096) |
|---|---|---|---|---|
| Llama-2-7B-chat | 15.1 | 6.3 | 16.2 | 7.5 |
| Llama-2-13B-chat | OOM | 10.3 | OOM | 12.0 |
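As a rough sanity check on these numbers, the weight memory alone can be estimated from the parameter count and the bits per weight. The sketch below ignores quantization scales/zeros, the KV cache, and activations, which is why the measured totals above are somewhat higher.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight-only memory footprint in GB."""
    return n_params * bits_per_weight / 8 / 1e9

for name, n_params in [("Llama-2-7B", 7e9), ("Llama-2-13B", 13e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    w4 = weight_memory_gb(n_params, 4)
    print(f"{name}: ~{fp16:.1f} GB (FP16 weights) vs ~{w4:.1f} GB (4-bit weights)")

# Llama-2-7B:  ~14.0 GB FP16 vs ~3.5 GB 4-bit
# Llama-2-13B: ~26.0 GB FP16 vs ~6.5 GB 4-bit -> the FP16 13B weights alone exceed
#              the RTX 4090's 24 GB, which matches the OOM entries above.
```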
Could you share some more performance data for FP16 precision? How does the performance of INT4 compare to FP16?
Sure. The following is LMDeploy's INT4 performance compared to FP16 on an A100-80G at different batch sizes. The test model is Llama-2-7B.
| batch | prompt tokens | completion tokens | throughput (W4) tokens/s | throughput (FP16) tokens/s |
|---|---|---|---|---|
| 1 | 1 | 512 | 236.6 | 96.95 |
| 4 | 1 | 512 | 893.79 | 376.63 |
| 8 | 1 | 512 | 1648.91 | 729.38 |
| 16 | 1 | 512 | 2763.53 | 1360.61 |
| 32 | 1 | 512 | 3696.35 | 2386.44 |
| 64 | 1 | 512 | 4708.82 | 3799.91 |
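Computing the W4/FP16 speedup from the table above makes the trend explicit: the advantage shrinks as the batch size grows (this is the point discussed further down in the thread).

```python
# Throughput numbers copied from the table above (A100-80G, Llama-2-7B).
batches = [1, 4, 8, 16, 32, 64]
w4 = [236.6, 893.79, 1648.91, 2763.53, 3696.35, 4708.82]
fp16 = [96.95, 376.63, 729.38, 1360.61, 2386.44, 3799.91]

for b, t_w4, t_fp16 in zip(batches, w4, fp16):
    print(f"batch {b:>2}: W4/FP16 speedup = {t_w4 / t_fp16:.2f}x")

# batch  1: 2.44x
# batch  4: 2.37x
# batch  8: 2.26x
# batch 16: 2.03x
# batch 32: 1.55x
# batch 64: 1.24x
```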
Many thanks for sharing.
Have you tested with different prompt and completion token counts? For example:
| batch | prompt tokens | completion tokens | throughput (W4) tokens/s | throughput (FP16) tokens/s |
|---|---|---|---|---|
| 1 | 64 | 64 | | |
| 4 | 128 | 128 | | |
| 8 | 256 | 512 | | |
| ... | ... | ... | ... | ... |
Not yet. We'll expand the benchmark matrix as soon as possible.
As the batch size increases, the acceleration effect of INT4 compared to FP16 gradually diminishes. Why?
When the batch size is small, inference is memory-bound (every step sweeps all layer weights). So, given constant memory bandwidth, smaller weights mean a shorter time per step.
When the batch size is large, computation cost outweighs memory operations. Since W4A16 and W16A16 both use FP16 for computation, they have the same MMA cost. The W4 version has the additional cost of dequantizing the weights on the fly; we hide that dequantization cost through careful software pipelining.
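A simple roofline-style model illustrates the effect. The bandwidth and compute numbers below are rough assumptions for an A100-class GPU, not measurements, and the model ignores the KV cache, attention, and dequantization overhead, so real speedups are smaller, but the shape of the curve is the same.

```python
def decode_step_time(batch, n_params, bytes_per_weight,
                     bandwidth_gbs=1500.0, compute_tflops=150.0):
    """Per-decode-step time as the max of weight-read time and FP16 MMA time.

    Illustrative assumptions: each step reads every weight once, and each
    token costs ~2 FLOPs per parameter. KV cache and dequantization ignored.
    """
    mem_time = n_params * bytes_per_weight / (bandwidth_gbs * 1e9)
    compute_time = 2 * n_params * batch / (compute_tflops * 1e12)
    return max(mem_time, compute_time)

n = 7e9  # Llama-2-7B parameter count
for batch in [1, 8, 64, 256]:
    t_fp16 = decode_step_time(batch, n, bytes_per_weight=2.0)  # W16A16
    t_w4 = decode_step_time(batch, n, bytes_per_weight=0.5)    # W4A16
    print(f"batch {batch:>3}: modeled W4 speedup {t_fp16 / t_w4:.2f}x")
```

In this toy model the speedup is capped at the 4x bytes-per-weight ratio while both versions are memory-bound, and it decays toward 1x once the compute term dominates, which matches the trend in the measured table.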
Your results look good. We would appreciate it if you could look into more models, such as MPT, for LMDeploy with W4A16, since AWQ supports many models.