llm-awq
Question about the speed of AWQ vs. GPTQ
" Thanks to our efficient kernels, AWQ achieves 1.45× and 2× speedup over GPTQ and GPTQ with reordering on A100. It is also 1.85× faster than an FP16 cuBLAS implementation "
I have tested AWQ and GPTQ on LLaMA-7B, but I do not see the speedup mentioned in the paper. It is true that GPTQ without act-order is faster than with act-order.
Test environment:
- GPU: A100 40GB
- GPTQ: GPTQ-for-LLaMa (Triton kernels)
The following is my test process (pseudocode):

```python
input = 'hf' * (input_length // 2)
max_len = input_length + output_length
start_time = timeit.default_timer()
# encode input ...
for _ in range(repetition):
    model.generate()
    # decode output
    # record the number of tokens
# compute elapsed time
# compute throughput = total_tokens / elapsed_time
```
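To make this concrete, here is a minimal runnable version of the procedure above. The model path, lengths, and repetition count are placeholders, not the exact values from my runs:

```python
import timeit
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder values; swap in the FP16 / AWQ / GPTQ model under test.
model_path = "huggyllama/llama-7b"
input_length, output_length, repetition = 512, 256, 5

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).to("cuda")

# Synthetic prompt of roughly input_length tokens, as above.
prompt = "hf" * (input_length // 2)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
max_len = inputs.input_ids.shape[-1] + output_length

total_tokens = 0
start_time = timeit.default_timer()
for _ in range(repetition):
    output_ids = model.generate(**inputs, max_length=max_len, use_cache=True)
    tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    # Token count includes the prompt, matching "generated length (include input)" below.
    total_tokens += output_ids.shape[0] * output_ids.shape[-1]
elapsed = timeit.default_timer() - start_time
print(f"throughput: {total_tokens / elapsed:.3f} tokens/sec")
```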
Here are some test results for reference.
| input_length | generated length (incl. input) | use_cache | FP16 (tokens/sec) | AWQ (tokens/sec) | GPTQ (tokens/sec) |
|---|---|---|---|---|---|
| 512 | 768 | True | 142.641 | 144.663 | 138.138 (no act_order) |
| 1024 | 1140 | True | 441.522 | 373.970 | 382.077 |
I also tested the throughput of the generate step alone, excluding post-processing, with similar results.
When `use_cache=False`, AWQ is always slower than GPTQ in my own tests.
Is there any problem with my test process above that would explain why I failed to reproduce the results of the paper?
Can you provide details about the test?
Thank you.
Which CPU is used? And can you post your full code including how you load the models?
Also, it looks like you did not try out TinyChat, which offers 2x faster inference in my experience. This is the fastest way to run AWQ models right now.
EDIT: To answer your question as well, I believe they measure it differently than you do. They most likely measured GPTQ Triton matrix-vector speed vs the AWQ matrix-matrix speed, as mentioned in the paper:

> Unlike GPTQ [17] which formulates linear layers as matrix-vector (MV) products, we instead model these layers as matrix-matrix (MM) multiplications. MV can only be executed on slow CUDA cores while MM can be executed on the 16× faster tensor cores on A100 and H100. Our formulation also minimizes structural hazards compared to [17] since MM and other instructions such as dequantization and memory access are executed on separate function units. We also outperform a recent Triton [46] implementation [40] for GPTQ by 2.4× since it relies on a high-level language and forgoes opportunities for low-level optimizations.
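To illustrate the shapes involved (plain FP16 PyTorch, not the actual quantized kernels): with `use_cache=True`, each decode step feeds the linear layers a single token per sequence, which is matrix-vector shaped, while the prefill (or `use_cache=False`) feeds the whole sequence at once, which is matrix-matrix shaped and can map to tensor cores.

```python
import torch

hidden = 4096  # LLaMA-7B hidden size
weight = torch.randn(hidden, hidden, dtype=torch.float16, device="cuda")

# Decode step (use_cache=True): one new token per sequence -> GEMV-shaped workload.
x_decode = torch.randn(1, 1, hidden, dtype=torch.float16, device="cuda")
y_decode = x_decode @ weight.T    # activation shape [1, 1, 4096]

# Prefill (or use_cache=False): the full sequence at once -> GEMM-shaped workload.
x_prefill = torch.randn(1, 512, hidden, dtype=torch.float16, device="cuda")
y_prefill = x_prefill @ weight.T  # activation shape [1, 512, 4096]
```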
CPU info: Intel(R) Xeon(R) Gold 6230N CPU @ 2.30GHz. When I tested, I used numactl to bind to a single CPU core.
Of course. I have attached the code for my test here: llama_awq-gptq_infer_test.txt
For TinyChat, I pulled the code before TinyChat was merged, so I haven't tested it yet; I will run a test.
Here is some feedback.
- This part should not be a loop; just run `tokenizer.decode` on `generation_output` and use `token_num += len(generation_output)` (see the sketch after this list):

  ```python
  for output in generation_output:
      tokenizer.decode(output, skip_special_tokens=True)
      token_num += output.shape[-1]
  ```

- Your use of `num_beams`, `num_return_sequences`, and `no_repeat_ngram_size` could severely affect the model's performance. Turn these off for a normal comparison.
- Your simultaneous use of both models could limit memory bandwidth. I would run these tests separately.
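Here is a minimal sketch of what I mean for the first point, assuming `generation_output` is the tensor returned by `model.generate` (variable names are illustrative):

```python
# Decode the whole batch in one call instead of looping over sequences.
texts = tokenizer.batch_decode(generation_output, skip_special_tokens=True)

# Same token count as the original loop: sequences * tokens per sequence.
token_num += generation_output.shape[0] * generation_output.shape[-1]
```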
Thank you for your suggestions.
For the first two points, I will run the tests again.
For the third point, the test script I provided may have misled you: when I was testing, the two models were tested separately. I only put them in the same file for convenience.
Hey @lyg95 and @casper-hansen,
Can you please try to benchmark AWQ with HuggingFace's text-generation-inference (TGI) instead of the native HuggingFace model.generate method? I have a fork that supports loading AWQ models in TGI.
For example, you can load AWQ models as follows (after building from source):

```sh
text-generation-launcher --huggingface-hub-cache ~/.cache/huggingface/hub/ --model-id abhinavkulkarni/codellama-CodeLlama-7b-Python-hf-w4-g128-awq --trust-remote-code --port 8080 --max-input-length 2048 --max-total-tokens 4096 --max-batch-prefill-tokens 4096 --quantize awq
```
Please install Flash Attention v1 or v2 and vLLM so that you benefit from PagedAttention and FlashAttention.
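Once the server is running, you can sanity-check it and measure end-to-end latency against TGI's /generate endpoint; the prompt and max_new_tokens below are just placeholders:

```python
import requests

# Assumes the text-generation-launcher command above is serving on port 8080.
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Write a Python function that reverses a string.",
        "parameters": {"max_new_tokens": 256},
    },
)
print(resp.json()["generated_text"])
```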
@abhinavkulkarni Does this integrate with the fused AWQ modules?
For maximum speed, you can also use the AutoAWQ speed benchmark, which uses these fused modules by default for all LLaMA models.
```sh
python -m awq.entry --entry_type speed --model_path casperhansen/vicuna-7b-v1.5-awq --quant_file awq_model_w4_g128.pt
```
This should yield 70-80 tokens/s on a weak CPU and GPU combo like an RTX 3090 + EPYC 7, and 100+ tokens/s on a strong combo like an RTX 4090 + i9-13900K.
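If you would rather call it from Python, loading with the fused modules looks roughly like this, if I remember the AutoAWQ API correctly (please check the AutoAWQ README for the exact signature):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Paths mirror the CLI command above; fuse_layers enables the fused modules.
# Note: API as of AutoAWQ at the time of writing; signatures may differ.
quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

tokens = tokenizer("Tell me a joke.", return_tensors="pt").input_ids.cuda()
output = model.generate(tokens, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```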
@casper-hansen I found that this is inconsistent with the content of the paper: the WQLinear layers now use the 'gemv' operation instead of 'gemm'. Was this change made for scenarios like token generation in TinyChat?