
the question about the speed of AWQ && GPTQ

Open lyg95 opened this issue 2 years ago • 8 comments

" Thanks to our efficient kernels, AWQ achieves 1.45× and 2× speedup over GPTQ and GPTQ with reordering on A100. It is also 1.85× faster than an FP16 cuBLAS implementation "

lyg95 avatar Aug 11 '23 07:08 lyg95

I have tested AWQ and GPTQ on llama-7b, but I do not see the speedup mentioned in the paper. It is true that GPTQ without act-order is faster than with act-order.

test environment: GPU: A100 40GB

GPTQ: GPTQ-for-LLaMa triton

The following is my test process (simplified; the tokenizer and model are loaded beforehand):

import timeit

input_text = 'hf' * (input_length // 2)   # repeat 'hf' to reach roughly input_length tokens
max_len = input_length + output_length

input_ids = tokenizer(input_text, return_tensors='pt').input_ids.cuda()

total_tokens = 0
start_time = timeit.default_timer()
for _ in range(repetition):
    generation_output = model.generate(input_ids, max_length=max_len)
    tokenizer.decode(generation_output[0], skip_special_tokens=True)   # decode output
    total_tokens += generation_output.shape[-1]                        # record the number of tokens

elapsed_time = timeit.default_timer() - start_time
throughput = total_tokens / elapsed_time   # tokens/sec

Here I supply some test results for reference.

input_length: 512, generated length (including input): 768, use_cache=True
fp16: 142.641 tokens/sec, awq: 144.663 tokens/sec, gptq (no act_order): 138.138 tokens/sec

input_length: 1024, generated length (including input): 1140, use_cache=True
fp16: 441.522 tokens/sec, awq: 373.970 tokens/sec, gptq: 382.077 tokens/sec

I also tested the throughput of the generate call itself, excluding post-processing, and got similar results.

With use_cache=False, AWQ is always slower than GPTQ in my tests.

Is there any problem with my test process above that would explain why I fail to reproduce the results of the paper?

Can you provide details about the test?

Thank you.

lyg95 avatar Aug 11 '23 08:08 lyg95

Which CPU is used? And can you post your full code including how you load the models?

Also, it looks like you did not try out TinyChat, which offers 2x faster inference in my experience. It is currently the fastest way to run AWQ models.

EDIT: To answer your question as well, I believe they measure it differently than you do. They most likely measured GPTQ Triton matrix-vector speed vs the AWQ matrix-matrix speed as mentioned in the paper:

Unlike GPTQ [17] which formulates linear layers as matrix-vector (MV) products, we instead model these layers as matrix-matrix (MM) multiplications. MV can only be executed on slow CUDA cores while MM can be executed on the 16× faster tensor cores on A100 and H100. Our formulation also minimizes structural hazards compared to [17] since MM and other instructions such as dequantization and memory access are executed on separate function units. We also outperform a recent Triton [46] implementation [40] for GPTQ by 2.4× since it relies on a high-level language and forgoes opportunities for low-level optimizations
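
To make the MV vs. MM distinction concrete, here is a toy PyTorch sketch, not the actual AWQ or GPTQ kernel code; the shapes, dtype, and CUDA device are illustrative assumptions. With a single token the activation is effectively a vector, while batching several tokens turns the same linear layer into a matrix-matrix product that is eligible for fp16 tensor cores:

import torch

hidden = 4096
W = torch.randn(hidden, hidden, dtype=torch.float16, device='cuda')     # stand-in for a dequantized weight

x_single = torch.randn(1, hidden, dtype=torch.float16, device='cuda')   # one token: matrix-vector-shaped workload
x_batch = torch.randn(16, hidden, dtype=torch.float16, device='cuda')   # batched tokens: matrix-matrix workload

y_single = x_single @ W.t()   # MV-style product, typically bound to CUDA cores
y_batch = x_batch @ W.t()     # MM-style product, can be mapped onto tensor cores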

casper-hansen avatar Aug 11 '23 16:08 casper-hansen

The CPU info: Intel(R) Xeon(R) Gold 6230N CPU @ 2.30GHz. When testing, I used numactl to bind to a single CPU core.

Of course, I have attached the code for my test here: llama_awq-gptq_infer_test.txt

As for TinyChat, I pulled the code before TinyChat was merged, so I haven't tested it yet; I will run a test.

lyg95 avatar Aug 12 '23 08:08 lyg95

Here is some feedback.

  1. This part should not be a loop; just run tokenizer.decode on generation_output and use token_num += len(generation_output):

for output in generation_output:
    tokenizer.decode(output, skip_special_tokens=True)
    token_num += output.shape[-1]

  2. Your use of num_beams, num_return_sequences, and no_repeat_ngram_size could severely affect the model's performance. Turn these off for a normal comparison (see the sketch after this list).
  3. Your simultaneous use of both models could limit memory bandwidth. I would run these tests separately.
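
For reference, a minimal sketch of what a plain comparison run could look like; the argument names are the standard Hugging Face generate parameters, and the specific values are my assumptions for greedy decoding (no_repeat_ngram_size is left at its default of 0, i.e. disabled):

generation_output = model.generate(
    input_ids,
    max_length=max_len,
    do_sample=False,         # greedy decoding, no sampling
    num_beams=1,             # disable beam search
    num_return_sequences=1,  # a single output sequence
    use_cache=True,
)
text = tokenizer.decode(generation_output[0], skip_special_tokens=True)
token_num += generation_output.shape[-1]   # sequence length including the prompt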

casper-hansen avatar Aug 12 '23 16:08 casper-hansen

Thank you for your suggestions.

For the first two points, I will test again.

For the third point, the test script I provided may have misled you: when I was testing, the two models were run separately; I only put them in the same file for convenience.

lyg95 avatar Aug 13 '23 10:08 lyg95

Hey @lyg95 and @casper-hansen,

Can you please try benchmarking AWQ with Hugging Face's text-generation-inference (TGI) instead of the native Hugging Face model.generate method? I have a fork that supports loading AWQ models in TGI.

For example, you can load AWQ models as follows (after building from source):

text-generation-launcher --huggingface-hub-cache ~/.cache/huggingface/hub/ --model-id abhinavkulkarni/codellama-CodeLlama-7b-Python-hf-w4-g128-awq --trust-remote-code --port 8080 --max-input-length 2048 --max-total-tokens 4096 --max-batch-prefill-tokens 4096 --quantize awq

Please install Flash Attention v1 or v2 and vLLM so that you benefit from PagedAttention and FlashAttention.
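
Once the server is up, one way to drive it for a quick throughput check is TGI's /generate HTTP endpoint; here is a minimal Python sketch, where the prompt and generation parameters are just placeholders:

import requests

# Assumes the text-generation-launcher command above is running on port 8080.
resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={"inputs": "What is AWQ?", "parameters": {"max_new_tokens": 64}},
)
print(resp.json()["generated_text"])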

abhinavkulkarni avatar Sep 02 '23 11:09 abhinavkulkarni

@abhinavkulkarni Does this integrate with the fused AWQ modules?

For maximum speed, you can also use the AutoAWQ speed benchmark, which uses these fused modules by default for all LLaMa models.

python -m awq.entry --entry_type speed --model_path casperhansen/vicuna-7b-v1.5-awq --quant_file awq_model_w4_g128.pt

This should yield 70-80 tokens/s on a weaker CPU and GPU combo like an RTX 3090 + EPYC 7, and 100+ tokens/s on a stronger combo like an RTX 4090 + i9 13900K.

casper-hansen avatar Sep 02 '23 12:09 casper-hansen

@casper-hansen Regarding the paper excerpt you quoted above (the matrix-vector vs. matrix-matrix formulation), I found that the current code is inconsistent with it: the WQLinear layers now use a 'gemv' operation instead of 'gemm'. Was this change made for scenarios like token generation in TinyChat?

ariwaranosai avatar Oct 02 '23 17:10 ariwaranosai