
Add GPTQ quantization kernels for 2, 3, 8-bit use cases

JasonZhu1313 opened this issue 6 months ago · 7 comments

Earlier, there was an awesome PR (https://github.com/vllm-project/vllm/pull/916) that added support for the GPTQ exllama kernel in the 4-bit quantization setup. This PR introduces additional kernels for other bit widths, sourced from the AutoGPTQ repository, which Hugging Face uses for GPTQ quantization.

The same kernels can also be leveraged by our recent post-training quantization work, QuantEase (https://arxiv.org/abs/2309.01885; we'll release the QuantEase algorithm repo soon), where we achieved better zero-shot accuracy for 3-bit quantization.

We are adding two additional flags to GPTQConfig, aligned with the AutoGPTQ and HF conventions:

  • use_triton: use the Triton kernel for the 2-, 4-, and 8-bit setups; it is slower than the exllama and CUDA kernels.
  • disable_exllama: disable the exllama kernel in the 4-bit setup; the CUDA or Triton kernel is then used, depending on the use_triton flag.
  • In the 3-bit setup, the default CUDA kernel is always used.
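Taken together, the flags imply a kernel-selection rule like the following. This is a hypothetical sketch of the dispatch logic for illustration, not the actual vLLM implementation; the function name is made up:

```python
def select_gptq_kernel(bits: int, use_triton: bool, disable_exllama: bool) -> str:
    """Illustrative dispatch: which kernel serves a given GPTQ config."""
    if bits == 3:
        # 3-bit has no exllama/triton path; the CUDA kernel is always used.
        return "cuda"
    if bits == 4 and not disable_exllama:
        # exllama is the fastest 4-bit kernel; used unless explicitly disabled.
        return "exllama"
    # For 2/4/8-bit with exllama unavailable or disabled, fall back to
    # the Triton or CUDA kernel depending on the use_triton flag.
    return "triton" if use_triton else "cuda"
```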

Test:

Tested on llama 7b model

After GPTQ quantization, you need to add the additional args to the saved quantize_config.json. An example:

{
  "bits": 3,
  "group_size": 128,
  "damp_percent": 0.01,
  "desc_act": true,
  "static_groups": false,
  "sym": true,
  "true_sequential": true,
  "model_name_or_path": null,
  "model_file_base_name": null,
  "use_triton": false,
  "disable_exllama": true
}
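Since AutoGPTQ does not write these two flags itself, one way to append them is to patch the saved config file in place. A minimal sketch, assuming quantize_config.json sits at the top of your quantized model directory (the helper name is illustrative):

```python
import json
from pathlib import Path

def add_vllm_gptq_flags(model_dir: str, use_triton: bool = False,
                        disable_exllama: bool = True) -> dict:
    """Append the two new flags to an existing quantize_config.json."""
    cfg_path = Path(model_dir) / "quantize_config.json"
    config = json.loads(cfg_path.read_text())
    config["use_triton"] = use_triton
    config["disable_exllama"] = disable_exllama
    cfg_path.write_text(json.dumps(config, indent=2))
    return config
```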

Test script

from vllm import LLM, SamplingParams

prompt = "What is large language model?"
sampling_params = SamplingParams(temperature=0.8, top_p=0.5, max_tokens=100)
model_path = "..."
llm = LLM(model=model_path, trust_remote_code=True, tensor_parallel_size=2,
          quantization="gptq", tokenizer_mode="slow")
outputs = llm.generate(prompt, sampling_params)
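The "total time" / "average time" figures below were measured around the generate call. A generic timing helper along these lines reproduces them; this is a hypothetical sketch (any callable works, not just llm.generate):

```python
import time

def timed_runs(fn, n_runs: int = 1):
    """Call fn n_runs times; print and return total and average wall time."""
    results = []
    start = time.perf_counter()
    for _ in range(n_runs):
        results.append(fn())
    total = time.perf_counter() - start
    average = total / n_runs
    print(f"total time {total}")
    print(f"average time {average}")
    return results, total, average

# e.g. outputs, total, avg = timed_runs(lambda: llm.generate(prompt, sampling_params))
```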

Output from exllama kernel under 4-bit quantization

total time 1.5789778232574463
average time 1.5789778232574463
RequestOutput(request_id=0, prompt='What is large language model?', prompt_token_ids=[2, 1724, 338, 2919, 4086, 1904, 29973], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model', token_ids=[13, 3833, 880, 313, 29896, 29929, 29929, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29896, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29941, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29955, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29929, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29896, 29900, 29897, 13, 5618, 338, 2919, 4086, 1904], cumulative_logprob=-24.236552625894547, logprobs=None, finish_reason=length)], finished=True)
Generated text from vllm: '\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model'

Output from triton kernel under 4-bit quantization

total time 6.523277759552002
average time 6.523277759552002
RequestOutput(request_id=0, prompt='What is large language model?', prompt_token_ids=[2, 1724, 338, 2919, 4086, 1904, 29973], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model', token_ids=[13, 3833, 880, 313, 29896, 29929, 29929, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29896, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29941, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29955, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29929, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29896, 29900, 29897, 13, 5618, 338, 2919, 4086, 1904], cumulative_logprob=-24.21131780743599, logprobs=None, finish_reason=length)], finished=True)
Generated text from vllm: '\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model'

Output from CUDA kernel under 4-bit quantization

total time 2.3482797145843506
average time 2.3482797145843506
RequestOutput(request_id=0, prompt='What is large language model?', prompt_token_ids=[2, 1724, 338, 2919, 4086, 1904, 29973], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model', token_ids=[13, 3833, 880, 313, 29896, 29929, 29929, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29896, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29941, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29955, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29929, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29896, 29900, 29897, 13, 5618, 338, 2919, 4086, 1904], cumulative_logprob=-24.14222851395607, logprobs=None, finish_reason=length)], finished=True)
Generated text from vllm: '\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model'

Output from CUDA kernel under 3-bit quantization


total time 3.6984071731567383
average time 3.6984071731567383
RequestOutput(request_id=0, prompt='What is large language model?', prompt_token_ids=[2, 1724, 338, 2919, 4086, 1904, 29973], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n2018-01-25 00:31:12\nWhat is large language model?\nA language model is a statistical model that describes the relationship between a set of variables and a set of observations. The variables are called the predictors, and the observations are called the outcomes.\nThe model is used to predict the outcome of interest, given the predictors.\nA large language model is a statistical model that describes the relationship between a set', token_ids=[13, 29906, 29900, 29896, 29947, 29899, 29900, 29896, 29899, 29906, 29945, 29871, 29900, 29900, 29901, 29941, 29896, 29901, 29896, 29906, 13, 5618, 338, 2919, 4086, 1904, 29973, 13, 29909, 4086, 1904, 338, 263, 24148, 1904, 393, 16612, 278, 9443, 1546, 263, 731, 310, 3651, 322, 263, 731, 310, 13917, 29889, 450, 3651, 526, 2000, 278, 8500, 943, 29892, 322, 278, 13917, 526, 2000, 278, 714, 26807, 29889, 13, 1576, 1904, 338, 1304, 304, 8500, 278, 21957, 310, 4066, 29892, 2183, 278, 8500, 943, 29889, 13, 29909, 2919, 4086, 1904, 338, 263, 24148, 1904, 393, 16612, 278, 9443, 1546, 263, 731], cumulative_logprob=-51.450629502534866, logprobs=None, finish_reason=length)], finished=True)
Generated text from vllm: '\n2018-01-25 00:31:12\nWhat is large language model?\nA language model is a statistical model that describes the relationship between a set of variables and a set of observations. The variables are called the predictors, and the observations are called the outcomes.\nThe model is used to predict the outcome of interest, given the predictors.\nA large language model is a statistical model that describes the relationship between a set'

JasonZhu1313, Dec 20 '23 21:12