Add GPTQ quantization kernels for 2, 3, 8-bit use cases
An earlier PR (https://github.com/vllm-project/vllm/pull/916) added support for the GPTQ Exllama kernel in the 4-bit quantization setup. This PR introduces additional kernels for other bit widths, sourced from the AutoGPTQ repository, which Hugging Face uses for GPTQ quantization.
The same kernels can also be leveraged by our recent post-training quantization work, QuantEase (https://arxiv.org/abs/2309.01885; we'll release the QuantEase algorithm repo soon), where we achieved better zero-shot accuracy for 3-bit quantization.
We are adding two flags to GPTQConfig, aligned with the AutoGPTQ & HF convention:
- `use_triton`: use the Triton kernel under the 2-, 4-, and 8-bit setups; this is slower than the exllama and CUDA kernels.
- `disable_exllama`: disable the exllama kernel under the 4-bit setup; the CUDA or Triton kernel will be used instead, depending on the `use_triton` flag.
- Under the 3-bit setup, the default CUDA kernel is always used.
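For reference, the kernel selection these flags imply can be sketched as follows (a hypothetical helper for illustration only, not the actual dispatch code in this PR):

```python
def choose_kernel(bits: int, use_triton: bool, disable_exllama: bool) -> str:
    """Pick a GPTQ kernel from the bit width and the two new config flags.

    Mirrors the rules above: exllama is 4-bit only, 3-bit always falls
    back to the CUDA kernel, and use_triton selects Triton over CUDA
    everywhere else.
    """
    if bits == 3:
        return "cuda"      # 3-bit: only the CUDA kernel is available
    if bits == 4 and not disable_exllama:
        return "exllama"   # fastest option for 4-bit
    return "triton" if use_triton else "cuda"
```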
Test:
Tested on a LLaMA 7B model.
You need to add the additional args to the saved quantize_config.json after GPTQ quantization, for example:
```json
{
  "bits": 3,
  "group_size": 128,
  "damp_percent": 0.01,
  "desc_act": true,
  "static_groups": false,
  "sym": true,
  "true_sequential": true,
  "model_name_or_path": null,
  "model_file_base_name": null,
  "use_triton": false,
  "disable_exllama": true
}
```
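Patching a saved config can be done with a few lines of stdlib Python; the helper and file path below are illustrative, not part of this PR:

```python
import json
from pathlib import Path


def add_kernel_flags(config: dict, use_triton: bool = False,
                     disable_exllama: bool = True) -> dict:
    """Return a copy of a quantize_config dict with the new flags set."""
    patched = dict(config)
    patched["use_triton"] = use_triton
    patched["disable_exllama"] = disable_exllama
    return patched


# Example: rewrite the config saved by AutoGPTQ (illustrative path)
config_path = Path("model_dir/quantize_config.json")
if config_path.exists():
    config = json.loads(config_path.read_text())
    config_path.write_text(json.dumps(add_kernel_flags(config), indent=2))
```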
Test script:

```python
from vllm import LLM, SamplingParams

prompt = "What is large language model?"
sampling_params = SamplingParams(temperature=0.8, top_p=0.5, max_tokens=100)
model_path = "..."
llm = LLM(model=model_path, trust_remote_code=True, tensor_parallel_size=2,
          quantization="gptq", tokenizer_mode="slow")
outputs = llm.generate(prompt, sampling_params)
```
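The total/average time figures reported below can be reproduced with a small wall-clock wrapper along these lines (a generic sketch; `timed` and `average_time` are hypothetical helpers, not part of vLLM):

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def average_time(fn, runs, *args, **kwargs):
    """Average wall-clock time of fn over several runs."""
    total = 0.0
    for _ in range(runs):
        _, elapsed = timed(fn, *args, **kwargs)
        total += elapsed
    return total / runs
```

For a single request, `outputs, total = timed(llm.generate, prompt, sampling_params)` gives identical total and average times, matching the logs below.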
Output from exllama kernel under 4-bit quantization
```
total time 1.5789778232574463
average time 1.5789778232574463
RequestOutput(request_id=0, prompt='What is large language model?', prompt_token_ids=[2, 1724, 338, 2919, 4086, 1904, 29973], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model', token_ids=[13, 3833, 880, 313, 29896, 29929, 29929, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29896, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29941, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29955, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29929, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29896, 29900, 29897, 13, 5618, 338, 2919, 4086, 1904], cumulative_logprob=-24.236552625894547, logprobs=None, finish_reason=length)], finished=True)
Generated text from vllm: '\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model'
```
Output from triton kernel under 4-bit quantization
```
total time 6.523277759552002
average time 6.523277759552002
RequestOutput(request_id=0, prompt='What is large language model?', prompt_token_ids=[2, 1724, 338, 2919, 4086, 1904, 29973], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model', token_ids=[13, 3833, 880, 313, 29896, 29929, 29929, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29896, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29941, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29955, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29929, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29896, 29900, 29897, 13, 5618, 338, 2919, 4086, 1904], cumulative_logprob=-24.21131780743599, logprobs=None, finish_reason=length)], finished=True)
Generated text from vllm: '\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model'
```
Output from CUDA kernel under 4-bit quantization
```
total time 2.3482797145843506
average time 2.3482797145843506
RequestOutput(request_id=0, prompt='What is large language model?', prompt_token_ids=[2, 1724, 338, 2919, 4086, 1904, 29973], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model', token_ids=[13, 3833, 880, 313, 29896, 29929, 29929, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29896, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29941, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29955, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29929, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29896, 29900, 29897, 13, 5618, 338, 2919, 4086, 1904], cumulative_logprob=-24.14222851395607, logprobs=None, finish_reason=length)], finished=True)
Generated text from vllm: '\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model'
```
Output from CUDA kernel under 3-bit quantization
```
total time 3.6984071731567383
average time 3.6984071731567383
RequestOutput(request_id=0, prompt='What is large language model?', prompt_token_ids=[2, 1724, 338, 2919, 4086, 1904, 29973], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n2018-01-25 00:31:12\nWhat is large language model?\nA language model is a statistical model that describes the relationship between a set of variables and a set of observations. The variables are called the predictors, and the observations are called the outcomes.\nThe model is used to predict the outcome of interest, given the predictors.\nA large language model is a statistical model that describes the relationship between a set', token_ids=[13, 29906, 29900, 29896, 29947, 29899, 29900, 29896, 29899, 29906, 29945, 29871, 29900, 29900, 29901, 29941, 29896, 29901, 29896, 29906, 13, 5618, 338, 2919, 4086, 1904, 29973, 13, 29909, 4086, 1904, 338, 263, 24148, 1904, 393, 16612, 278, 9443, 1546, 263, 731, 310, 3651, 322, 263, 731, 310, 13917, 29889, 450, 3651, 526, 2000, 278, 8500, 943, 29892, 322, 278, 13917, 526, 2000, 278, 714, 26807, 29889, 13, 1576, 1904, 338, 1304, 304, 8500, 278, 21957, 310, 4066, 29892, 2183, 278, 8500, 943, 29889, 13, 29909, 2919, 4086, 1904, 338, 263, 24148, 1904, 393, 16612, 278, 9443, 1546, 263, 731], cumulative_logprob=-51.450629502534866, logprobs=None, finish_reason=length)], finished=True)
Generated text from vllm: '\n2018-01-25 00:31:12\nWhat is large language model?\nA language model is a statistical model that describes the relationship between a set of variables and a set of observations. The variables are called the predictors, and the observations are called the outcomes.\nThe model is used to predict the outcome of interest, given the predictors.\nA large language model is a statistical model that describes the relationship between a set'
```