AutoGPTQ integration
Hi there 👋
This is a somewhat late response to #583.
The task itself turned out to be quite large, so to speed up the process (and simplify life for whoever reviews the PR) I decided to include only the basics: the code can quantize a model and run inference, and it supports all the AutoGPTQ kernels. The remaining AutoGPTQ functionality will be added in subsequent pull requests; a rough sketch of the basic flow is shown below the list.
This PR doesn't include:
- loading/uploading quantized weights to/from HF hub
- AWQ support (yes, AutoGPTQ supports even this)
- fused attention and MLP layers (looking forward to implementing it ~~no I don't~~)
- possibly something more, but I'm not sure we should integrate everything
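For orientation, this is roughly how the quantize → save → load → generate flow looks when driving the upstream AutoGPTQ library directly. It's a minimal sketch: the model id, output directory, and calibration text are placeholders, and the litgpt-side API introduced by this PR is not reproduced here.

```python
# Minimal sketch of the upstream AutoGPTQ flow; paths and model ids are placeholders.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_dir = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder checkpoint
quantized_dir = "tinyllama-1.1b-gptq-4bit"             # placeholder output directory

tokenizer = AutoTokenizer.from_pretrained(pretrained_dir, use_fast=True)

# Calibration data: a list of tokenized samples (real runs use many more of them).
examples = [tokenizer("GPTQ quantizes weights to 4 bit with minimal quality loss.")]

# Same settings as in the benchmark config below.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_dir, quantize_config)
model.quantize(examples)             # run the GPTQ algorithm layer by layer
model.save_quantized(quantized_dir)  # write the packed 4-bit weights

# Load the quantized checkpoint and generate; kernel selection flags
# (e.g. use_triton=True) depend on the installed AutoGPTQ version.
model = AutoGPTQForCausalLM.from_quantized(quantized_dir, device="cuda:0")
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```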
Benchmarks
Benchmarking was done on 1x A10G with the TinyLlama model (1.1B parameters).
Quantization config:
- 4-bit precision
- group_size of 128
- desc_act (act_order) disabled
There are two tables: one for the prefill stage and one for the new token generation stage.
The prefill stage was simulated by feeding the first 1024 samples from the Alpaca dataset into the model, one sample at a time, and averaging the results across them.
New token generation was measured by generating 100 new tokens, 100 times, using the default prompt from generate/base.py.
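For context, the token/sec numbers come from a loop along these lines. This is a simplified sketch, not the actual measurement script: `generate_fn` is a hypothetical stand-in for the model's generate call, and dataset/prompt loading is omitted.

```python
import time
import torch

def measure_tokens_per_sec(generate_fn, prompt_ids, new_tokens=100, iters=100):
    """Rough throughput measurement. `generate_fn` is a hypothetical stand-in for
    the model's generate call: it should produce `new_tokens` tokens for `prompt_ids`."""
    generate_fn(prompt_ids, new_tokens)  # warm-up so lazy kernel init doesn't skew timing
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        generate_fn(prompt_ids, new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    return (new_tokens * iters) / elapsed
```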
Prefill
Quantization | Kernel | Precision | Token/sec | VRAM, GB | Perplexity | Utilization, % |
---|---|---|---|---|---|---|
None | - | 16 | 8422 | 2.67 | 14.01 | 76 |
bnb.nf4 | - | 4 | 4813 | 1.42 | 13.95 | 95 |
gptq | triton | 4 | 3819 | 1.69 | 14.36 | 70 |
gptq | cuda_old | 4 | 5151 | 1.69 | 14.36 | 99 |
gptq | cuda | 4 | 4554 | 1.69 | 14.36 | 99 |
gptq | exllama | 4 | 7965 | 1.69 | 14.29 | 95 |
gptq | exllamav2 | 4 | 7872 | 1.69 | 14.29 | 94 |
gptq | marlin | 4 | 8560 | 1.68 | 14.17 | 75 |
New token generation
Quantization | Kernel | Precision | Token/sec | VRAM, GB | Utilization, % |
---|---|---|---|---|---|
None | - | 16 | 55.47 | 2.23 | 46 |
bnb.nf4 | - | 4 | 44.92 | 1.03 | 32 |
gptq | triton | 4 | 27.52 | 1.31 | 46 |
gptq | cuda_old | 4 | 46.17 | 1.31 | 30 |
gptq | cuda | 4 | 39.93 | 1.31 | 97 |
gptq | exllama | 4 | 57.38 | 1.31 | 35 |
gptq | exllamav2 | 4 | 57.04 | 1.31 | 28 |
gptq | marlin | 4 | 55.91 | 1.32 | 30 |
*Most likely these kernels are optimized for the A100, which might explain the unimpressive results and low utilization.
Here one can find benchmarks made by the HF team. They also show that the Marlin kernel turns out to be the fastest, though not as fast as expected.
> [!NOTE]
> The Marlin kernel only supports graphics cards with compute capability >= 8.0. Here one can find a table of graphics cards and their compute capabilities.
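A quick way to check whether the current GPU meets that requirement (plain PyTorch, not part of this PR):

```python
import torch

# Compute capability as a (major, minor) tuple, e.g. (8, 6) for an A10G.
major, minor = torch.cuda.get_device_capability()
if (major, minor) >= (8, 0):
    print(f"Compute capability {major}.{minor}: the Marlin kernel can be used.")
else:
    print(f"Compute capability {major}.{minor}: pick another kernel (e.g. exllamav2).")
```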
Caveats:
- It's not possible to run inference with GPTQ quantization and model compilation at the same time.