AutoGPTQ integration
Hi there 👋
This is a somewhat late response to #583.
The task itself turned out to be quite large, so to speed up the process (and simplify life for whoever reviews the PR) I decided to include only the basics: the code can quantize a model and run inference, and it supports all the AutoGPTQ kernels. The remaining AutoGPTQ functionality will be added in subsequent pull requests; a rough sketch of the basic flow is shown below the list.
This PR doesn't include:
- loading/uploading quantized weights to/from HF hub
- AWQ support (yes, AutoGPTQ supports even this)
- fused attention and MLP layers (looking forward to implementing it ~~no I don't~~)
- possibly something more, but I'm not sure we should integrate everything
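For orientation, this is roughly how the quantize → save → load → generate flow looks when driving the upstream AutoGPTQ library directly. It's a minimal sketch: the model id, output directory, and calibration text are placeholders, and the litgpt-side API introduced by this PR is not reproduced here.

```python
# Minimal sketch of the upstream AutoGPTQ flow; paths and model ids are placeholders.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_dir = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder checkpoint
quantized_dir = "tinyllama-1.1b-gptq-4bit"             # placeholder output directory

tokenizer = AutoTokenizer.from_pretrained(pretrained_dir, use_fast=True)

# Calibration data: a list of tokenized samples (real runs use many more of them).
examples = [tokenizer("GPTQ quantizes weights to 4 bit with minimal quality loss.")]

# Same settings as in the benchmark config below.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_dir, quantize_config)
model.quantize(examples)             # run the GPTQ algorithm layer by layer
model.save_quantized(quantized_dir)  # write the packed 4-bit weights

# Load the quantized checkpoint and generate; kernel selection flags
# (e.g. use_triton=True) depend on the installed AutoGPTQ version.
model = AutoGPTQForCausalLM.from_quantized(quantized_dir, device="cuda:0")
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```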
Benchmarks
Benchmarking was done on 1x A10G with the TinyLlama model (1.1B parameters).
Quantization config:
- 4-bit precision
- group_size of 128
- desc_act (act_order) disabled
There are two tables: one for the prefill stage and one for the new token generation stage.
The prefill stage was simulated by feeding the first 1024 samples from the Alpaca dataset into the model, one sample at a time, and averaging the results across them.
New token generation was measured by generating 100 new tokens, 100 times, using the default prompt from generate/base.py.
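For context, the token/sec numbers come from a loop along these lines. This is a simplified sketch, not the actual measurement script: `generate_fn` is a hypothetical stand-in for the model's generate call, and dataset/prompt loading is omitted.

```python
import time
import torch

def measure_tokens_per_sec(generate_fn, prompt_ids, new_tokens=100, iters=100):
    """Rough throughput measurement. `generate_fn` is a hypothetical stand-in for
    the model's generate call: it should produce `new_tokens` tokens for `prompt_ids`."""
    generate_fn(prompt_ids, new_tokens)  # warm-up so lazy kernel init doesn't skew timing
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        generate_fn(prompt_ids, new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    return (new_tokens * iters) / elapsed
```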
Prefill
Quantization | Kernel | Precision | Token/sec | VRAM, GB | Perplexity | Utilization, % |
---|---|---|---|---|---|---|
None | - | 16 | 8422 | 2.67 | 14.01 | 76 |
bnb.nf4 | - | 4 | 4813 | 1.42 | 13.95 | 95 |
gptq | triton | 4 | 3819 | 1.69 | 14.36 | 70 |
gptq | cuda_old | 4 | 5151 | 1.69 | 14.36 | 99 |
gptq | cuda | 4 | 4554 | 1.69 | 14.36 | 99 |
gptq | exllama | 4 | 7965 | 1.69 | 14.29 | 95 |
gptq | exllamav2 | 4 | 7872 | 1.69 | 14.29 | 94 |
gptq | marlin | 4 | 8560 | 1.68 | 14.17 | 75 |
New token generation
Quantization | Kernel | Precision | Token/sec | VRAM, GB | Utilization, % |
---|---|---|---|---|---|
None | - | 16 | 55.47 | 2.23 | 46 |
bnb.nf4 | - | 4 | 44.92 | 1.03 | 32 |
gptq | triton | 4 | 27.52 | 1.31 | 46 |
gptq | cuda_old | 4 | 46.17 | 1.31 | 30 |
gptq | cuda | 4 | 39.93 | 1.31 | 97 |
gptq | exllama | 4 | 57.38 | 1.31 | 35 |
gptq | exllamav2 | 4 | 57.04 | 1.31 | 28 |
gptq | marlin | 4 | 55.91 | 1.32 | 30 |
*Most likely these kernels are optimized for the A100, which might explain the unimpressive results and low utilization.
Here one can find benchmarks made by the HF team. They also show that the Marlin kernel turns out to be the fastest, though not as fast as expected.
> [!NOTE]
> The Marlin kernel only supports graphics cards with compute capability >= 8.0. Here one can find a table of graphics cards and their compute capabilities.
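A quick way to check whether the current GPU meets that requirement (plain PyTorch, not part of this PR):

```python
import torch

# Compute capability as a (major, minor) tuple, e.g. (8, 6) for an A10G.
major, minor = torch.cuda.get_device_capability()
if (major, minor) >= (8, 0):
    print(f"Compute capability {major}.{minor}: the Marlin kernel can be used.")
else:
    print(f"Compute capability {major}.{minor}: pick another kernel (e.g. exllamav2).")
```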
Caveats:
- It's not possible to run inference with GPTQ quantization and model compilation at the same time.