Introduction to Quantization
This project was originally conceived as a hack-day project for Deep Hack Mar 2023, but has since been extended into an introduction to Quantization. The corresponding talk was given at the GenerativeAI Apr Meetup.
Updated: 2023-04-27 10:05pm IST
Key Points of the talk
Basics
- Neural Networks are universal function approximators, and as such can learn almost anything
- What they learn gets stored as matrices (technically, tensors) of weights & biases. Both of these are usually floating point numbers.
- Floating point (FP) numbers can be represented in computers at different precisions (and thus take up different amounts of memory); e.g. a 32-bit floating point number is more precise than a 16-bit one (see the sketch after this list).
- GPUs are faster at matrix operations, and so can process neural networks much faster, but this requires the model to fit in GPU memory, as system memory access is much slower. Smaller (less precise) representations of FP numbers can make this possible.
- Smaller representations of FP numbers also require less compute, so the GPU can execute more such operations per cycle: a win-win.
- Quantization converts larger FP representations to smaller ones to reduce model size, so that the model fits in GPU memory and runs faster.
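As a quick illustration of the memory point above, here is a minimal NumPy sketch (the matrix shape and values are made up for illustration) showing how casting the same weight matrix from float32 to float16 halves its footprint at the cost of a small rounding error:

```python
import numpy as np

# A made-up 1024x1024 "weight matrix" in full precision
weights_fp32 = np.random.randn(1024, 1024).astype(np.float32)
weights_fp16 = weights_fp32.astype(np.float16)   # lower-precision copy

print(weights_fp32.nbytes / 1e6)   # ~4.19 MB
print(weights_fp16.nbytes / 1e6)   # ~2.10 MB: half the memory for the same matrix

# The price: a small, bounded rounding error per element
print(np.abs(weights_fp32 - weights_fp16.astype(np.float32)).max())
```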
Demo
- Despite reducing precision significantly, we see that the performance of the neural network remains largely the same during inference. This shows that DNNs are very resilient to small perturbations in their weights & biases.
- How can we leverage this to quantize better?
Uniform Quantization (INT8)
- We can uniformly map a range of FP numbers (say α to β) to 8-bit integers by projecting it onto the range -127 to 127 (the technique is detailed in the talk).
- Quantizing this way will obviously map a bunch of FP numbers to the same INT8 value.
- Dequantization is a straightforward reversal of the process, with the caveat that we don't recover the original number; we recover only the center of the range that got projected to that INT8 value.
- The talk describes the absmax variant (i.e. taking the maximum of the absolute values of α and β and using that symmetric range as the input to be projected). There's a zero-point variant as well, which is not described here. A minimal sketch of the absmax variant follows this list.
- Similarly, we can choose to project to INT4 (4-bit integers) rather than INT8.
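Below is a minimal NumPy sketch of absmax quantization and dequantization as described above (the function names and the bit-width parameter are our own, not from the talk):

```python
import numpy as np

def absmax_quantize(x: np.ndarray, bits: int = 8):
    """Uniformly map floats to signed integers using a single absmax scale."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8, 7 for INT4
    scale = np.abs(x).max() / qmax             # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def absmax_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Reverse the projection; the rounding error is not recoverable."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = absmax_quantize(w)
w_hat = absmax_dequantize(q, scale)
print(np.abs(w - w_hat).max())                 # worst-case error is about scale / 2
```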
Distribution of weights
- FP numbers in a weight matrix can be distributed in different ways. If we're not careful, a large number of weights can map to the same bucket during quantization, hurting accuracy & perplexity.
- Different techniques handle these distributions in different ways. For simple distributions, the absmax INT projection described above works, but not always.
- Doing the INT projection per row/column of the matrix is very often used (there are nuances to this that aren't described here).
- The LLM.int8() paper shows that at larger scales, outlier features show up which greatly affect the output (and so cannot be pruned). We cannot use row/column quantization here, because the feature dimensions (axes) lie orthogonal to the matrix product dimensions. The paper presents a method to handle these outliers.
- SmoothQuant takes a different approach to the same problem: it smooths the spikes in the input (X) by scaling them down by a factor, while scaling up the corresponding entries of the weight matrix (W). This effectively shifts the spikes from X to W, giving us a better range to work with during quantization. A sketch of this rescaling follows this list.
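Here is a minimal NumPy sketch of the SmoothQuant idea for a single linear layer Y = XW. The per-channel scale and the `alpha` migration-strength parameter follow the paper's formulation as we understand it; the shapes and values are made up:

```python
import numpy as np

def smooth(X: np.ndarray, W: np.ndarray, alpha: float = 0.5):
    """Divide each input channel of X by s and multiply the matching row of W
    by s, so that X_s @ W_s == X @ W while activation outliers shrink."""
    s = (np.abs(X).max(axis=0) ** alpha) / (np.abs(W).max(axis=1) ** (1 - alpha))
    return X / s, W * s[:, None]

X = np.random.randn(16, 8).astype(np.float32)
X[:, 3] *= 50.0                                  # simulate an outlier channel
W = np.random.randn(8, 4).astype(np.float32)

X_s, W_s = smooth(X, W)
print(np.allclose(X @ W, X_s @ W_s, atol=1e-3))  # the product is unchanged
print(np.abs(X).max(), np.abs(X_s).max())        # the activation range shrank
```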
We ran out of time to cover the remaining slides in the talk, but you can refer to the brief comments in the presenter notes of those slides.
Resources
- Links included in the slides
- Slide 1: This repo
- Slide 2: ggerganov/llama.cpp - running quantized LLaMA models on a MacBook
- Slide 6: An Image is worth 16x16 words - ViT (Vision Transformer) paper
- Slide 6: Scaling Laws for Neural Language Models
- Slide 8: Nvidia Hopper Architecture
- Slide 10: IEEE-754 Playground
- Slide 10: FP formats used in the DNN world
- Slide 11: A survey of Quantization Methods for Efficient Neural Network Inference
- Slide 15: LLM.int8() paper
- Slide 16: LLM.int8() paper
- Slide 17: SmoothQuant paper
- Slide 18: The case for 4-bit precision: k-bit scaling laws
- Slide 20: GPTQ paper
- Slide 21: A survey of Quantization Methods for Efficient Neural Network Inference
- Over-parametrization related
- Other papers:
- Up or Down? Adaptive Rounding for Post-Training Quantization - AdaRound paper
- Improving Post Training Neural Quantization: Layer-wise Calibration and Integer Programming - AdaQuant paper - A layer-by-layer optimization method that minimizes the error between the quantized layer output and the full-precision layer output.
- ZeroQuant paper
- Optimal Brain Compression
- Opening the Black Box of Deep Neural Networks via Information
The Demo - Studying the effect of reduced precision
In this demo, we use @hila-chefer's work on Transformer Explainability to empirically study the effect of reduced precision at inference time. The rather successful recent efforts to run llama.cpp on hardware as weak as a Raspberry Pi were a major motivation to explore this.
Approach
The IEEE 754 standard for floating point describes a float32 as 1 bit (sign), 8 bits (exponent), and 23 bits (fraction). We take a rather crude approach to simulating reduced precision: we simply truncate the n least significant bits of the fractional part to zero. We do this for all the parameters of the model.
Important caveats:
- While the model's precision was varied, computations were still being done as float32.
- The exponent part was not touched, so the range of the model weights remained the same.
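A minimal sketch of this truncation, assuming NumPy float32 arrays (the function name and the bit-twiddling are ours; the actual demo code may differ):

```python
import numpy as np

def truncate_mantissa(x: np.ndarray, n_bits: int) -> np.ndarray:
    """Zero out the n least significant bits of the 23-bit float32 fraction.
    Sign and exponent bits are untouched, so the representable range is unchanged."""
    assert x.dtype == np.float32 and 0 <= n_bits <= 23
    bits = x.view(np.uint32)                             # reinterpret the raw bits
    mask = np.uint32((0xFFFFFFFF >> n_bits) << n_bits)   # clears the lowest n bits
    return (bits & mask).view(np.float32)

w = np.random.randn(5).astype(np.float32)
print(w)
print(truncate_mantissa(w, 20))                          # only 3 fraction bits kept
```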
How to use
- pip install -r requirements.txt
- streamlit run app.py
Results
For every image we tested, the model remained resilient at inference time down to about 3 bits of fractional precision (which translates roughly to a single decimal place).