
Quantization Brainstorming

Open byte-6174 opened this issue 2 years ago • 23 comments

There are several experiments being done with this repo to understand and evaluate the effects of quantization on the llama2.c models.

It is a great test-bed to analyze the effects of varying approaches, as the model sizes here are easier to handle.

Here is what I have in this fork:

  • A simple showcase of symmetric quantization using int8_t to store the multipliers and one float each to store the maximum value of each layer type. Thus we have 13 floats and all other weights are stored as uint8_t. (A generic sketch of this kind of absmax quantization follows this list.)

  • The quantization is done with quantize.c, and the model can be run with runq.c with a command like: $ ./runq stories42M_Q8.bin -t 0.1 -n 256 -i "One day, Lily met a Shoggoth" -s 2 It outputs:

One day, Lily met a Shoggoth. She was so excited to meet him. She asked him, "What are you doing here?"
The Shogoth replied, "I'm here to help you learn about the world."
Lily was so happy to have a new friend. She asked, "What can you do?"
The Shogoth replied, "I can help you learn about the world. I can show you all the things that are different."
Lily was so excited to learn about the world. She thanked the Shogoth for being so helpful.
The Shogoth smiled and said, "You're welcome. I'm glad I could help you."
Lily was so happy to have a new friend. She knew that the Shogon was a very special friend.
achieved tok/s: 248.920863

The model sizes are reduced by 4x on disk. During inference, the weights are dequantized to floats, so there is no runtime speed-up (yet).

  • Additionally, I have a script that plots the statistics of the weights (you need to install gnuplot [brew install gnuplot on mac] to use it). This might be useful in deciding which layers are less susceptible to compression.
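For readers who just want the gist, here is a minimal sketch of this kind of per-tensor absmax quantization (illustrative names, not the actual code in quantize.c/runq.c):

#include <math.h>
#include <stdint.h>

/* Symmetric per-tensor quantization: one fp32 scale per weight tensor,
   every weight stored as a signed 8-bit multiple of that scale. */
typedef struct {
    float scale;   /* max(|w|) / 127 */
    int8_t *q;     /* n quantized weights, allocated by the caller */
} QTensor;

static void quantize_tensor(const float *w, int n, QTensor *out) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(w[i]);
        if (a > amax) amax = a;
    }
    out->scale = amax / 127.0f;
    float inv = out->scale != 0.0f ? 1.0f / out->scale : 0.0f;
    for (int i = 0; i < n; i++) {
        int v = (int)roundf(w[i] * inv);
        if (v >  127) v =  127;
        if (v < -127) v = -127;
        out->q[i] = (int8_t)v;
    }
}

/* Dequantize back to fp32 at inference time (4x smaller on disk, no speed-up). */
static void dequantize_tensor(const QTensor *t, float *w, int n) {
    for (int i = 0; i < n; i++) w[i] = t->q[i] * t->scale;
}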

Would love to hear feedback and other approaches. Note that the goal of this repo is "... to be the simplest, smallest, most hackable repo..".

Here are some notable forks as well:

  • https://github.com/atamurad/llama2.c/tree/quant
  • https://github.com/kroggen/llama2.c/tree/quantization-q8

[Please add yours..]

byte-6174 avatar Aug 12 '23 19:08 byte-6174

llama2.scala supports ggml-like q4_0 and q8 quantization, doing the quantization on the fly before inference (and it can also load ggml models that use these quantization types). q4 and q8 have similar speed (when optimized with AVX2 kernels similar to the ones in ggml), which is significantly faster than fp32 (probably due to more vector lanes and less memory access). The biggest benefit I see for q4 is obviously that you can load and run bigger models in the same amount of memory. One issue I noticed is that auto-vectorization stops working well for int8 (I suspect because the vpmaddubsw instruction, which does the bulk of the int8 matrix multiplication, involves saturation, which might not be easily expressed in non-vectorized C code).
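The int8 inner loop in ggml-style AVX2 kernels is usually built around that instruction; a sketch of the pattern (adapted from the approach used in ggml.c, names and surrounding details are illustrative):

#include <immintrin.h>

/* Multiply 32 int8 pairs and reduce into 8 int32 lanes.
   vpmaddubsw (_mm256_maddubs_epi16) wants one unsigned and one signed operand and
   saturates the 16-bit pair sums, which is the part that is hard to express in
   plain C in a way the auto-vectorizer recognizes. */
static inline __m256i mul_sum_i8_pairs(__m256i x, __m256i y) {
    const __m256i ax  = _mm256_sign_epi8(x, x);          /* |x| as unsigned bytes        */
    const __m256i sy  = _mm256_sign_epi8(y, x);          /* y with the sign of x applied */
    const __m256i dot = _mm256_maddubs_epi16(ax, sy);    /* u8*s8 pairs -> saturated i16 */
    return _mm256_madd_epi16(dot, _mm256_set1_epi16(1)); /* widen i16 pairs -> i32 sums  */
}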

jrudolph avatar Aug 13 '23 08:08 jrudolph

Hello @jrudolph, can you please help me understand the ggml matmul execution w.r.t. quantization? Is it input_float32 * Quantized_weights -> Output_float32, or Quantize(input_float32) * Quantized_Weights -> Output_float32 -> Quantize(Output_float32) for the next layer?

I am trying to come up with a very simple int4 and int8 quantization in C++ for llama2.c. My goal is to initially match the speed of ggml/llama.cpp at the quantization and execution level. Your help would be appreciated.

Nick-infinity avatar Aug 13 '23 09:08 Nick-infinity

For q4_0 and q8, the one-dimensional vector first needs to be quantized to q8. Then, do one block of vector products in int32 and fold the block scales into an f32 running sum. Then store the result for each row and continue with the next row.

See https://github.com/jrudolph/llama2.scala/blob/08c65d04c0a3a4345510db289779e3243bcf7ff9/shared/src/main/scala/net/virtualvoid/llama2/ScalaMathImplementation.scala#L70 as an example, assuming the one-dimensional vector has already been quantized.
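In C, the loop structure described above looks roughly like this (a sketch assuming a simple q8 block format with 32 elements and one fp32 scale per block; block_q8 and vec_dot_q8_q8 are illustrative names, not the ggml or llama2.scala code):

#include <stdint.h>

#define QK 32  /* elements per block */

typedef struct {
    float  d;        /* per-block scale */
    int8_t qs[QK];   /* quantized values */
} block_q8;

/* Dot product of one quantized weight row and the (already quantized) input vector:
   integer math inside each block, fp32 accumulation of the scaled block sums. */
static float vec_dot_q8_q8(const block_q8 *w, const block_q8 *x, int nblocks) {
    float sum = 0.0f;
    for (int b = 0; b < nblocks; b++) {
        int32_t acc = 0;
        for (int j = 0; j < QK; j++) {
            acc += (int32_t)w[b].qs[j] * (int32_t)x[b].qs[j];
        }
        sum += (float)acc * w[b].d * x[b].d;  /* apply both block scales */
    }
    return sum;
}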


jrudolph avatar Aug 13 '23 10:08 jrudolph

In general, I guess you can use whatever works. Doing the bulk of multiplications in int8 means that you can do more elements per instruction with vector instructions than with f32. Also, integer calculations may be faster than float depending on the architecture, etc.

jrudolph avatar Aug 13 '23 11:08 jrudolph

Thanks for your reply.

When you say one dimensional vector, are you talking about the input 1d vector ?

Sorry if this is a basic question.

Thanks


Nick-infinity avatar Aug 13 '23 13:08 Nick-infinity

If I understood it correctly, then it means that for matmul we have to quantize the input 1d array. I am wondering if the latency of quantizing this vector can surpass the gains of doing the product in int32.


Nick-infinity avatar Aug 13 '23 13:08 Nick-infinity

I am wondering if the latency of quantizing this vector can surpass the gains of doing the product in int32.

I guess it might for certain shapes. For llama, e.g., the single most expensive matmul is the output computation into the logits. There you have a weights matrix of dim x vocab, so let's say dim is 4096 and vocab is 32000. You have to do the input quantization for the 4096 elements of the input vector once, but then use it to multiply 32000 rows into a 32000-element output vector (32000 * 4096 multiplications). So, if there's a speed benefit to doing quantized calculations, it will amortize quickly.

For SIMD it is a requirement that the vector types line up, so you will have to do it. For regular types, I guess whether it makes sense might depend on the speed of the integer vs. float processing units in your CPU.

jrudolph avatar Aug 13 '23 14:08 jrudolph

@jrudolph for activation quantization, do you use data statistics? If so, what data is used? If no data is used, how are the activations calculated in your Scala implementation? Sorry, I have zero Scala experience, so reading the code is a little tough :)

byte-6174 avatar Aug 13 '23 15:08 byte-6174

@jrudolph for activation quantization, do you use data statistics? If so, what data is used? If no data is used, how are the activations calculated in your Scala implementation?

Not sure what you mean exactly. I just reused the way that llama.cpp does things. For each weights quantization type, they also define which quantization format to use for the activations, and then provide a vec_dot implementation that can multiply those two types (e.g. see https://github.com/ggerganov/llama.cpp/blob/ee77efea2a1e3f7d153976b0934522b6bbaa62e6/ggml.c#L1657-L1663).

For both q4_0 and q8 they use q8 for the activations. q4_0 and q8 work similarly in that each row is split into blocks of 32 elements; then the range is determined (by finding the maximum absolute value) and the values are linearly rescaled, centered around zero, from -max..max to -128..127 (for q8) or 0..15 (for q4_0). Then you keep those quantized values and a single (fp16 or fp32) scaling factor per block.

Is that what you mean by data statistics?
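A minimal sketch of that block-wise q8 quantization (block size 32, absmax scale, same illustrative block_q8 layout as the dot-product sketch earlier in the thread; an approximation of the ggml scheme, not the exact code):

#include <math.h>
#include <stdint.h>

#define QK 32

typedef struct {
    float  d;       /* block scale = max(|x|) / 127 */
    int8_t qs[QK];  /* quantized values */
} block_q8;

/* Quantize one row of k floats (k must be a multiple of QK) into q8 blocks. */
static void quantize_row_q8(const float *x, block_q8 *y, int k) {
    const int nb = k / QK;
    for (int i = 0; i < nb; i++) {
        float amax = 0.0f;
        for (int j = 0; j < QK; j++) {
            float a = fabsf(x[i*QK + j]);
            if (a > amax) amax = a;
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = d;
        for (int j = 0; j < QK; j++) {
            y[i].qs[j] = (int8_t)roundf(x[i*QK + j] * id);
        }
    }
}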

jrudolph avatar Aug 13 '23 20:08 jrudolph

So:

  • the weights are quantized once during model export
  • the data (activations) are quantized dynamically on demand during forward pass
  • However, I'd expect not all layers are quantized, only the matmul layers (?). E.g. rmsnorms are processed in higher precision (?). I haven't verified this; it's just what's done commonly in practice.

There are many other ways of doing quantization too. E.g. you can try to "calibrate" models by passing many batches through them and recording the activation ranges at all the layers at that time. These ranges are then used in the forward pass later, skipping the process of determining those ranges.
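To make the calibration idea concrete, a hedged sketch of what such a pass might record (hypothetical helpers, not taken from any of the repos linked here):

#include <math.h>

#define N_ACT_SITES 16  /* illustrative: one range per quantized activation site */

static float act_amax[N_ACT_SITES];  /* running max |activation| per site */

/* Called at every activation site during the calibration forward passes. */
static void calib_observe(int site, const float *a, int n) {
    for (int i = 0; i < n; i++) {
        float v = fabsf(a[i]);
        if (v > act_amax[site]) act_amax[site] = v;
    }
}

/* At inference time the recorded range gives a fixed int8 scale,
   replacing the on-the-fly max search over the activations. */
static float calib_scale(int site) {
    return act_amax[site] / 127.0f;
}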

karpathy avatar Aug 13 '23 21:08 karpathy

However, I'd expect not all layers are quantized, only the matmul layers (?). E.g. rmsnorms are processed in higher precision (?). I haven't verified this; it's just what's done commonly in practice.

Good point, I forgot about these. In the ggml q4_0 files for llama2, the norm weights are all stored in fp32, i.e. the ones corresponding to rms_att_weight, rms_ffn_weight, and rms_final_weight in llama2.c.

If you look at the table in https://huggingface.co/TheBloke/Llama-2-13B-GGML#provided-files, you can see that there are various ways to use different quantization types for different weights. https://github.com/ggerganov/llama.cpp/pull/1684 is the PR that introduced the latest set of quantization setups for llama.cpp and contains lots of information about the choices made.

There are many other ways of doing quantization too. E.g. you can try to "calibrate" models by passing many batches through them and recording the activation ranges at all the layers at that time. These ranges are then used in the forward pass later, skipping the process of determining those ranges.

For people interested: GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers has more info about such an approach.

jrudolph avatar Aug 14 '23 09:08 jrudolph

I'm trying to understand how it is done, at the base level, from the links you provided above. I understand that there are many complex mix-and-match strategies, like keeping certain layers at higher precision, etc. However, at the very basic level, the quantization that seems to be done in llama.cpp is q4_0. The code that actually does the q4_0 quantization is in ggml.c:

static void quantize_row_q4_0_reference(const float * restrict x, block_q4_0 * restrict y, int k) {
    static const int qk = QK4_0;
    assert(k % qk == 0);
    const int nb = k / qk;
    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max
        float max  = 0.0f; // signed value with the largest magnitude

        for (int j = 0; j < qk; j++) {
            const float v = x[i*qk + j];
            if (amax < fabsf(v)) {
                amax = fabsf(v);
                max  = v;
            }
        }
        const float d  = max / -8;              // per-block scale: the signed max maps to -8
        const float id = d ? 1.0f/d : 0.0f;     // inverse scale
        y[i].d = GGML_FP32_TO_FP16(d);          // store the block scale as fp16
        for (int j = 0; j < qk/2; ++j) {
            const float x0 = x[i*qk + 0    + j]*id;
            const float x1 = x[i*qk + qk/2 + j]*id;
            const uint8_t xi0 = MIN(15, (int8_t)(x0 + 8.5f)); // quantize to 0..15 (implicit offset of 8)
            const uint8_t xi1 = MIN(15, (int8_t)(x1 + 8.5f));
            y[i].qs[j]  = xi0;                  // pack two 4-bit values per byte:
            y[i].qs[j] |= xi1 << 4;             // low nibble = first half, high nibble = second half
        }
    }
}

This seems to do symmetric quantization in 4 bits (?).
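For completeness, the dequantization implied by that format is just the reverse mapping (a sketch mirroring what ggml's dequantize_row_q4_0 does; treat the details as illustrative):

static void dequantize_row_q4_0_sketch(const block_q4_0 * restrict x, float * restrict y, int k) {
    static const int qk = QK4_0;
    const int nb = k / qk;
    for (int i = 0; i < nb; i++) {
        const float d = GGML_FP16_TO_FP32(x[i].d);   // per-block scale
        for (int j = 0; j < qk/2; ++j) {
            const int x0 = (x[i].qs[j] & 0x0F) - 8;  // low nibble  -> first half of block
            const int x1 = (x[i].qs[j] >>   4) - 8;  // high nibble -> second half of block
            y[i*qk + 0    + j] = x0*d;
            y[i*qk + qk/2 + j] = x1*d;
        }
    }
}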

I don't see any of the GPTQ-type Hessian computation with data, and matrix inversion as described in the GPTQ paper above. Can you point me to where that is happening?

byte-6174 avatar Aug 14 '23 16:08 byte-6174

Sorry, I didn't want to imply that GPTQ-style quantization is done in llama.cpp. I'm not sure it is.

jrudolph avatar Aug 14 '23 17:08 jrudolph

got it, thanks..

byte-6174 avatar Aug 14 '23 18:08 byte-6174

@byte-6174 thanks for linking to my branch!

I wanted to add some details/results so far as my branch is draft and not documented yet.

Code structure:

  • QMatrix - a new 2D data structure to represent quantized weights, plus a qmatmul function to perform matrix-vector multiplication. I think it helps with code readability, as the transformer() function requires no changes at all.
  • File format - I wanted to experiment with different quantization methods for different layers/weights in one file/model. QMatrix has a 4-byte type tag to represent the quantization type - 'Q8_A', 'Q8_B', etc. - inspired by https://en.wikipedia.org/wiki/FourCC
  • quant.py - the quantization code is implemented in Python, and run.c only contains de-quantization code. This keeps it consistent with the project structure - right now all models are converted/trained/exported from Python.
  • The code is not optimized for speed, as I focused on getting the quantization and the model output right first.

Quantization methods:

  • Q8_A - block of 128 weights in a column. Only a scale (fp32) parameter and 128x 8-bit ints per block.
  • Q8_B - block of 128 weights in a row. A scale (fp32) and a mean (fp32) parameter and 128x 8-bit ints per block (a rough sketch of this layout follows this list).
  • Q4_A - block of 256 weights in a row. Weights are sorted and split into 16 equal bins, each containing 16 elements. The bin mean values (16 x fp32) and 256x 4-bit indexes packed into 128 bytes are exported.
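As an illustration, here is my reading of what the Q8_B layout might look like on the C side (hypothetical struct and function names, not the actual code in the branch):

#include <stdint.h>

#define QB 128  /* weights per block, as described above */

/* Assumed Q8_B layout: per-block scale and mean, plus 128 int8 values. */
typedef struct {
    float  scale;
    float  mean;
    int8_t qs[QB];
} block_q8_b;

/* Dequantize one block, assuming w ~= mean + scale * q. */
static void dequant_block_q8_b(const block_q8_b *b, float *w) {
    for (int j = 0; j < QB; j++) {
        w[j] = b->mean + b->scale * (float)b->qs[j];
    }
}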

Results summary

  • Stories110M model, Q4_A => 79MB model size, did not observe any degradation in output quality.
  • LLama2-7B-chat, Q8_A/Q8_B => 6.7GB model size, output is OK but slow as expected.
  • LLama2-7B-chat, Q4_A => 4.8GB model, didn't work, output is gibberish.

Sample output

./run llama2_7b_chat.q8_pooled -i "[INST] write a poem about math [/INST]" 
[INST] write a poem about math [/INST]  Sure! Here's a poem about MATH:

Math, the beat of life,
A rhythm so precise and true,
In every line, a code unbroken,
 Numbers that flow, like a river's tide,
Geometry of life, a equation so grand,
Squared, the truth unveiled,
The beauty of numbers, a cosmic, 

[truncated]

atamurad avatar Aug 15 '23 10:08 atamurad

@atamurad:

LLama2-7B-chat, Q8_A/Q8_B => 6.7GB model size, output is OK but slow as expected.

slower than float32 model?

byte-6174 avatar Aug 15 '23 15:08 byte-6174

Just a reminder that FlashAttention also reduces the amount of memory required.

It does not need the intermediate attention matrix for each head; instead, it computes by tiles and applies a trick on the softmax (it is computed in chunks, then normalized with correction factors).

This could be the right project for a simple implementation.
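The softmax trick mentioned above is the "online softmax" rescaling; a minimal sketch of just that piece (illustrative, not FlashAttention itself and not tied to run.c):

#include <float.h>
#include <math.h>

/* Streaming softmax statistics: process the scores chunk by chunk, keeping only a
   running max m and a running sum s of exp(score - m). When a new chunk raises the
   max, the sum accumulated so far is rescaled by exp(old_max - new_max). */
typedef struct { float m; float s; } OnlineSoftmax;

static void online_softmax_init(OnlineSoftmax *st) { st->m = -FLT_MAX; st->s = 0.0f; }

static void online_softmax_update(OnlineSoftmax *st, const float *scores, int n) {
    float m_new = st->m;
    for (int i = 0; i < n; i++) if (scores[i] > m_new) m_new = scores[i];
    float s_new = st->s * expf(st->m - m_new);  /* rescale the old partial sum */
    for (int i = 0; i < n; i++) s_new += expf(scores[i] - m_new);
    st->m = m_new;
    st->s = s_new;
}

/* The final probability of any score x is expf(x - st->m) / st->s. */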

kroggen avatar Aug 16 '23 03:08 kroggen

I wonder if FlashAttention even applies here? AFAIU, the big NxN matrix only turns up when evaluating the prompt in batches (in training, or when processing big input prompts during inference). In this project, tokens are currently processed one by one, so only one row of the attention matrix is ever materialized at once.

From playing with the smaller models (with limited context sizes) that are useful for CPU inference, it seems most of the time is spent evaluating the big matrix calculations for the FFN and logit computations, while attention only requires significant calculation when the context starts to get filled.

Correct me if I'm wrong, but it seems many of these optimizations target big models for serious productization. The sequential nature of deep neural nets plays against the strength of GPUs to massively parallelize, so to saturate GPUs one has to come up with strategies for making each calculation wider (batch processing of prompts, or evaluating multiple prompts at the same time) without exploding memory requirements (which is where FlashAttention seems to come into play).

jrudolph avatar Aug 16 '23 07:08 jrudolph

@jrudolph Yes, correct. Also, FlashAttention works by reducing IO between GPU HBM and SRAM, which doesn't apply here since it's CPU-only inference.

The author of the paper pointed it out here: https://github.com/Dao-AILab/flash-attention/issues/59

RahulSChand avatar Aug 16 '23 08:08 RahulSChand

@atamurad Is the following line in quantize_q8_a() correct?

scales[i][j] = np.max(np.abs(m[i:i+QK, j]))

I think it should be

scales[i][j] = np.max(np.abs(m[i*QK:(i+1)*QK, j]))

Maybe I am referring to the wrong repo?

mgrabban avatar Aug 16 '23 14:08 mgrabban

@mgrabban good catch, thank you!

I was wondering why Q8_A wasn't working for weights other than WQ, WK, WV, WO, and had switched those weights to Q8_B.

Why it worked for WQ, WK, WV, WO is probably that these all share almost the same (or very close) max value across all grouped rows.

atamurad avatar Aug 16 '23 14:08 atamurad

I've another data point to add: I had some success running a 4-bit quantized Llama2-7B-chat model with run.c.

Speedup is 10x compared to FP32 weights. 4-bit model file size: 4.3GB.

Quantization is based on AWQ.

Activations are in FP32, so only matmul has changed in run.c. I used AVX2 for dequantization + matrix multiplication. For 32-bit weights (only the final logit classifier), I also use https://github.com/karpathy/llama2.c/pull/269

repo: https://github.com/atamurad/llama2.c/tree/int4-avx2

Issues: long prompts/generation are affected by this bug in the HF export script: https://github.com/karpathy/llama2.c/pull/286#issuecomment-1679066644

atamurad avatar Aug 19 '23 02:08 atamurad

@atamurad Hello, I failed to export the model when I used export_awq.py. The error is like "KeyError: 'model.layers.0.mlp.gate_proj.qweight'". As I'm new to this process, I was wondering if you could provide any suggestions? Or could it be that the script is not up to date?

pluto-llf avatar Sep 24 '24 03:09 pluto-llf