llama.cpp
GPTQ Quantization (3-bit and 4-bit)
4-bit quantization tends to come at the cost of some output quality loss. GPTQ quantization is a state-of-the-art method that results in negligible output quality loss compared with prior state-of-the-art 4-bit (and 3-bit/2-bit) quantization methods, and even compared with uncompressed fp16 inference.
It would be good to see benchmarks of the existing implementation. It's possible that the current 4-bit quantization causes substantial quality loss; it's also possible that the loss is minor. We'd need benchmarks to know.
The related project GPTQ-for-LLaMA has some benchmarks available for their implementation.
References:
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- The case for 4-bit precision: k-bit Inference Scaling Laws
Related work: https://github.com/qwopqwop200/GPTQ-for-LLaMA/
The GitHub Issue for text-generation-webui's implementation of GPTQ-for-LLaMA may also be a helpful reference.
There’s no real benefit to using this method for 4-bit. Would be neat to see 3-bit or 2-bit attempts though.
Well, there is likely a minor, currently unknown (pending benchmarks) benefit to GPTQ for 4-bit, yes?
Additionally, once 4-bit GPTQ is implemented, 3-bit and 2-bit would not be much additional work, and could bring much larger reductions in VRAM/URAM consumption on top of the current 4-bit implementation, with potentially very little (if acceptable) output quality loss.
WebAssembly implementation is blocked pending 3-bit inference due to WASM's 4GB memory constraint.
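Rough sizing to illustrate the constraint (my own estimate, assuming the current 4-bit blocks cost about 5 bits per weight once the per-block fp32 scale is counted, and a 3-bit variant with the same block overhead about 4 bits per weight):
7B LLaMA, ~6.7B params x 5 bits / 8 ≈ 4.2 GB of weights alone, i.e. essentially the whole 4GB WASM address space before the KV cache and activations
~6.7B params x 4 bits / 8 ≈ 3.4 GB, which leaves some headroom under 4GB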
I had a quick glance at the GPTQ paper yesterday, but haven't dug into details yet.
Do you think it is possible to demonstrate a simple routine for performing quantization using this method? For example, what is the most trivial way (it does not need to be optimal) to implement a function like this:
// src - input 32-bit floats
// dst - output quantized data
// n - number of input floats
void quantize_gptq(float * src, void * dst, int n);
If I can get a prototype of this and it does not look too complex, I can try to plug it into ggml.
The main challenge will be to implement it efficiently with SIMD, but I need to see some initial implementation to work on.
@zoidbb or @qwopqwop200 might have an answer for the question above.
The actual quantization algorithm (spread across that file and another) seems to be a little hairy, and uses some nontrivial linear algebra (Cholesky decomposition) that we'd either have to reimplement or pull in another dependency for (LAPACK?).
However, if I read the CUDA kernels that they have for evaluation correctly, the format in which the quantized weights wind up is more or less equivalent to the Q4_1 mode (4-bit quantization with a blockwise f16 zero offset) that we already have support for, though there currently is no AVX2 implementation for that mode. Since someone has uploaded precomputed GPTQ weights to Huggingface, it might be worthwhile to start out by implementing the right form of accelerated Q4_1 evaluation, plus some code to directly convert the Huggingface pickles into an appropriate format for us.
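For reference, a rough sketch of what a Q4_1-style block boils down to (a sketch only: the struct name, field names and scalar widths here are my guesses, not checked against ggml's current definitions):
#include <cstdint>
// One block covers QK = 32 weights: a scale, a zero offset ("min"), and 32 packed nibbles.
// With two f16 scalars this works out to (32*4 + 2*16) / 32 = 5 bits per weight;
// with f32 scalars it would be 6 bits per weight.
struct block_q4_1_sketch {
    uint16_t d;           // scale, stored as raw f16 bits: step between adjacent levels
    uint16_t m;           // zero offset, raw f16 bits: the value nibble 0 maps back to
    uint8_t  qs[32 / 2];  // 32 quantized weights, two 4-bit values per byte
};
// Dequantization of weight i: x[i] ≈ d * q[i] + m, with q[i] in 0..15.
// A GPTQ checkpoint with a per-group scale and zero point could be repacked into this layout.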
https://twitter.com/NolanoOrg/status/1635409631530057728
Q4_0 mode with GPTQ will be worse for 7B than the current round-to-nearest quantization approach. However, in other cases it's better (only tested up to 13B models).
In Q4_1 and 13B it can not only reduce RAM (by changing the bin size QK from 32 to something higher, like 128), but also improve performance.
Also, a 3-bit 13B GPTQ model will perform better than the 7B model at FP16.
Disclaimer - these were observed on a small subset of WikiText and Penn TreeBank (following GPTQ).
Also, the above has results on both Q4_0 and Q4_1. GPTQ can be implemented for Q4_0 by hardcoding the zero offset at https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/9bbfdeda2c80c20d8bc1babf4d4368df574afe30/quant.py#L6
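To make the Q4_0 / Q4_1 distinction concrete, here is a minimal round-to-nearest sketch for a single block of QK weights (illustrative only, nibble packing omitted; not ggml's actual code). Q4_0 hardcodes the zero offset and keeps only a scale, while Q4_1 also stores a per-block minimum, which is what a learned zero offset would map onto:
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int QK = 32;

// Q4_0-style: symmetric around zero, one scale per block, zero offset fixed at 0.
void quantize_block_q4_0(const float *x, uint8_t *q, float &d) {
    float amax = 0.0f;
    for (int i = 0; i < QK; ++i) amax = std::max(amax, std::fabs(x[i]));
    d = amax / 7.0f;                               // map the largest magnitude to +/-7
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    for (int i = 0; i < QK; ++i) {
        int v = (int)std::round(x[i] * id);        // plain round-to-nearest
        q[i] = (uint8_t)(std::clamp(v, -8, 7) + 8);
    }
}

// Q4_1-style: one scale plus one zero offset (the block minimum) per block.
void quantize_block_q4_1(const float *x, uint8_t *q, float &d, float &m) {
    float lo = x[0], hi = x[0];
    for (int i = 1; i < QK; ++i) { lo = std::min(lo, x[i]); hi = std::max(hi, x[i]); }
    m = lo;
    d = (hi - lo) / 15.0f;                         // 16 levels between min and max
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    for (int i = 0; i < QK; ++i) {
        int v = (int)std::round((x[i] - m) * id);
        q[i] = (uint8_t)std::clamp(v, 0, 15);
    }
}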
I don't have a deep understanding of the GPTQ quantization algorithm. However, for the quantization itself, refer to the following: GPTQ Quantizer and llama_sequential. For model inference, you can refer to: vecquant4matmul_cuda, VecQuant4MatMulKernel, and QuantLinear. The inference code has been implemented to replace each Linear layer with QuantLinear.
According to this paper, 3-bit or 2-bit quantization is not a very good idea.
Also, unlike GPTQ, RTN can perform poorly on LLMs.
That does not consider grouping/binning with group sizes smaller than 64, nor data-dependent quantization with weight reconstruction, which is already being used now (QK=32, i.e. bins of size 32). https://arxiv.org/pdf/2206.09557.pdf, https://arxiv.org/pdf/2210.17323.pdf and GPTQ itself (table 4, last row; table 6) all show otherwise: lower-than-4-bit quantization is promising when grouping/binning is used.
RTN is the one currently being used and it is the go-to baseline (not the best, but decent for int4). But empirically, when the zero offset is fixed, GPTQ does worse than RTN on the 7B LLaMA.
I'm curious what your actual benchmark results were. A handful of use cases are blocked pending fitting 7B inference into 4GB of RAM, including LLaMA in WebAssembly (which has 32-bit addressing and a 4GB max address space) and LLaMA on the Raspberry Pi.
Aside from that, the benefits of GPTQ seem to grow with model size, potentially letting both 16GB and 32GB devices move up an entire model size, on top of increasingly better performance.
Depends on what you mean by benchmark:
- If you mean perplexity/loss, then you can find a graph and a link to the Google Doc containing the table in the blog post: https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and
- If you mean running time, then that is still pending for the int-3 quantization and the 4-bit quantization with bin size 128.
I should be out with at least one of the two (the int-3 quantization, or the 4-bit kernels and their running times) by tomorrow and will share the code once I am done.
Those 3bit graphs look better than I expected actually. This is quite promising. Thanks for your contributions @Ayushk4!
I'm curious if anyone has tried LLaMA on an 8GB RPi. If not, I might be the first.
Please post your results in the issue for that: https://github.com/ggerganov/llama.cpp/issues/58 Lots of people will be interested to see how it goes and potentially help you work out how to get the best performance.
That does not consider grouping/binning, which is already being used now (QK=32, i.e. bins of size 32). https://arxiv.org/pdf/2206.09557.pdf, https://arxiv.org/pdf/2210.17323.pdf and GPTQ (table 4, last row; table 6) all show otherwise: lower-than-4-bit quantization is promising when grouping/binning is used.
It sure looks interesting. The reason I didn't consider grouping is that I failed to implement grouping in CUDA. Implementing grouping clearly seems to have practical advantages.
I just committed in some code to get AVX2-accelerated Q4_1 inference into a new branch.
Since I've never written any sort of SIMD code before, this was a bit of a learning experience, and I can't guarantee its optimality. As it stands, I think it's somewhere around 50% slower than Q4_0 on the 7B model (I get 300-400 ms/token, whereas I had 200-300 ms/token on Q4_0).
However, it's still significantly faster than the unvectorised implementation that was there before, which was more in the region of a second or two per token, and in fact seems to make the difference between Q4_1 being academic and tolerable.
@ggerganov @Const-me Care to take a look?
@blackhole89 You can probably combine xsum/ysum into a single vector, like this:
__m256i xSumInt = _mm256_sad_epu8( bx, _mm256_setzero_si256() );
__m256i ySumInt = _mm256_sad_epu8( by, _mm256_setzero_si256() );
__m256i sumInt = _mm256_or_si256( xSumInt, _mm256_slli_si256( ySumInt, 4 ) );
// Even lanes = xSumInt, odd lanes = ySumInt
__m256 sum = _mm256_cvtepi32_ps( sumInt );
Similarly, combine the multipliers:
const __m256 scale_01 = _mm256_blend_ps( scale_0, scale_1, 0b10101010 );
These instructions are cheap, but they are saving an _mm256_cvtepi32_ps and an _mm256_fmadd_ps, which are slightly more expensive. Might become slightly faster overall.
Another thing: you should definitely remove the * QK from that line: acc_offset += (*m0)*(*m1)*QK; Instead, apply that * QK multiplier after the loop.
Scaling floats by powers of 2 is a lossless operation, anyway.
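In other words, something along these lines (just a sketch of the loop shape; apart from acc_offset and QK, the names here are made up and not the branch's actual code):
constexpr int QK = 32;  // block size, as elsewhere in the thread

// Hypothetical shape of the offset accumulation after the change.
float dot_offsets(const float *m0, const float *m1, int nblocks) {
    float acc_offset = 0.0f;
    for (int i = 0; i < nblocks; ++i) {
        acc_offset += m0[i] * m1[i];   // was: m0[i] * m1[i] * QK inside the loop
    }
    return acc_offset * QK;            // apply the power-of-two factor once, exactly
}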
BTW, note there's std::fmaf in the standard library, in the <cmath> header.
Also, you're loading from these float pointers multiple times to get the same values. I'm pretty sure recent versions of clang are going to optimize these away, but there are also older, less optimal compilers out there. I would rather load only once from each pointer with _mm256_broadcast_ss, then use _mm256_cvtss_f32 to accumulate the scalars.
Thanks! I implemented almost all of these more or less as you said. (I still accumulate the scalars, though I wondered if lifting up the multiplication to be near the AVX2 load would have some kind of cache advantages. That's way above my paygrade though.) Hard to tell how much faster it got, since I had a lot of jitter.
(Does std::fmaf convey a significant benefit over scalar multiply-add, assuming decent optimisations?)
@blackhole89 About the scalar FMA: compilers only do that automatically when the -ffast-math (gcc, clang) or /fp:fast (vc++) switch is specified. Most people don't specify these switches when compiling their C++. By default, compilers implement IEEE 754 semantics, which is why they can't optimize a += b * c; into a = fma( b, c, a );. The issue with std::fma is that unless you're compiling for processors with FMA3, the function is much slower than a multiply+add combo. However, in this particular function the surrounding code already requires FMA3 support, because it uses the _mm256_fmadd_ps intrinsic.
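A tiny illustration of the difference (my example, not code from the branch):
#include <cmath>

// Without -ffast-math / /fp:fast the compiler must keep the two separate roundings of the
// multiply and the add, so it is not allowed to contract this into a fused multiply-add:
float madd_plain(float a, float b, float c) { return a + b * c; }

// std::fmaf rounds once; it is fast only when the target has FMA3 (e.g. -mfma, which any
// code using _mm256_fmadd_ps already requires), otherwise it calls a slow library fallback.
float madd_fused(float a, float b, float c) { return std::fmaf(b, c, a); }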
I don't know much C++, but I asked GPT-4 to look at the Python GPTQ code and then modify the existing llama.cpp quantization code, and this is what it came up with. It might be useful (although it does require a dependency, the Eigen library, for the linear algebra):
#include <algorithm> // for std::max / std::min
#include <cassert>
#include <cmath>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <iostream>
#include <map>
#include <regex>
#include <string>
#include <vector>
#include <Eigen/Cholesky>
#include <Eigen/Core>
// TODO: move somewhere else
#define QK 32
// Quantize matrix using the GPTQ-like method
Eigen::MatrixXf quantize_gptq(const Eigen::MatrixXf &mat, int k) {
int rows = mat.rows();
int cols = mat.cols();
// Build a covariance-like matrix from the weights and factor it: cov = L * L^T
Eigen::MatrixXf mat_t = mat.transpose();
Eigen::MatrixXf cov = mat_t * mat / rows;
Eigen::LLT<Eigen::MatrixXf> llt(cov);
Eigen::MatrixXf L = llt.matrixL();
Eigen::MatrixXf L_inv = L.inverse();
// Transform by L^{-1}, then round and clamp to the signed k-bit range
Eigen::MatrixXf Q = mat * L_inv;
Eigen::MatrixXf Q_rounded = Q.unaryExpr([](float v) { return std::round(v); });
Eigen::MatrixXf Q_clamped = Q_rounded;
for (int i = 0; i < Q_clamped.rows(); ++i) {
for (int j = 0; j < Q_clamped.cols(); ++j) {
float q_min = -(1 << (k - 1));
float q_max = (1 << (k - 1)) - 1;
Q_clamped(i, j) = std::max(q_min, std::min(q_max, Q_clamped(i, j)));
}
}
Eigen::MatrixXf W_quantized = Q_clamped * L;
return W_quantized;
}
int main() {
// Test the quantize_gptq function with a simple example
Eigen::MatrixXf W(4, 4);
W << 1.0, 0.5, 0.3, 0.1, 0.5, 1.0, 0.3, 0.6, 0.3, 0.3, 1.0, 0.8, 0.1, 0.6, 0.8, 1.0;
std::cout << "Original Matrix W:\n" << W << std::endl;
Eigen::MatrixXf W_quantized = quantize_gptq(W, 4);
std::cout << "Quantized Matrix W:\n" << W_quantized << std::endl;
return 0;
}
Another example, showing how it suggests implementing it (let me know if these are wrong and I'll delete them! Just trying to be helpful and I don't want to pollute the thread with useless stuff!):
#include "ggml.h"
#include "utils.h"
#include <cassert>
#include <cmath>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <map>
#include <string>
#include <vector>
#include <regex>
#include <Eigen/Dense>
using namespace Eigen;
// ... (existing code, same as before)
// Cholesky decomposition
MatrixXd cholesky_decomposition(MatrixXd &matrix) {
LLT<MatrixXd> llt(matrix);
return llt.matrixL();
}
// Cholesky inverse
MatrixXd cholesky_inverse(MatrixXd &matrix) {
MatrixXd L = cholesky_decomposition(matrix);
MatrixXd Linv = L.inverse();
return Linv.transpose() * Linv;
}
// ... (existing code, same as before)
bool llama_model_quantize(const std::string & fname_inp, const std::string & fname_out, int itype) {
// ... (existing code, same as before)
if (quantize) {
// ... (existing code, same as before)
// Apply Cholesky decomposition and Cholesky inverse
MatrixXd m_f32(ne[0], ne[1]);
for (int i = 0; i < ne[0]; ++i) {
for (int j = 0; j < ne[1]; ++j) {
m_f32(i, j) = data_f32[i * ne[1] + j];
}
}
MatrixXd m_chol_inv = cholesky_inverse(m_f32);
for (int i = 0; i < ne[0]; ++i) {
for (int j = 0; j < ne[1]; ++j) {
data_f32[i * ne[1] + j] = m_chol_inv(i, j);
}
}
// ... (existing quantization code, same as before)
}
// ... (existing code, same as before)
}
// ... (existing code, same as before)
For a version that only uses the stdlib:
#include <iostream>
#include <vector>
#include <cmath>
#include <cassert>
#include <execution>
#include <numeric>   // for std::transform_reduce
#include <algorithm> // for std::max
// Perform Cholesky decomposition on a square matrix
std::vector<double> cholesky_decomposition(const std::vector<double> &matrix, size_t size) {
std::vector<double> L(size * size, 0);
for (size_t i = 0; i < size; ++i) {
for (size_t j = 0; j <= i; ++j) {
double sum = std::transform_reduce(std::execution::par_unseq,
L.begin() + i * size,
L.begin() + i * size + j,
L.begin() + j * size,
0.0,
std::plus<>(),
std::multiplies<>());
if (i == j) {
L[i * size + j] = std::sqrt(matrix[i * size + i] - sum);
} else {
L[i * size + j] = (1.0 / L[j * size + j]) * (matrix[i * size + j] - sum);
}
}
}
return L;
}
std::vector<double> lower_triangular_inverse(const std::vector<double> &L, size_t size) {
std::vector<double> Linv(size * size, 0);
for (size_t i = 0; i < size; ++i) {
Linv[i * size + i] = 1.0 / L[i * size + i];
for (size_t j = i + 1; j < size; ++j) {
// Linv[j][i] = -(1 / L[j][j]) * sum_{k=i}^{j-1} L[j][k] * Linv[k][i]
// (the Linv factor walks down column i, so a plain strided loop is used here)
double sum = 0.0;
for (size_t k = i; k < j; ++k) {
sum += L[j * size + k] * Linv[k * size + i];
}
Linv[j * size + i] = -sum / L[j * size + j];
}
}
return Linv;
}
// Compute Cholesky inverse: matrix^{-1} = (L * L^T)^{-1} = Linv^T * Linv
std::vector<double> cholesky_inverse(const std::vector<double> &matrix, size_t size) {
std::vector<double> L = cholesky_decomposition(matrix, size);
std::vector<double> Linv = lower_triangular_inverse(L, size);
// inverse[i][j] = sum_k Linv[k][i] * Linv[k][j]; since Linv is lower triangular,
// only k >= max(i, j) contributes
std::vector<double> inverse(size * size, 0);
for (size_t i = 0; i < size; ++i) {
for (size_t j = 0; j < size; ++j) {
double sum = 0.0;
for (size_t k = std::max(i, j); k < size; ++k) {
sum += Linv[k * size + i] * Linv[k * size + j];
}
inverse[i * size + j] = sum;
}
}
return inverse;
}
Can you fork llama.cpp and see if you can get one of these working with GPT-4's help? I've found that it is very good at troubleshooting and writing entire programs, especially when you have access to a 32k-token context window instance.
Hello Georgi! Thank you for your excellent work. There are 4 models that already have GPTQ quantization: https://rentry.org/llama-tard-v2 https://huggingface.co/maderix/llama-65b-4bit/tree/main Maybe you could just convert them into your format?
I did not expect the comments to contain generated code suggestions.