gemma.cpp
[Feature request] Add quantization methods
It would be awesome if the repo supported quantization methods. Reference: k-quants
Waiting for a quantized model, +1.
Understood. The -sfp models use 8-bit weights, but I realize people are interested in more aggressive quantization.
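For readers new to the topic, here is a minimal sketch of symmetric per-tensor int8 quantization, in the spirit of the uniform integer schemes requested above. This is only an illustration: gemma.cpp's -sfp weights use their own 8-bit encoding rather than plain int8, and the function names here are made up for the example.

```cpp
// Minimal sketch of symmetric per-tensor int8 quantization.
// Illustrative only; not gemma.cpp's actual weight format.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize: map each float to round(x / scale), clamped to [-127, 127].
std::vector<int8_t> QuantizeInt8(const std::vector<float>& weights,
                                 float& scale) {
  float max_abs = 0.0f;
  for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));
  scale = max_abs / 127.0f;
  if (scale == 0.0f) scale = 1.0f;  // all-zero tensor: avoid divide-by-zero
  std::vector<int8_t> q(weights.size());
  for (size_t i = 0; i < weights.size(); ++i) {
    const float v = std::round(weights[i] / scale);
    q[i] = static_cast<int8_t>(std::clamp(v, -127.0f, 127.0f));
  }
  return q;
}

// Dequantize: x is approximately q * scale.
float DequantizeInt8(int8_t q, float scale) { return q * scale; }
```

Per-block (rather than per-tensor) scales, as in k-quants, reduce the error from outlier weights at the cost of storing one scale per block.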
BTW, if the goal is just to decrease the memory footprint, there was a commit that makes the KV cache preallocation smaller and configurable: https://github.com/google/gemma.cpp/commit/129e66ada2b4e461bdf28b88b70cd2465cb213e4 - but I get that the benefits of aggressive quantization go beyond that.
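To make the footprint point concrete, here is a back-of-the-envelope sketch of how preallocated KV cache size scales with the sequence-length cap. All the model dimensions below are hypothetical placeholders, not gemma.cpp's actual configuration (which lives in the commit linked above).

```cpp
// Back-of-the-envelope KV cache sizing; every number here is a
// hypothetical placeholder, not gemma.cpp's real configuration.
#include <cstddef>
#include <cstdio>

int main() {
  const size_t layers = 28;           // hypothetical layer count
  const size_t kv_heads = 16;         // hypothetical KV head count
  const size_t head_dim = 256;        // hypothetical head dimension
  const size_t bytes_per_elem = 4;    // float32
  const size_t seq_len_full = 8192;   // full preallocation
  const size_t seq_len_small = 1024;  // smaller, configurable cap

  // Two tensors (K and V) per layer, one entry per token position.
  auto kv_bytes = [&](size_t seq_len) {
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len;
  };
  std::printf("full:  %.2f GiB\n", kv_bytes(seq_len_full) / double(1ull << 30));
  std::printf("small: %.2f GiB\n", kv_bytes(seq_len_small) / double(1ull << 30));
  return 0;
}
```

With these placeholder dimensions, capping the preallocation at 1024 tokens instead of 8192 shrinks the cache from roughly 7 GiB to under 1 GiB, which is why making the cap configurable helps even without touching the weights.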
Working on a list of priorities + call-for-contributions, will post more soon.
FYI, we do support an experimental 4.5-bit quantization method (NUQ), but those weights are not available on Kaggle. We can more easily support this once we are able to ingest other weight formats (#11).
An update on this: we now have the ability to import PyTorch weights. Work is still ongoing on evaluating the nonuniform 4.5-bit format.
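The NUQ details aren't spelled out in this thread, so the following is a generic sketch of what a nonuniform, table-based scheme can look like: fit a small codebook to the weights (here via 1-D k-means, a common choice) and store 4-bit indices; per-group codebook metadata is what can push the effective cost toward ~4.5 bits per weight. All function names are hypothetical, and this is not a description of NUQ itself.

```cpp
// Generic sketch of table-based nonuniform quantization: each weight is
// replaced by the index of the nearest entry in a small codebook.
// This is NOT gemma.cpp's NUQ format, just an illustration of the idea.
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>
#include <vector>

// 1-D k-means (Lloyd's algorithm) fitting a 16-entry codebook, i.e. 4-bit
// indices. Assumes w is nonempty.
std::array<float, 16> FitCodebook(const std::vector<float>& w, int iters = 20) {
  auto [lo, hi] = std::minmax_element(w.begin(), w.end());
  std::array<float, 16> c;
  for (int k = 0; k < 16; ++k)  // initialize evenly across the value range
    c[k] = *lo + (*hi - *lo) * k / 15.0f;
  for (int it = 0; it < iters; ++it) {
    std::array<double, 16> sum{};
    std::array<int, 16> cnt{};
    for (float x : w) {  // assign each weight to its nearest centroid
      int best = 0;
      for (int k = 1; k < 16; ++k)
        if (std::fabs(x - c[k]) < std::fabs(x - c[best])) best = k;
      sum[best] += x;
      cnt[best]++;
    }
    for (int k = 0; k < 16; ++k)  // move centroids to cluster means
      if (cnt[k] > 0) c[k] = float(sum[k] / cnt[k]);
  }
  return c;
}

// Encode each weight as a 4-bit index into the codebook.
std::vector<uint8_t> Encode(const std::vector<float>& w,
                            const std::array<float, 16>& c) {
  std::vector<uint8_t> idx(w.size());
  for (size_t i = 0; i < w.size(); ++i) {
    int best = 0;
    for (int k = 1; k < 16; ++k)
      if (std::fabs(w[i] - c[k]) < std::fabs(w[i] - c[best])) best = k;
    idx[i] = uint8_t(best);
  }
  return idx;
}
```

The appeal of a nonuniform scheme over uniform int4 is that the codebook can place more levels where the weight distribution is dense, at the cost of a table lookup on the decode path.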
I'm increasingly concerned about uniform integer quantization in the style of k-quants. Recent work such as https://arxiv.org/pdf/2407.03211 points out that human raters detect much more harm than automated metrics suggest, especially in non-English languages, even for int8. Another paper also reports concerns after human evals, apparently also with int8.