
Support BitNet b1.58 ternary models

Open igorbarshteyn opened this issue 11 months ago • 75 comments

New paper just dropped on arXiv describing a way to train models in 1.58 bits (with ternary values: 1, 0, -1). The paper shows performance improvements over equivalently-sized fp16 models, and perplexity nearly equal to fp16 models. The authors state that their test model is built on the LLaMA architecture and can be easily adapted to llama.cpp.

[Edited to add: Further reading into it by fellow Redditors shows that we can't use this to quantize existing models trained to fp16. They'd have to be trained in this ternary mode from the start. But I think it would still be something that we should implement, because models of that flavor will be coming soon.]

This is all over Reddit's r/LocalLLaMA right now:

https://www.reddit.com/r/LocalLLaMA/comments/1b21bbx/this_is_pretty_revolutionary_for_the_local_llm/

I think, if my napkin math is right, it would let us run something like 120B models in 24 GB VRAM, or 30B in... 8 GB?
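
(Rough weights-only arithmetic behind that guess, ignoring KV cache and activations: 120B parameters × 1.58 bits ≈ 23.7 GB, and 30B × 1.58 bits ≈ 5.9 GB.)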

Please implement @ggerganov and friends!

https://arxiv.org/abs/2402.17764

igorbarshteyn avatar Feb 28 '24 09:02 igorbarshteyn

Wow, that is indeed very promising. And quite different from the current quant approaches too. It seems like instead of quantizing models post-training, it quantizes them during training. I am sure, though, that if this approach proves successful, model trainers like Jon Durbin, Teknium and Eric Hartford will jump in quickly.

Aside from the obvious benefits during inference, in theory this could also allow much higher quality LoRA training at a lower memory cost. You could theoretically train on GGUF models, but that is generally not recommended as quality suffers too much compared to an fp16 model, so it seems this approach would help in that regard as well.

@ikawrakow What do you think about this paper?

Dampfinchen avatar Feb 28 '24 11:02 Dampfinchen

Well, I have been wondering for a while why nobody is training quantized models directly. Given how close we can come to the performance of the fp16 model with relatively simple means, it is kind of obvious that one should be able to get the same performance as fp16 if one trained a quantized model directly.

Having said that, I stopped reading this particular paper at Table 1. In what sense is a 2.22 GB model a 1-bit version of 3B parameters? 2.22 GB is larger than a 2-bit quantized 7B LLaMA, not to mention the much higher perplexity. They say that a 70B model will be 4.1X smaller than fp16, so the dream of running 120B models on 24GB GPUs is not quite there yet. Forgive me if I'm missing something, but I have become allergic to LLM revolutions and new eras proclaimed every other day on arXiv (or HF), so I find it very hard to make myself read these revolutionary papers more carefully.

ikawrakow avatar Feb 28 '24 12:02 ikawrakow

Let's wait till they post the actual code up... then maybe it will be more clear :)

igorbarshteyn avatar Feb 28 '24 12:02 igorbarshteyn

Please implement @ggerganov and friends!

Is there something to implement on the inference side? Seems like it's just the training method that is different. The produced model (be it 1-bit, 2-bit or N-bit) should be possible to infer as usual, correct?

But I share @ikawrakow's sentiment - let's wait and see first

ggerganov avatar Feb 28 '24 12:02 ggerganov

:) Yes. Let's wait till the authors' code is up. Really hoping this is going to be the way of the future :)

igorbarshteyn avatar Feb 28 '24 12:02 igorbarshteyn

Having said that, I stopped reading this particular paper at Table 1. In what sense is a 2.22 GB model a 1-bit version of 3B parameters? 2.22 GB is larger than a 2-bit quantized 7B LLaMA, not to mention the much higher perplexity. They say that a 70B model will be 4.1X smaller than fp16, so the dream of running 120B models on 24GB GPUs is not quite there yet. Forgive me if I'm missing something, but I have become allergic to LLM revolutions and new eras proclaimed every other day on arXiv (or HF), so I find it very hard to make myself read these revolutionary papers more carefully.

As I understand it, the figures in that table are not meant to represent the model size, but the actual GPU memory usage during inference. So those 2.22 GB include the KV cache. Given it's LLaMA without GQA, I would imagine that being quite big.

Dampfinchen avatar Feb 28 '24 13:02 Dampfinchen

The IQ1_S quantization uses exactly that: ternary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement new code, if necessary.

If we do get meaningful trained quantized models, I would be finally able to retire from contributing quantization methods to llama.cpp :-)

ikawrakow avatar Feb 28 '24 13:02 ikawrakow

Let's pray they used 256 block size 😄

ggerganov avatar Feb 28 '24 13:02 ggerganov

No, I'm actually hoping the hidden dimension is from the Fibonacci sequence. So we finally get a ggml that does not use blocks 😄

ikawrakow avatar Feb 28 '24 14:02 ikawrakow

Well, I have been wondering for a while why nobody is training quantized models directly. Given how close we can come to the performance of the fp16 model with relatively simple means, it is kind of obvious that one should be able to get the same performance as fp16 if one trained a quantized model directly.

Having said that, I stopped reading this particular paper at Table 1. In what sense is a 2.22 GB model a 1-bit version of 3B parameters? 2.22 GB is larger than a 2-bit quantized 7B LLaMA, not to mention the much higher perplexity. They say that a 70B model will be 4.1X smaller than fp16, so the dream of running 120B models on 24GB GPUs is not quite there yet. Forgive me if I'm missing something, but I have become allergic to LLM revolutions and new eras proclaimed every other day on arXiv (or HF), so I find it very hard to make myself read these revolutionary papers more carefully.

The models presented in these papers are not quantized. They are using ternary parameters (-1, 0, 1) not quantization, so it's a full-sized model. So, I don't think expectations for the size of quantized models would apply in this case. Either way, we'll know when they release code.
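
Side note on sizes, since "1.58 bits" keeps coming up: an ideal ternary encoding needs log2(3) ≈ 1.585 bits per weight, and a simple base-3 packing gets 1.6 bits in practice by fitting five ternary values into one byte (3^5 = 243 ≤ 256). A minimal sketch of that packing, with hypothetical helper names, not from the paper or llama.cpp:

```cpp
#include <array>
#include <cstdint>

// Pack five ternary weights {-1, 0, +1} into one byte using base-3
// (3^5 = 243 <= 256), i.e. 8/5 = 1.6 bits per weight.
uint8_t pack5(const std::array<int8_t, 5> &w) {
    uint8_t packed = 0;
    for (int i = 4; i >= 0; --i) {
        packed = packed * 3 + static_cast<uint8_t>(w[i] + 1); // map {-1,0,1} -> {0,1,2}
    }
    return packed;
}

// Inverse: recover the five ternary values from one packed byte.
std::array<int8_t, 5> unpack5(uint8_t packed) {
    std::array<int8_t, 5> w{};
    for (int i = 0; i < 5; ++i) {
        w[i] = static_cast<int8_t>(packed % 3) - 1; // map {0,1,2} -> {-1,0,1}
        packed /= 3;
    }
    return w;
}
```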

jetro30087 avatar Feb 28 '24 16:02 jetro30087

Well, I have been wondering for a while why nobody is training quantized models directly. Given how close we can come to the performance of the fp16 model with relatively simple means, it is kind of obvious that one should be able to get the same performance as fp16 if one trained a quantized model directly.

I think the reason for that is because the Nvidia GPUs that all those companies are using are designed and intended for native fp16 operations. I mean it was fp32 before, then it was discovered that fp16 has negligible performance loss so they started using that. Now they're working on fp8 as well.

Also our quants do some sort of scaling operation to convert the compressed integer weights back into their original floats, with the same scaling factor applied across a group of weights to save space. It's easy to compute that going from an fp16 model into a q4_0, but I'm not really sure how to optimally do this backwards in the training phase. This paper seems to be not quantizing/dequantizing the model during training; rather, the model itself is literally built with ternary weights and 8-bit activations. And nothing's stopping companies from making a special AI processor that does the calculations using this approach if it works well.
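
For illustration, here's a rough sketch of that q4_0-style dequantization idea (simplified; the real ggml block stores the scale as fp16 and the exact layout differs):

```cpp
#include <cstdint>

// Simplified q4_0-style block: one scale shared by 32 weights,
// each weight stored as a 4-bit integer (two per byte).
struct BlockQ4 {
    float   d;       // per-block scale factor
    uint8_t qs[16];  // 32 quantized weights, packed two per byte
};

// Recover approximate float weights: w ≈ (q - 8) * d, with q in [0, 15].
void dequantize_block(const BlockQ4 &b, float out[32]) {
    for (int i = 0; i < 16; ++i) {
        const int q_lo =  b.qs[i] & 0x0F;        // low nibble
        const int q_hi = (b.qs[i] >> 4) & 0x0F;  // high nibble
        out[i]      = (q_lo - 8) * b.d;
        out[i + 16] = (q_hi - 8) * b.d;
    }
}
```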

netrunnereve avatar Feb 28 '24 17:02 netrunnereve

Well, I have been wondering for a while why nobody is training quantized models directly

Same, and I was also hoping that given 3-4 bit weights, it might reduce the solution surface so dramatically that we might even drop the backprop nonsense entirely and use something else... (for the pretraining; for fine-tuning it kinda makes sense, because you want just a little nudge, not a dramatic change)

If 1-2 bit is feasible, then this might again change the problem space, and maybe we could go straight to random evolutionary algorithms or something like that. I wonder why nobody has tried that (and I hope it's not because I'm an idiot).

cztomsik avatar Feb 28 '24 18:02 cztomsik

Well, I have been wondering for a while why nobody is training quantized models directly. Given how close we can come to the performance of the fp16 model with relatively simple means, it is kind of obvious that one should be able to get the same performance as fp16 if one trained a quantized model directly.

I think the reason for that is because the Nvidia GPUs that all those companies are using are designed and intended for native fp16 operations. I mean it was fp32 before, then it was discovered that fp16 has negligible performance loss so they started using that. Now they're working on fp8 as well.

Also our quants do some sort of scaling operation to convert the compressed integer weights back into their original floats, with the same scaling factor applied across a group of weights to save space. It's easy to compute that going from an fp16 model into a q4_0, but I'm not really sure how to optimally do this backwards in the training phase. This paper seems to be not quantizing/dequantizing the model during training; rather, the model itself is literally built with ternary weights and 8-bit activations. And nothing's stopping companies from making a special AI processor that does the calculations using this approach if it works well.

If this pans out, we should see everyone switching to it and throwing 10 times more parameters in the model. Plus NVIDIA should take notice of this.

errorsandwarnings avatar Feb 28 '24 18:02 errorsandwarnings

Designing hardware around pure adders seems so damn juicy, god damn that would be so insanely fast.

Gobz avatar Feb 28 '24 18:02 Gobz

[image] Did some simple linear regression from the data in the paper, I hope their data is legit

Gobz avatar Feb 28 '24 20:02 Gobz

Those addition-only matrix operations are brilliant. This could be so fast in the future with dedicated ASICs.

@igorbarshteyn could you clean up the title of this issue a bit though? Maybe just something like:

Support BitNet b1.58 ternary models

EwoutH avatar Feb 28 '24 21:02 EwoutH

Done @EwoutH

igorbarshteyn avatar Feb 28 '24 21:02 igorbarshteyn

[image] Did some simple linear regression from the data in the paper, I hope their data is legit

Nice table, thank you for the demonstration. The cool thing is that these figures are for inference with outdated non-GQA models, so with modern GQA models the VRAM usage would be even smaller than what's listed here.
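
(For anyone following along, a rough rule of thumb, not from the paper: KV cache size ≈ 2 × n_layers × n_kv_heads × head_dim × context_length × bytes_per_element, so GQA shrinks it simply by reducing n_kv_heads, e.g. 64 query heads sharing 8 KV heads cuts that term by 8×.)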

Dampfinchen avatar Feb 28 '24 22:02 Dampfinchen

Code will be populated here when it's ready:

https://github.com/microsoft/unilm/tree/master/bitnet

igorbarshteyn avatar Feb 29 '24 03:02 igorbarshteyn

The IQ1_S quantization uses exactly that: ternary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement new code, if necessary.

If we do get meaningful trained quantized models, I would be finally able to retire from contributing quantization methods to llama.cpp :-)

Have you run any benchmarks? Obviously after the fact 1-2bit quantization will be terrible, but I'm curious. I'm also interested in any methods folks have to "improve" the quantized model after generation. Some sort of student-teacher distillation should be possible right?

kinchahoy avatar Feb 29 '24 04:02 kinchahoy

The IQ1_S quantization uses exactly that: ternary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement new code, if necessary. If we do get meaningful trained quantized models, I would be finally able to retire from contributing quantization methods to llama.cpp :-)

Have you run any benchmarks? Obviously after the fact 1-2bit quantization will be terrible, but I'm curious. I'm also interested in any methods folks have to "improve" the quantized model after generation. Some sort of student-teacher distillation should be possible right?

What are you talking about? These are not quants. It's a model trained at 1.58 bits instead of FP16.

sorasoras avatar Feb 29 '24 05:02 sorasoras

The IQ1_S quantization uses exactly that: ternary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement new code, if necessary. If we do get meaningful trained quantized models, I would be finally able to retire from contributing quantization methods to llama.cpp :-)

Have you run any benchmarks? Obviously after the fact 1-2bit quantization will be terrible, but I'm curious. I'm also interested in any methods folks have to "improve" the quantized model after generation. Some sort of student-teacher distillation should be possible right?

What are you talking about? These are not quants. It's a model trained at 1.58 bits instead of FP16.

If you actually read the bit I quoted, you'd realize that the amazing ikawrakow notes that we have a ternary quantization implementation (IQ1_S) and I was asking him what the results look like for (as I put it) "after the fact" quantization (which is obviously different from this paper). I was also asking if there are more sophisticated quantization methods available that might help low bit quantizations work better.

Obviously, this particular model is trained as a ternary model, but if it's possible for a ternary model to succeed from scratch, then it's not unreasonable to think that there should be better 1.6 bit quantizations possible for existing models via distillation techniques.

For what it's worth, I did find some benchmarks, and they are shockingly bad at 1.6 bit ... so, yeah I'm very interested in helping explore distillation methods to improve quantization of existing models, while we all eagerly await this new from scratch model.

kinchahoy avatar Feb 29 '24 05:02 kinchahoy

The IQ1_S quantization uses exactly that: ternary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement new code, if necessary. If we do get meaningful trained quantized models, I would be finally able to retire from contributing quantization methods to llama.cpp :-)

Have you run any benchmarks? Obviously after the fact 1-2bit quantization will be terrible, but I'm curious. I'm also interested in any methods folks have to "improve" the quantized model after generation. Some sort of student-teacher distillation should be possible right?

What are you talking about? These are not quants. It's a model trained at 1.58 bits instead of FP16.

If you actually read the bit I quoted, you'd realize that the amazing ikawrakow notes that we have a ternary quantization implementation (IQ1_S) and I was asking him what the results look like for (as I put it) "after the fact" quantization (which is obviously different from this paper). I was also asking if there are more sophisticated quantization methods available that might help low bit quantizations work better.

Obviously, this particular model is trained as a ternary model, but if it's possible for a ternary model to succeed from scratch, then it's not unreasonable to think that there should be better 1.6 bit quantizations possible for existing models via distillation techniques.

For what it's worth, I did find some benchmarks, and they are shockingly bad at 1.6 bit ... so, yeah I'm very interested in helping explore distillation methods to improve quantization of existing models, while we all eagerly await this new from scratch model.

It's probably easier to train the model with data output from a better F16 model (SPIN). I imagine it would be difficult to do distillation via quant, because the imatrix is already sort of like a distillation; it's hard to do better for now.

sorasoras avatar Feb 29 '24 06:02 sorasoras

The IQ1_S quantization uses exactly that: ternary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement new code, if necessary. If we do get meaningful trained quantized models, I would be finally able to retire from contributing quantization methods to llama.cpp :-)

Have you run any benchmarks? Obviously after the fact 1-2bit quantization will be terrible, but I'm curious. I'm also interested in any methods folks have to "improve" the quantized model after generation. Some sort of student-teacher distillation should be possible right?

What are you talking about? These are not quants. It's a model trained at 1.58 bits instead of FP16.

If you actually read the bit I quoted, you'd realize that the amazing ikawrakow notes that we have a ternary quantization implementation (IQ1_S) and I was asking him what the results look like for (as I put it) "after the fact" quantization (which is obviously different from this paper). I was also asking if there are more sophisticated quantization methods available that might help low bit quantizations work better. Obviously, this particular model is trained as a ternary model, but if it's possible for a ternary model to succeed from scratch, then it's not unreasonable to think that there should be better 1.6 bit quantizations possible for existing models via distillation techniques. For what it's worth, I did find some benchmarks, and they are shockingly bad at 1.6 bit ... so, yeah I'm very interested in helping explore distillation methods to improve quantization of existing models, while we all eagerly await this new from scratch model.

It's probably easier to train the model with data output from a better F16 model (SPIN). I imagine it would be difficult to do distillation via quant, because the imatrix is already sort of like a distillation; it's hard to do better for now.

Yeah, I agree! Though broadly I think of SPIN as one of a class of teacher-student distillation techniques. Either way, this should be possible and has incredible potential. I really don't see the community investing in training cutting-edge 60B+ parameter <2 bit models, so we really need to find clever ways to extract the right weights starting from successful fp16 models.

kinchahoy avatar Feb 29 '24 06:02 kinchahoy

These papers might be a practical approach for existing model conversion: [Token-Scaled Logit Distillation for Ternary Weight Generative Language Models](https://openreview.net/forum?id=FUnEkOkodU) and [Binary and Ternary Natural Language Generation](https://huggingface.co/papers/2306.01841)

WebsiteInc avatar Feb 29 '24 06:02 WebsiteInc

The 1-bit idea in the Bitnet paper (https://arxiv.org/abs/2310.11453) has been adopted in this recent 1-bit quantization paper (https://arxiv.org/abs/2402.11295).

tuyen-huynh avatar Feb 29 '24 11:02 tuyen-huynh

Hey everyone - I'm looking forward to fully implementing this on HVM. It is an interaction-net based runtime which sometimes yields counter-intuitive speedups on certain algorithms. I think there is a non-zero chance it'd be able to speed up transformers asymptotically, so I'd like to try. The only barrier to doing that was the usage of floats, but if we can actually implement training with ints only, then it applies. If anyone ever implements this, or if the code for this paper is published, please let me know so I can port it to HVM :)

VictorTaelin avatar Feb 29 '24 12:02 VictorTaelin

These papers might be a practical approach for existing model conversion: [Token-Scaled Logit Distillation for Ternary Weight Generative Language Models](https://openreview.net/forum?id=FUnEkOkodU) and [Binary and Ternary Natural Language Generation](https://huggingface.co/papers/2306.01841)

These are a great resource, thank you! Might try something with them this weekend.

kinchahoy avatar Feb 29 '24 22:02 kinchahoy

You guys are all sorcerers. I deeply appreciate whatever blood pact was required to wield this arcane wizardry.

CamiloMM avatar Mar 01 '24 02:03 CamiloMM

Is there something to implement on the inference side? Seems like it's just the training method that is different. The produced model (be it 1-bit, 2-bit or N-bit) should be possible to infer as usual, correct?

@ggerganov The key implementation difference with ternary weights is that you get to do away with the multiplication altogether; computing the dot product is just a matter of conditional (or bit-masked SIMD) summations and one subtraction at the end.
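
A minimal scalar sketch of that idea, assuming int8 activations and weights stored as -1/0/+1 in int8 for clarity (illustrative only, not actual llama.cpp code; a real kernel would use packed weights and SIMD):

```cpp
#include <cstddef>
#include <cstdint>

// Ternary dot product: no multiplications, just conditional accumulation
// into "positive" and "negative" sums and one subtraction at the end.
// Any per-tensor weight scale would be applied to the result afterwards.
int32_t ternary_dot(const int8_t *w, const int8_t *x, size_t n) {
    int32_t pos = 0, neg = 0;
    for (size_t i = 0; i < n; ++i) {
        if (w[i] == 1) {
            pos += x[i];
        } else if (w[i] == -1) {
            neg += x[i];
        }
        // w[i] == 0 contributes nothing
    }
    return pos - neg;
}
```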

nickovs avatar Mar 01 '24 03:03 nickovs