llama.cpp
Support BitNet b1.58 ternary models
New paper just dropped on arXiv describing a way to train models in 1.58 bits (with ternary weights: -1, 0, 1). The paper shows performance gains over equivalently-sized fp16 models, and perplexity nearly equal to fp16. The authors state that their test model is built on the LLaMA architecture and can be easily adapted to llama.cpp.
[Edited to add: further reading by fellow Redditors shows that we can't use this to quantize existing models trained in fp16. They'd have to be trained in this ternary mode from the start. But I think it is still something we should implement, because models of that flavor will be coming soon.]
This is all over Reddit /r/LocalLLaMA right now:
https://www.reddit.com/r/LocalLLaMA/comments/1b21bbx/this_is_pretty_revolutionary_for_the_local_llm/
I think, if my napkin math is right, it would let us run something like 120B models in 24 GB VRAM, or 30B in... 8 GB?
Please implement @ggerganov and friends!
https://arxiv.org/abs/2402.17764
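For reference, here is the napkin math spelled out as a weights-only estimate; this ignores the KV cache, activations and any tensors kept at higher precision, so real memory usage would be somewhat higher:

```cpp
#include <cstdio>

int main() {
    // Weights-only estimate for ternary models: ~1.58 bits per parameter
    // (log2(3) ~= 1.585). Ignores KV cache, activations and any fp16 tensors.
    const double bits_per_param = 1.58;
    const double params[] = {30e9, 70e9, 120e9};  // example model sizes
    for (double p : params) {
        const double gib = p * bits_per_param / 8.0 / (1024.0 * 1024.0 * 1024.0);
        std::printf("%4.0fB params -> ~%.1f GiB of weights\n", p / 1e9, gib);
    }
    return 0;
}
```

That puts 120B at roughly 22 GiB and 30B at roughly 5.5 GiB of weights alone.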
Wow, that is indeed very promising. And quite different from the current quant approaches too. It seems that instead of quantizing models post-training, it quantizes them during training. I am sure, though, that if this approach proves successful, model trainers like Jon Durbin, Teknium and Eric Hartford will jump in quickly.
Aside from the obvious benefits during inference, in theory this could also allow much higher quality LoRA training at a lower memory cost? You could theoretically train on GGUF models, but that is generally not recommended, as quality suffers too much compared to an fp16 model, so it seems this approach would help in that regard as well.
@ikawrakow What do you think about this paper?
Well, I have been wondering for a while why nobody is training quantized models directly. Given how close we can come to the performance of the fp16 model with relatively simple means, it is kind of obvious that one should be able to get the same performance as fp16 if one trained a quantized model directly.

Having said that, I stopped reading this particular paper at Table 1. In what sense is a 2.22 GB model a 1-bit version of 3B parameters? 2.22 GB is larger than a 2-bit quantized 7B LLaMA, not to mention the much higher perplexity. They say that a 70B model will be 4.1X smaller than fp16, so the dream of running 120B models on 24 GB GPUs is not quite there yet. Forgive me if I'm missing something, but I have become allergic to LLM revolutions and new eras proclaimed every other day on arXiv (or HF), so I find it very hard to make myself read these revolutionary papers more carefully.
Let's wait till they post the actual code up... then maybe it will be more clear :)
Please implement @ggerganov and friends!
Is there something to implement on the inference side? Seems like it's just the training method that is different. The produced model (be it 1-bit, 2-bit or N-bit) should be possible to infer as usual, correct?
But I share @ikawrakow's sentiment - let's wait and see first
:) Yes. Let's wait till the authors' code is up. Really hoping this is going to be the way of the future :)
Having said that, I stopped reading this particular paper at Table 1. In what sense is a 2.22 GB model a 1-bit version of 3B parameters? 2.22 GB is larger than a 2-bit quantized 7B LLaMA, not to mention the much higher perplexity.
As I understand it, the figures in that table are not meant to represent the model size, but the actual GPU memory usage during inference. So those 2.22 GB include the KV cache. Given it's LLaMA without GQA, I would imagine that being quite big.
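To make that concrete, here is a rough sketch of what the KV cache alone could add for a ~3B non-GQA model. The layer count, hidden size and context length below are assumptions for illustration, not numbers from the paper:

```cpp
#include <cstdio>

int main() {
    // Hypothetical dimensions for a ~3B non-GQA model -- NOT the paper's numbers.
    const long long n_layers = 26;
    const long long n_embd   = 3200;    // full multi-head attention: K/V width == n_embd
    const long long n_ctx    = 2048;
    const long long bytes_per_elem = 2; // fp16 cache

    // K and V per layer, for every position in the context
    const long long kv_bytes = 2 * n_layers * n_ctx * n_embd * bytes_per_elem;
    std::printf("KV cache: ~%.2f GiB\n", kv_bytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```

Even with these made-up dimensions the cache is a sizable chunk on top of the packed ternary weights, which would help explain the reported memory being well above the raw weight size.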
The IQ1_S quantization uses exactly that: ternary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement something new, if necessary.

If we do get meaningful trained quantized models, I would finally be able to retire from contributing quantization methods to llama.cpp :-)
Let's pray they used 256 block size 😄
No, I'm actually hoping the hidden dimension is from the Fibonacci sequence. So we finally get a ggml that does not use blocks 😄
In what sense is a 2.22 GB model a 1-bit version of 3B parameters? 2.22 GB is larger than a 2-bit quantized 7B LLaMA, not to mention the much higher perplexity. They say that a 70B model will be 4.1X smaller than fp16, so the dream of running 120B models on 24 GB GPUs is not quite there yet.
The models presented in these papers are not quantized. They use ternary parameters (-1, 0, 1) rather than quantization, so it's a full-sized model. So I don't think expectations for the size of quantized models apply in this case. Either way, we'll know when they release the code.
Well, I have been wondering for a while why nobody is training quantized models directly. Given how close we can come to the performance of the fp16 model with relatively simple means, it is kind of obvious that one should be able to get the same performance as fp16 if one trained a quantized model directly.
I think the reason for that is that the Nvidia GPUs all those companies are using are designed and intended for native fp16 operations. I mean, it was fp32 before, then it was discovered that fp16 has negligible performance loss, so they started using that. Now they're working on fp8 as well.
Also, our quants do some sort of scaling operation to convert the compressed integer weights back into their original floats, with the same scaling factor applied across a group of weights to save space. It's easy to compute that going from an fp16 model to q4_0, but I'm not really sure how to optimally do this backwards in the training phase. This paper seems not to be quantizing/dequantizing the model during training; rather, the model itself is literally built with ternary weights and 8-bit activations. And nothing's stopping companies from making a special AI processor that does the calculations using this approach, if it works well.
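For anyone unfamiliar with that scaling step, here is a minimal sketch of block-scaled quantization, roughly q4_0-like in spirit; the block size of 32 and the rounding scheme are illustrative, not the exact llama.cpp kernel or data layout:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative block-scaled quantization: each block of 32 weights shares one
// float scale and stores the weights as small signed integers (the real q4_0
// format packs them as 4-bit nibbles).
struct BlockQ4 {
    float  scale;
    int8_t q[32];
};

static BlockQ4 quantize_block(const float *w) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::fmax(amax, std::fabs(w[i]));
    BlockQ4 b;
    b.scale = amax / 7.0f;  // map the largest magnitude onto +/-7
    const float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;
    for (int i = 0; i < 32; ++i) b.q[i] = (int8_t) std::lround(w[i] * inv);
    return b;
}

static void dequantize_block(const BlockQ4 &b, float *out) {
    for (int i = 0; i < 32; ++i) out[i] = b.q[i] * b.scale;  // back to floats
}

int main() {
    std::vector<float> w(32);
    for (int i = 0; i < 32; ++i) w[i] = std::sin(0.3f * i);  // dummy weights
    const BlockQ4 b = quantize_block(w.data());
    float back[32];
    dequantize_block(b, back);
    std::printf("w[5] = % .4f  reconstructed = % .4f\n", w[5], back[5]);
    return 0;
}
```

Going forward (fp16 into blocks of small ints plus a scale) is cheap; it's the training-time direction, where the weights need useful gradients, that is the hard part.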
Well, I have been wondering for a while why nobody is training quantized models directly
Same, and I was also hoping that given 3-4 bit weights, it might reduce the solution surface so dramatically that we might even drop the backprop nonsense entirely and use something else... (for the pretraining; for fine-tuning it kinda makes sense, because you want just a little nudge, not a dramatic change).

If 1-2 bit is feasible, then this might again change the problem space, and maybe we could go straight to random evolutionary algorithms or something like that. I wonder why nobody has tried that (and I hope it's not because I'm an idiot).
This paper seems not to be quantizing/dequantizing the model during training; rather, the model itself is literally built with ternary weights and 8-bit activations. And nothing's stopping companies from making a special AI processor that does the calculations using this approach, if it works well.
If this pans out, we should see everyone switching to it and throwing 10 times more parameters in the model. Plus NVIDIA should take notice of this.
Designing hardware around pure adders seems so damn juicy, god damn that would be so insanely fast.
Did some simple linear regression from the data in the paper, I hope their data is legit
Those addition-only matrix operations are brilliant. This could be so fast in the future with dedicated ASICs.
@igorbarshteyn could you clean up the title of this issue a bit though? Maybe just something like:
Support BitNet b1.58 ternary models
Done @EwoutH
Did some simple linear regression from the data in the paper, I hope their data is legit
Nice table, thank you for the demonstration. The cool thing is that these figures are from inference with outdated non-GQA models, so with modern GQA models the VRAM usage would be even smaller than what's listed here.
Code will be populated here when they are ready:
https://github.com/microsoft/unilm/tree/master/bitnet
The IQ1_S quantization uses exactly that: ternary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement something new, if necessary. If we do get meaningful trained quantized models, I would finally be able to retire from contributing quantization methods to llama.cpp :-)
Have you run any benchmarks? Obviously, after-the-fact 1-2 bit quantization will be terrible, but I'm curious. I'm also interested in any methods folks have to "improve" the quantized model after generation. Some sort of student-teacher distillation should be possible, right?
Have you run any benchmarks? Obviously, after-the-fact 1-2 bit quantization will be terrible, but I'm curious. I'm also interested in any methods folks have to "improve" the quantized model after generation. Some sort of student-teacher distillation should be possible, right?
What are you talking about? These are not quants. It's a model trained with 1.58-bit weights instead of fp16.
The IQ1_S quantization uses exactly that: ternary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement something new, if necessary.

What are you talking about? These are not quants. It's a model trained with 1.58-bit weights instead of fp16.
If you actually read the bit I quoted, you'd realize that the amazing ikawrakow notes that we have a ternary quantization implementation (IQ1_S), and I was asking him what the results look like for (as I put it) "after the fact" quantization, which is obviously different from this paper. I was also asking if there are more sophisticated quantization methods available that might help low-bit quantizations work better.

Obviously, this particular model is trained as a ternary model, but if it's possible for a ternary model to succeed from scratch, then it's not unreasonable to think that there should be better 1.6-bit quantizations possible for existing models via distillation techniques.

For what it's worth, I did find some benchmarks, and they are shockingly bad at 1.6 bit... so, yeah, I'm very interested in helping explore distillation methods to improve quantization of existing models, while we all eagerly await this new from-scratch model.
Obviously, this particular model is trained as a ternary model, but if it's possible for a ternary model to succeed from scratch, then it's not unreasonable to think that there should be better 1.6-bit quantizations possible for existing models via distillation techniques. For what it's worth, I did find some benchmarks, and they are shockingly bad at 1.6 bit... so, yeah, I'm very interested in helping explore distillation methods to improve quantization of existing models, while we all eagerly await this new from-scratch model.
It's probably easier to train a model on data output from a better fp16 model (SPIN). I imagine it would be difficult to do distillation via quants, because imatrix is already sort of like a distillation; it's hard to do better than that for now.
It's probably easier to train a model on data output from a better fp16 model (SPIN). I imagine it would be difficult to do distillation via quants, because imatrix is already sort of like a distillation; it's hard to do better than that for now.
Yeah, I agree! Though broadly I think of SPIN as one of the class of teacher-student distillation techniques. Either way, this should be possible, and it has incredible potential. I really don't see the community investing in training cutting-edge 60B+ parameter <2-bit models, so we really need to find clever ways to extract the right weights starting from successful fp16 models.
These papers might be a practical approach for existing model conversion: [Token-Scaled Logit Distillation for Ternary Weight Generative Language Models](https://openreview.net/forum?id=FUnEkOkodU), [Binary and Ternary Natural Language Generation](https://huggingface.co/papers/2306.01841)
The 1-bit idea in the Bitnet paper (https://arxiv.org/abs/2310.11453) has been adopted in this recent 1-bit quantization paper (https://arxiv.org/abs/2402.11295).
Hey everyone - I'm looking forward to fully implementing this on HVM. It is an interaction-net based runtime which sometimes yields counter-intuitive speedups on certain algorithms. I think there is a non-zero chance it'd be able to speed up transformers asymptotically, so I'd like to try. The only barrier to doing that was the use of floats, but if we can actually implement training with ints only, then it applies. If anyone ever implements this, or if the code for this paper is published, please let me know so I can port it to HVM :)
These papers might be a practical approach for existing model conversion: [Token-Scaled Logit Distillation for Ternary Weight Generative Language Models](https://openreview.net/forum?id=FUnEkOkodU), [Binary and Ternary Natural Language Generation](https://huggingface.co/papers/2306.01841)
These are a great resource thank you! Might try something with them this weekend.
You guys are all sorcerers. I deeply appreciate whatever blood pact was required to wield this arcane wizardry.
Is there something to implement on the inference side? Seems like it's just the training method that is different. The produced model (be it 1-bit, 2-bit or N-bit) should be possible to infer as usual, correct?
@ggerganov The key implementation difference with ternary weights is that you get to do away with the multiplication altogether; computing the dot product is just a matter of conditional (or bit-masked SIMD) summations and one subtraction at the end.
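A minimal scalar sketch of what that looks like; the per-row scale parameter and the int8 activations are assumptions about how the format might be laid out, and a real kernel would unpack 2-bit weight codes and use SIMD masks instead of branches:

```cpp
#include <cstdint>
#include <cstdio>

// Ternary dot product: with weights restricted to -1, 0, +1 the "multiply"
// degenerates into adding each activation to one of two running sums (or
// skipping it), followed by a single subtraction at the end.
static float ternary_dot(const int8_t *w, const int8_t *x, int n, float scale) {
    int32_t sum_pos = 0;  // activations paired with +1 weights
    int32_t sum_neg = 0;  // activations paired with -1 weights
    for (int i = 0; i < n; ++i) {
        if (w[i] > 0)      sum_pos += x[i];
        else if (w[i] < 0) sum_neg += x[i];
        // w[i] == 0 contributes nothing
    }
    return scale * (float) (sum_pos - sum_neg);  // hypothetical per-row scale
}

int main() {
    const int8_t w[8] = { 1, 0, -1, 1, 1, 0, -1, 0 };  // ternary weights
    const int8_t x[8] = { 3, 7, -2, 5, 1, 4, -6, 2 };  // int8 activations
    std::printf("dot = %.2f\n", ternary_dot(w, x, 8, 1.0f));
    return 0;
}
```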