
1.58 bit implementation

Open okpatil4u opened this issue 1 year ago • 5 comments

Would it be possible to implement 1.58-bit quantization in candle? It was proposed in the following paper:

https://arxiv.org/pdf/2402.17764.pdf
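
For reference, the paper quantizes each weight to {-1, 0, +1} with an "absmean" scheme: scale by the mean absolute value, round, and clip. A minimal sketch in plain Rust (the function name and layout are mine, not candle API):

```rust
/// Quantize a flattened weight tensor to ternary values using the
/// "absmean" scheme from the BitNet b1.58 paper: scale by the mean
/// absolute value, round to the nearest integer, clip to [-1, 1].
fn absmean_quantize(w: &[f32]) -> (Vec<i8>, f32) {
    let eps = 1e-6_f32;
    // gamma: mean absolute value of the weights.
    let gamma = w.iter().map(|v| v.abs()).sum::<f32>() / w.len() as f32;
    let scale = gamma + eps;
    let q = w
        .iter()
        .map(|v| (v / scale).round().clamp(-1.0, 1.0) as i8)
        .collect();
    // Keep `scale` around to dequantize: w ≈ (q as f32) * scale.
    (q, scale)
}
```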

The main appeal of a 1.58-bit implementation is that, with weights constrained to {-1, 0, +1}, matrix multiplication reduces to additions and subtractions. If that is feasible, then with the Apple Accelerate framework's SIMD instructions we could expect faster training and inference for large language models.
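
To make the addition-only idea concrete, here is a minimal sketch in plain Rust (illustrative only, not candle code) of a matrix-vector product where the weights are already ternary, so every multiply collapses into an add, a subtract, or a skip:

```rust
/// Matrix-vector product y = W * x for a row-major ternary weight matrix
/// (entries in {-1, 0, +1}). No multiplications by weights are needed:
/// +1 adds the activation, -1 subtracts it, and 0 is skipped.
fn ternary_matvec(weights: &[i8], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(weights.len(), rows * cols);
    assert_eq!(x.len(), cols);
    let mut y = vec![0.0_f32; rows];
    for r in 0..rows {
        let row = &weights[r * cols..(r + 1) * cols];
        let mut acc = 0.0_f32;
        for (&w, &xi) in row.iter().zip(x) {
            match w {
                1 => acc += xi,  // +1: add the activation
                -1 => acc -= xi, // -1: subtract the activation
                _ => {}          // 0: contributes nothing
            }
        }
        y[r] = acc;
    }
    y
}
```

Whether this actually beats a well-tuned SIMD matmul would depend on packing the ternary weights (e.g. 2 bits each) and vectorizing the add/subtract, which is where Accelerate or NEON intrinsics would come in.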

A couple of llama.cpp discussions here:

https://github.com/ggerganov/llama.cpp/issues/5761
https://github.com/ggerganov/llama.cpp/pull/5999

There is also a training library that was released a couple of days ago: https://github.com/rafacelente/bllama

Any thoughts?

okpatil4u avatar Mar 28 '24 11:03 okpatil4u

Are there some reference trained models somewhere? I haven't been able to find any so far.

LaurentMazare avatar Mar 29 '24 23:03 LaurentMazare

Apparently this one trains a 54M-parameter model from scratch.

https://github.com/pranavjad/tinyllama-bitnet

And this one describes a pretty good quantization technique that retains model performance. They have also released the model weights.

https://mobiusml.github.io/1bit_blog/

What is more interesting to me is the replacement of matrix multiplication with addition, which could lead to significant performance gains.

okpatil4u avatar Mar 30 '24 04:03 okpatil4u

And the official models are here:

https://huggingface.co/1bitLLM/bitnet_b1_58-3B

okpatil4u avatar Mar 30 '24 04:03 okpatil4u

Not sure how close to complete this is, but @tomsanbear has put up bitnet-rs, which seems to be a candle implementation of this architecture.

LaurentMazare avatar Apr 01 '24 17:04 LaurentMazare

Thanks @LaurentMazare, this is super helpful.

okpatil4u avatar Apr 02 '24 09:04 okpatil4u