
[QUESTION] Quantizing in a different way...

Open · 0wwafa opened this issue 1 year ago · 1 comment

Hello! I did some research (using llama.cpp) and I found that quantizing the input and embedding tensors to f16 and the remaining tensors to q5_k or q6_k gives excellent results, almost indistinguishable from pure f16, at about half the size.

Is it possible to do the same with bitsandbytes/transformers, so as to produce a model quantized this way from a standard model?

You can find my (gguf) quantizations at https://huggingface.co/ZeroWw for reference.

Thanks.

0wwafa avatar Jun 23 '24 18:06 0wwafa

Hey, nobody has answered here yet...

0wwafa avatar Jul 01 '24 23:07 0wwafa