
please add mixed quantizations.

Open 0wwafa opened this issue 1 year ago • 2 comments

Feature request

As of now, bitsandbytes only allows quantizing a model uniformly, with every tensor getting the same quantization type. That is fine, but I have found that in most cases the best results come from mixed quantization: keeping the output and embedding tensors in f16 while quantizing every other tensor to q8_0, q6_k, q5_k, q4_k, etc.

Motivation

In llama.cpp I can do exactly that, and everyone who has tested my quantizations has been happy with the results.

Your contribution

I can't help directly, but it should be fairly easy to implement: just add options to set the quantization of the embedding and output tensors (separately, as in llama-quantize).
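A minimal sketch of the requested policy, assuming a per-tensor-name dispatch similar to what llama-quantize does. The function name, the name patterns, and the quant-type strings here are illustrative only; they are not part of the bitsandbytes API:

```python
# Hypothetical per-tensor quantization policy: keep embedding and output
# tensors at f16, quantize everything else to a single default type.
# Patterns and names below are illustrative, not bitsandbytes API.

FULL_PRECISION_PATTERNS = ("embed", "lm_head", "output")

def select_quant_type(tensor_name: str, default: str = "q8_0") -> str:
    """Return the quantization type to use for a given tensor name."""
    lowered = tensor_name.lower()
    if any(p in lowered for p in FULL_PRECISION_PATTERNS):
        return "f16"   # keep sensitive tensors at half precision
    return default     # quantize the bulk of the weights

# Example: decide types for a few typical tensor names
for name in ["model.embed_tokens.weight",
             "model.layers.0.self_attn.q_proj.weight",
             "lm_head.weight"]:
    print(name, "->", select_quant_type(name))
```

As a partial workaround today, the Hugging Face `BitsAndBytesConfig` exposes `llm_int8_skip_modules`, which excludes the listed modules from quantization entirely (e.g. `["lm_head"]`); it does not, however, let you choose a different quant type per tensor as requested here.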

0wwafa avatar Jul 01 '24 23:07 0wwafa