Thoughts on Quantization Roadmap
I'm new to this specific project, and I don't say any of the following with high confidence.
Things that I see as important for quantization:
Inference speed
- AWQ seems best on this front, then GPTQ, then bitsandbytes-nf4.
Perplexity
- AWQ is better than GPTQ (especially GPTQ with act_order set to false)
- bnb nf4 quality seems better than AWQ and GPTQ
Ability to merge LoRA adapters to the base model
- This is a big problem with AWQ and GPTQ, and it means many people are training in bnb nf4, merging into an unquantized base model (also not ideal), and then re-quantizing (the merge itself is just folding a low-rank update into the dense weights, as sketched below).
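For context on the merge step: merging a LoRA adapter is just adding the low-rank delta back into the dense weight matrix, which is why it needs an unquantized (or dequantized) base before re-quantizing. A minimal NumPy sketch with hypothetical shapes (real merging walks every adapted layer):

```python
import numpy as np

# Hypothetical shapes for one linear layer; alpha/r is the usual LoRA scaling.
d_out, d_in, r, alpha = 4096, 4096, 16, 32

base_w = np.random.randn(d_out, d_in).astype(np.float32)   # dequantized base weight
lora_a = np.random.randn(r, d_in).astype(np.float32) * 0.01
lora_b = np.zeros((d_out, r), dtype=np.float32)            # B is initialized to zero in LoRA

# Merge: W' = W + (alpha / r) * B @ A, then W' would be re-quantized.
merged_w = base_w + (alpha / r) * (lora_b @ lora_a)
print(merged_w.shape)
```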
Ability to serialize models and push to the Hugging Face Hub
- Possible with AWQ and GPTQ, but not (yet) with bnb nf4, which leads to devs having to merge onto an unquantized base model and then push.
Time to quantize models
- GPTQ is the slowest, then AWQ.
- bnb quantization can be done on the fly in transformers during inference. This would be a great feature: start with a bf16 model and be able to quantize and run it on the fly (a minimal sketch of this style of quantization follows this list).
- GGUF is fast to quantize. This is a significant advantage when making/testing lots of models.
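To make the on-the-fly point concrete, here is a minimal NumPy sketch of data-free round-to-nearest absmax group quantization, the rough family that bnb and the simpler GGUF quant types belong to. This is not MLX or bitsandbytes API, just an illustration of why this style is fast and needs no calibration pass: each group only needs its own absmax scale.

```python
import numpy as np

def quantize_groups_int4(w: np.ndarray, group_size: int = 64):
    """Round-to-nearest 4-bit absmax quantization, one scale per group.

    No calibration data is needed: each group of weights is scaled by its
    own max absolute value and rounded to the nearest signed 4-bit level.
    """
    flat = w.reshape(-1, group_size).astype(np.float32)
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7.0   # map absmax to level 7
    scales[scales == 0] = 1.0                                # avoid divide-by-zero
    q = np.clip(np.round(flat / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_groups(q: np.ndarray, scales: np.ndarray, shape):
    return (q.astype(np.float32) * scales).reshape(shape)

# Example: quantize a fake weight matrix "on the fly".
w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_groups_int4(w)
w_hat = dequantize_groups(q, s, w.shape)
print("mean abs error:", np.abs(w - w_hat).mean())
```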
Robustness to quantization dataset choice
- AWQ, and even more so GPTQ, depend on their quantization (calibration) dataset.
- I think it's underappreciated that bnb nf4 and gguf aren't dataset dependent. This makes them more robust when running inference on content that is less similar to the quantization dataset.
Other notes
- I note that NF4 seems to do better than INT4, as it is more data-optimal (the levels are shaped like a normal distribution, which better matches how weights are typically distributed). I'm not sure how much more complexity NF4 kernels add over INT4 kernels (and I saw this nice issue: https://github.com/ml-explore/mlx/issues/71). Perhaps worth trying NF4; a rough sketch of the idea follows this list.
- Overall, I think there's a lot to be said for a type of quantization that is quick to make: devs save time and cost, and it becomes possible to quantize a bf16 model and run inference on the fly without too much delay.
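On the NF4 point, the idea is to place the 16 quantization levels at quantiles of a standard normal distribution instead of spacing them uniformly, so the codebook matches the roughly Gaussian shape of weight distributions. The construction below (via scipy's norm.ppf) is only an approximation of the real QLoRA/bitsandbytes NF4 table, which is built slightly differently (it includes an exact zero level, for instance); it's just a sketch of the concept.

```python
import numpy as np
from scipy.stats import norm

def nf4_codebook() -> np.ndarray:
    """Approximate NF4 levels: quantiles of N(0, 1), rescaled to [-1, 1]."""
    probs = np.linspace(0.02, 0.98, 16)      # avoid the infinite tails
    levels = norm.ppf(probs)
    return levels / np.abs(levels).max()     # normalize to [-1, 1]

def quantize_nf4(w: np.ndarray, group_size: int = 64):
    """Absmax group quantization onto the NF4-style codebook (round to nearest level)."""
    code = nf4_codebook()
    flat = w.reshape(-1, group_size).astype(np.float32)
    absmax = np.abs(flat).max(axis=1, keepdims=True)
    absmax[absmax == 0] = 1.0
    normed = flat / absmax                                   # now in [-1, 1]
    idx = np.abs(normed[..., None] - code).argmin(axis=-1)   # nearest codebook entry
    return idx.astype(np.uint8), absmax, code

w = np.random.randn(1024, 1024).astype(np.float32)
idx, absmax, code = quantize_nf4(w)
w_hat = (code[idx] * absmax).reshape(w.shape)
print("NF4-style mean abs error:", np.abs(w - w_hat).mean())
```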
To add to the mix: EXL2 is not only an inference library that can load GPTQ models; when a model is quantized with EXL2 itself, it performs better than a GPTQ-quantized model.
It also provides a whole gamut of quants, anything from 2 to 8 bits. And I believe bit widths can be mixed, enabling half-steps, so a 2.5 bits-per-weight (bpw) quant is possible (see the sketch after the link below).
https://github.com/turboderp/exllamav2
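A tiny sketch of where fractional averages like 2.5 bpw come from when bit widths are mixed across layers. The numbers here are hypothetical; EXL2's actual optimizer measures per-layer quantization error to choose the assignments against a target average.

```python
# Hypothetical per-layer parameter counts and bit assignments.
layer_params = {"attn": 1_000_000, "mlp": 3_000_000}
layer_bits   = {"attn": 4,         "mlp": 2}       # mix of widths across layers

total_bits   = sum(layer_params[k] * layer_bits[k] for k in layer_params)
total_params = sum(layer_params.values())
print(f"average bpw: {total_bits / total_params:.2f}")   # -> 2.50
```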
Because I'm just a curious cat relying on MLX to be its best 🙃, I do nothing with (and want nothing to do with) the nuts and bolts of quantization, array math and all. That is to say, YMMV.
Hi, I'm curious about how to make a quantized model work on MLX: should it be converted like the examples, or quantized using this framework?
This paper is also currently making some pretty big waves:
Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
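As I read it, the weight quantization described there is absmean scaling followed by round-to-nearest into {-1, 0, +1}. A minimal NumPy sketch of that step is below; note that the paper trains with this quantization in the loop (quantization-aware), so naively applying it post-hoc to an existing bf16 checkpoint would not reproduce their results.

```python
import numpy as np

def absmean_ternary(w: np.ndarray, eps: float = 1e-6):
    """Ternarize weights to {-1, 0, +1} with absmean scaling (BitNet b1.58 style).

    In the paper this is applied during training, not as a post-hoc
    conversion of a pretrained FP16/BF16 checkpoint.
    """
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1, 1)   # ternary values
    return q.astype(np.int8), scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = absmean_ternary(w)
w_hat = q.astype(np.float32) * scale          # dequantize for reference
print("unique levels:", np.unique(q), "scale:", round(float(scale), 4))
```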