AutoAWQ

3-bit or 6-bit quantization

Open khurramusman-10xe opened this issue 10 months ago • 3 comments

Hello! I have been playing around with AutoAWQ for a couple of weeks now and have managed to run it on LLaVA and then evaluate the quantized version using the lmms-eval library. I have now gotten to the point where I want to test the performance of 3-bit and 6-bit quantization.

I understand that the current kernels (I was using GEMM from the default example) only support 4-bit. However, I could not find any such limitation in the quantization process itself. To put it more concretely, the computation of the scales and the clipping is not tied to the kernel and could in theory work at any bit-width, so I can run the quantization at whatever resolution I want. The problem only appears at the very last step of the quantization process, when GEMM (or any of the other kernels) is called, or when you try to load the quantized model using the "from_quantized" method (assuming the last step of the quantization process is somehow taken care of). Those are the only two places where the kernel is invoked, and the "4-bit only" error breaks execution there.

My question: is there a simple way to run the quantized model without using the kernels, just for the purpose of performance evaluation? My immediate goal is to compare the performance at different quantization levels, and if there is an easy way to do this, that would be a good starting point. Further down the line, I would be interested in running the quantized models more efficiently, which is where optimized kernels would come in (AFAIK). If someone can also point me in the direction of how one would do that, or if there are any recipes for it, that would be useful as well. Thanks!
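For context, the flow I am running is roughly the following (a sketch based on the default AutoAWQ example; the model path and output directory are placeholders, and setting "w_bit" to anything other than 4 is exactly where it breaks):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "llava-hf/llava-1.5-7b-hf"   # placeholder for the model I was quantizing
quant_path = "llava-1.5-7b-awq-w3"        # placeholder output directory

# Scale search and clipping are bit-width agnostic, but the final packing
# step and the GEMM/GEMV kernels currently assume w_bit == 4.
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 3,        # errors out at the packing/kernel stage, not during scale search
    "version": "GEMM",
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```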

khurramusman-10xe avatar Jan 21 '25 04:01 khurramusman-10xe

We do not have a solution for storing weights in 3 or 6 bits, nor for running inference on them just yet. I'm open to PRs on this.

casper-hansen avatar Jan 21 '25 06:01 casper-hansen

Thanks for the response @casper-hansen.

I see -- I am still finding my way around this, but I have come across other quantization methods that support 2- or 3-bit quantization. Are you aware, at a high level, of how they do that? If you have any pointers, that would be helpful. And when I do end up figuring something out, I will be more than happy to contribute.

khurramusman-10xe avatar Jan 21 '25 07:01 khurramusman-10xe

To add to the above discussion, I believe it is still possible to quantize the weights but keep them in float16 or 32 for a bare-minimum performance evaluation, right? Of course, the memory and inference-speed gains won't be realized; it's just a crude way to see what a given quantization level does to model quality.
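A rough sketch of what I mean, in plain PyTorch (group-wise round-to-nearest quantize/dequantize; this skips AWQ's scale search and clipping, so it only approximates what the full pipeline would give):

```python
import torch

def fake_quantize(w: torch.Tensor, n_bits: int = 3, group_size: int = 128) -> torch.Tensor:
    """Group-wise asymmetric round-to-nearest quantize/dequantize.

    Assumes the last dimension is divisible by group_size. The result is
    returned in the original dtype, so no packed storage or custom kernel
    is needed -- only the accuracy impact of the n-bit grid is simulated.
    """
    orig_shape = w.shape
    w = w.reshape(-1, group_size)                    # [num_groups, group_size]
    w_max = w.amax(dim=1, keepdim=True)
    w_min = w.amin(dim=1, keepdim=True)
    q_max = 2 ** n_bits - 1
    scale = (w_max - w_min).clamp(min=1e-5) / q_max
    zero = (-w_min / scale).round()
    q = (w / scale + zero).round().clamp(0, q_max)   # integers on the n-bit grid
    return ((q - zero) * scale).reshape(orig_shape)  # dequantized, original dtype

# Apply to every Linear layer for evaluation only (no memory/speed gains):
# for module in model.modules():
#     if isinstance(module, torch.nn.Linear):
#         module.weight.data = fake_quantize(module.weight.data, n_bits=3)
```

Since the dequantized weights stay in float16/32, the existing float kernels can run the model untouched; only the quality impact of the chosen bit-width is being measured.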

khurramusman-10xe avatar Jan 21 '25 07:01 khurramusman-10xe