[Feature Request] 4bit and 2bit and 1bit quantization support

Open elephantpanda opened this issue 2 years ago • 20 comments

Describe the feature request

Support for quantizing models to 4-bit, 2-bit, and 1-bit and running them, as well as saving and loading these models in ONNX format for smaller file sizes.

The GPU doesn't necessarily have to support 4-bit operations natively, since GPU cores can be used to convert the weights to float or int8 operations when needed.
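
For illustration, here is a rough NumPy sketch of what I mean (not ONNX Runtime code, just the general idea): two 4-bit weights are packed per byte, and they are dequantized back to float on the fly. The asymmetric scale/zero-point scheme is only an assumption for the example.

```python
import numpy as np

def pack_int4(q):
    """Pack an even-length array of 4-bit codes (0..15) into uint8, two per byte."""
    q = np.asarray(q, dtype=np.uint8)
    return (q[0::2] & 0x0F) | ((q[1::2] & 0x0F) << 4)

def unpack_and_dequantize(packed, scale, zero_point):
    """Unpack the nibbles and map them back to float: w = (q - zero_point) * scale."""
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    q = np.empty(packed.size * 2, dtype=np.uint8)
    q[0::2], q[1::2] = lo, hi
    return (q.astype(np.float32) - zero_point) * scale

# Example: quantize float weights to 4 bits, pack, then recover approximations.
w = np.random.randn(8).astype(np.float32)
scale = (w.max() - w.min()) / 15.0
zero_point = round(-w.min() / scale)
q = np.clip(np.round(w / scale + zero_point), 0, 15).astype(np.uint8)
packed = pack_int4(q)                      # half the bytes of an int8 tensor
w_approx = unpack_and_dequantize(packed, scale, zero_point)
```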

Describe scenario use case

Some models, such as large language models, are very big but run fairly well when quantized down to 8-bit, 4-bit, 2-bit, or even 1-bit.

elephantpanda avatar Mar 10 '23 07:03 elephantpanda

Hi Pauldog, thanks for reaching out. We have received your message and put these requests under consideration!

Thank you for your time,

Jian Chen (not an A.I.)

jchen351 avatar Mar 10 '23 17:03 jchen351

Also, could you please provide more information about your scenarios, such as the hardware you want to run on and the models you are interested in? Again, our current priority is fp16 support, and we don't have any hardware that supports 4-bit or lower.

jchen351 avatar Mar 10 '23 18:03 jchen351

Sure, here is a very recent example of a practical use case:

Llama 4bit

As far as I'm aware it doesn't require 4-bit hardware: it simply stores the weights on the GPU in 4-bit and then uses GPU cores to convert them to int8 or float16 at runtime to do the calculations.
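
My understanding of how those 4-bit LLM formats work, as a rough sketch: block-wise symmetric quantization with one scale per group of weights, expanded back to float16 just before the matmul. This is an illustration of the idea, not llama.cpp's or ONNX Runtime's actual code, and the actual packing of two codes per byte (as in the sketch above) is omitted for brevity.

```python
import numpy as np

BLOCK = 32  # weights per block; each block gets its own scale (llama.cpp-style assumption)

def quantize_blockwise_4bit(w):
    """Symmetric 4-bit quantization: codes in [-8, 7], one float scale per block."""
    w = w.reshape(-1, BLOCK)
    scales = np.maximum(np.abs(w).max(axis=1, keepdims=True) / 7.0, 1e-12)
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_blockwise_4bit(q, scales):
    """Done on the fly at runtime: expand the 4-bit codes back to float16 for the matmul."""
    return (q.astype(np.float32) * scales).reshape(-1).astype(np.float16)

w = np.random.randn(4096).astype(np.float32)
q, scales = quantize_blockwise_4bit(w)
w16 = dequantize_blockwise_4bit(q, scales)
```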

The main benefit is the ability to run larger models on the same hardware.

Use cases would be:

  • Running very large language models on consumer hardware
  • Running large models on mobile hardware

Here are some papers:

https://arxiv.org/abs/1810.05723
https://arxiv.org/abs/2202.05292

and an article: https://karanbirchahal.medium.com/aggressive-quantization-how-to-run-mnist-on-a-4-bit-neural-net-using-pytorch-5703f3faa599

Now, I don't know whether onnxruntime can already support this or not? Technically, a 4-bit quantized model would presumably look like an 8-bit quantized model, since two 4-bit values are combined into one 8-bit value.
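
For example, a packed 4-bit weight can already travel inside a standard ONNX file as an ordinary uint8 initializer of half the length. Here is a small sketch using the onnx helper API (the runtime unpacking would still need to be expressed with regular ops or a custom op):

```python
import numpy as np
from onnx import TensorProto, numpy_helper

# 1024 4-bit weight codes packed two per byte -> a 512-element uint8 tensor.
codes = np.random.randint(0, 16, size=1024, dtype=np.uint8)
packed = (codes[0::2] & 0x0F) | ((codes[1::2] & 0x0F) << 4)

# To the ONNX file format this is just an ordinary 8-bit (uint8) initializer.
w_packed = numpy_helper.from_array(packed, name="w_packed_4bit")
w_scale = numpy_helper.from_array(np.array([0.01], dtype=np.float32), name="w_scale")
print(w_packed.data_type == TensorProto.UINT8, list(w_packed.dims))  # True [512]
```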

elephantpanda avatar Mar 10 '23 20:03 elephantpanda

Hey @jchen351, I'm wondering why this is closed? Shouldn't it stay open if this is being considered?

The WebML ecosystem in particular could really do with a 4-bit quantization solution, since model size is such an important factor on the web.

josephrocca avatar Mar 14 '23 03:03 josephrocca

100% agree with @josephrocca. 4-bit quantization would be massive for my Transformers.js library (and other WebML libraries)!

xenova avatar Mar 14 '23 03:03 xenova

@xenova @josephrocca The only hardware we know of that can support 4-bit quantization with a performance gain is the Nvidia A100, but we cannot get our hands on enough A100s, and the newer H100 has dropped that support. We don't foresee any performance gain from doing 4-bit quantization on any other popular hardware. So, until then, I will keep this closed :)

jchen351 avatar Mar 14 '23 21:03 jchen351

This repo supports 4-bit quantization: https://github.com/ggerganov/llama.cpp (And, as stated in the README, it runs on the CPU)

Also, considering that WASM uses a 32-bit address space (i.e., max 4GB), the only real way to get large models running on consumer hardware is quantization.
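
To put rough numbers on that (weights only, ignoring activations, KV cache, and quantization scales):

```python
# Rough weight-only memory footprint of a 7B-parameter model vs. the ~4 GB Wasm limit.
params = 7e9
for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    gib = params * bits / 8 / 2**30
    print(f"{name}: {gib:.1f} GiB")
# fp32: ~26 GiB, fp16: ~13 GiB, int8: ~6.5 GiB, int4: ~3.3 GiB -- only 4-bit fits.
```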

xenova avatar Mar 14 '23 21:03 xenova

@jchen351, yes, as xenova pointed out, this is more about running large models on hardware that has a small amount of memory, rather than performance improvements.

For example, please see this demo of LLaMA 7B running on a Pixel 5 at 1 token/sec using 4-bit quantization: https://twitter.com/ggerganov/status/1635605532726681600

So this issue can probably be re-opened, considering it is viable to gain this benefit without hardware support? llama.cpp has grown faster than the original Stable Diffusion repo (which was one of the fastest-growing of all time) because it allows people to run big models on small hardware -- there's definitely demand for this! :)

josephrocca avatar Mar 15 '23 01:03 josephrocca

@jchen351, can we have a second look at this? It's not really about performance, but rather about allowing models to run in places they couldn't before. I insist!

It just seems like the really valid points those guys made were simply ignored.

skyne98 avatar Apr 24 '23 09:04 skyne98

Please re-open; everyone is using 4- and 5-bit quantization now.

tikikun avatar Jun 02 '23 01:06 tikikun

Re-opening this. This should not be closed.

jywu-msft avatar Jul 06 '23 20:07 jywu-msft

+@yufenglee FYI

jywu-msft avatar Jul 06 '23 20:07 jywu-msft

Hi everyone! I have successfully quantized a diffusion model to 2-bit and manually packed the weights into uint8 format (storing 4x 2-bit weights in one uint8 value) in PyTorch. During inference, they are unpacked to float format for calculation. This way, the model size is reduced from 1545 MB to 150 MB, and the VRAM for loading the model is also greatly reduced (from 2500 MB to 1000 MB) in PyTorch. However, when I export the model to ONNX, only the model size is reduced (to around 190 MB); the VRAM for loading the model can still reach 3000 MB. I guess the uint8 parameters are cast to int32 or float32 when the ONNX model is loaded.

Any ideas on how to lower the VRAM for loading this ONNX model? I have uploaded the model to Google Drive.
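
Roughly, the packing and unpacking I'm describing looks like this in PyTorch (a simplified sketch of the idea, not the exact model code):

```python
import torch

def pack_2bit(q):
    """Pack 4x 2-bit codes (values 0..3) into each uint8."""
    q = q.to(torch.uint8).reshape(-1, 4)
    return q[:, 0] | (q[:, 1] << 2) | (q[:, 2] << 4) | (q[:, 3] << 6)

def unpack_2bit_to_float(packed, scale, zero_point):
    """Unpack at inference time and dequantize: w = (q - zero_point) * scale."""
    q = torch.stack([(packed >> s) & 0x3 for s in (0, 2, 4, 6)], dim=1).reshape(-1)
    return (q.float() - zero_point) * scale

w = torch.randn(1024)
scale = (w.max() - w.min()) / 3.0
zero_point = torch.round(-w.min() / scale)
q = torch.clamp(torch.round(w / scale + zero_point), 0, 3)
packed = pack_2bit(q)                      # 1/4 the bytes of a uint8 tensor
w_approx = unpack_2bit_to_float(packed, scale, zero_point)
```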

ThisisBillhe avatar Jul 12 '23 01:07 ThisisBillhe

2-bit diffusion model? Does it actually produce images?

Guess you could try packing 16 2-bit values into an int32.

elephantpanda avatar Jul 12 '23 02:07 elephantpanda

The work is in progress. I guess you make a good point; I will give it a try.

ThisisBillhe avatar Jul 12 '23 02:07 ThisisBillhe

Are there any branches or forks with the 2 x 4-bit packing?

dfiru avatar Oct 14 '23 15:10 dfiru

I noticed this point in the v1.16.0 release notes (3 weeks ago):

Support 4-bit quantization on CPU

I haven't tried it yet. @xenova I'm curious if you've tried this yet with the Web Wasm backend?
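
If I understand the new tooling correctly, usage looks roughly like the sketch below (the MatMul4BitsQuantizer class under onnxruntime.quantization; parameter names and defaults may differ between versions, so treat this as a sketch rather than the definitive API):

```python
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

# Quantize MatMul weights to block-wise 4-bit, leaving the rest of the graph as-is.
model = onnx.load("model.onnx")
quantizer = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
quantizer.process()
quantizer.model.save_model_to_file("model_int4.onnx", use_external_data_format=True)
```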

josephrocca avatar Oct 15 '23 03:10 josephrocca

https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/quant_utils.py#L71 QuantType still doesn't include it.
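
(For reference, the types exposed by an installed build can be listed like this:)

```python
from onnxruntime.quantization import QuantType

# List the quantization types the installed onnxruntime build exposes.
print([t.name for t in QuantType])
```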

dfiru avatar Oct 15 '23 03:10 dfiru

Any updates on this issue?

Fritskee avatar Feb 20 '24 12:02 Fritskee

4 bit would indeed be great. Any updates?

ogencoglu avatar Feb 24 '24 11:02 ogencoglu

Being able to convert an HF model for 4-bit quantization would be awesome!!

ideasbyjin avatar Jun 17 '24 14:06 ideasbyjin

The QLLM tool can convert a 4-bit HF model to ONNX: https://github.com/wejoncy/QLLM. And a tool from the ORT Generate API can also convert it with this PR: https://github.com/microsoft/onnxruntime-genai/pull/600

yufenglee avatar Jun 18 '24 17:06 yufenglee

Thanks, I might be missing something, but for my models (which are encoder-only), I'm not sure how to get it to work. I was able to 4-bit quantize them using BitsAndBytes on HF, but not export them to ONNX.

ideasbyjin avatar Jun 20 '24 13:06 ideasbyjin

Hi, I see ONNX now supports a 4-bit data type. Is there any more information about how to make use of it and quantize models down to 4-bit?
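
(For what it's worth, recent versions of the onnx Python package do expose 4-bit element types, which can be checked as below; whether a given onnxruntime build actually consumes them is a separate question.)

```python
from onnx import TensorProto

# Available in onnx >= 1.16; raises AttributeError on older versions.
print(TensorProto.INT4, TensorProto.UINT4)
```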

elephantpanda avatar Jul 24 '24 02:07 elephantpanda