Support 4bit on CPU backend
Adds implementations of the following ops on the CPU backend:
- quantize_4bit
- dequantize_4bit
- gemv_4bit
Limitations:
- `quant_storage` must be `torch.uint8`
- `compress_statistics` is not supported yet (`bnb_4bit_use_double_quant` must be `False`)
- `fp4` is currently slow because there is no fused kernel for it yet
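For reference, here is a minimal round-trip sketch using the `bitsandbytes.functional` API. It assumes the ops dispatch to the new CPU kernels when the input tensor lives on CPU; the shapes and the error check are illustrative.

```python
import torch
import bitsandbytes.functional as F

# Random weights on the CPU device.
A = torch.randn(4096, 4096, dtype=torch.float32, device="cpu")

# Quantize to NF4. Per the limitations above, quant_storage must be
# torch.uint8 and compress_statistics must stay False on the CPU backend.
packed, quant_state = F.quantize_4bit(
    A,
    blocksize=64,
    compress_statistics=False,
    quant_type="nf4",
    quant_storage=torch.uint8,
)

# Dequantize and inspect the round-trip error.
A_dq = F.dequantize_4bit(packed, quant_state)
print((A - A_dq).abs().mean())
```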
Differences from the CUDA implementation:
- On the CPU backend, `A` is not required to be a vector to go to the fused dequant-gemm kernel; CUDA requires that. The op is therefore still called `gemv_4bit`, but on the CPU backend it actually performs a GEMM (see the sketch after this list).
- Numerical accuracy differs due to different kernel implementations.
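To illustrate the naming note above, a hedged sketch: on CUDA this call would require the activation to be a single row, while the CPU backend accepts a full matrix. Transposing the packed weight follows the calling convention `matmul_4bit` uses on the CUDA path; treat the exact shapes as illustrative.

```python
import torch
import bitsandbytes.functional as F

x = torch.randn(8, 4096, dtype=torch.float32)      # batch of 8 rows, not a single vector
W = torch.randn(11008, 4096, dtype=torch.float32)  # linear-layer weight

W_q, state = F.quantize_4bit(W, quant_type="nf4", quant_storage=torch.uint8)

# On CUDA, gemv_4bit insists that x has a single row (hence "gemv").
# On the CPU backend, the same op performs a full fused dequant-GEMM.
out = F.gemv_4bit(x, W_q.t(), state=state)
print(out.shape)  # torch.Size([8, 11008])
```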
Here is an example code snippet that runs Hugging Face models with 4-bit on the CPU backend: https://gist.github.com/Xia-Weiwen/592d6e24e03f904a18692b3e27794c53. You will have to bypass the CUDA checks in transformers to run it.
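For orientation, a minimal sketch of what such a script looks like using the standard transformers quantization config. It assumes the CUDA checks mentioned above have already been bypassed (see the gist), and the model id is an illustrative placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 config matching the CPU-backend limitations: double quantization off,
# default torch.uint8 storage.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,  # compress_statistics is unsupported on CPU
    bnb_4bit_compute_dtype=torch.float32,
)

model_id = "facebook/opt-350m"  # illustrative placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

inputs = tokenizer("The CPU backend for 4-bit is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0]))
```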
cc @jiqing-feng @jgong5 @jianan-gu