
Support 4bit on CPU backend


Adds implementations of the following ops on the CPU backend:

  • quantize_4bit
  • dequantize_4bit
  • gemv_4bit
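
For reviewers, here is a minimal usage sketch of the first two ops, assuming the existing `bitsandbytes.functional` signatures from the CUDA path carry over unchanged to CPU tensors:

```python
import torch
import bitsandbytes.functional as F

# Weight to quantize on CPU; nf4 is used here since fp4 has no fused kernel yet
A = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cpu")

# Per the limitations below: quant_storage must be torch.uint8 and
# compress_statistics must stay False on the CPU backend
qA, state = F.quantize_4bit(
    A, blocksize=64, compress_statistics=False,
    quant_type="nf4", quant_storage=torch.uint8,
)

# Round-trip back to bf16; expect a small block-wise quantization error
A_dq = F.dequantize_4bit(qA, state)
print((A - A_dq).abs().max())
```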

Limitations:

  • quant_storage must be torch.uint8
  • compress_statistics is not supported yet (bnb_4bit_use_double_quant must be False)
  • fp4 is currently slow because there is no fused kernel for it yet.

Differences from the CUDA implementation:

  • On the CPU backend, the input A does not have to be a vector to reach the fused dequant-gemm kernel, whereas CUDA requires that. The op is therefore named gemv_4bit, but on the CPU backend it actually performs a full GEMM (see the sketch after this list).
  • Numerical accuracy differs slightly because the kernel implementations differ.
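
To illustrate the first point, here is a small sketch of calling gemv_4bit with a batched A on CPU. It assumes the B.t() calling convention that `bitsandbytes.matmul_4bit` uses internally on the CUDA path:

```python
import torch
import bitsandbytes.functional as F

# Quantize a weight matrix (out_features x in_features) to nf4 on CPU
W = torch.randn(256, 512, dtype=torch.bfloat16, device="cpu")
qW, state = F.quantize_4bit(W, quant_type="nf4", quant_storage=torch.uint8)

# On CUDA, gemv_4bit only accepts a single-row A (a true GEMV); with this
# PR, the CPU backend also accepts a batched A and performs a full GEMM.
A = torch.randn(8, 512, dtype=torch.bfloat16, device="cpu")
out = F.gemv_4bit(A, qW.t(), state=state)  # B.t(), following matmul_4bit
print(out.shape)  # expected: torch.Size([8, 256])
```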

Here is an example code snippet for running HuggingFace models with 4-bit quantization on the CPU backend: https://gist.github.com/Xia-Weiwen/592d6e24e03f904a18692b3e27794c53. Note that you will have to bypass the CUDA checks in transformers to run it.
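
The gist above is the authoritative example; in rough outline, the flow looks like the sketch below. The model id is a placeholder, and the snippet assumes the transformers CUDA checks have already been patched out:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Double quant must stay off on the CPU backend; nf4 avoids the slow fp4 path
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "facebook/opt-350m"  # placeholder model; see the gist for the real setup
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="cpu"
)

inputs = tokenizer("The capital of France is", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```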


cc @jiqing-feng @jgong5 @jianan-gu

Xia-Weiwen · May 10, 2024