Support 4bit on CPU backend
Adds implementations of the following ops on the CPU backend:
- quantize_4bit
- dequantize_4bit
- gemv_4bit
Limitations:
- `quant_storage` must be `torch.uint8`
- `compress_statistics` is not supported yet (`bnb_4bit_use_double_quant` must be `False`)
- `fp4` is currently slow because there is no fused kernel for it yet
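For reference, here is a minimal round-trip sketch using the `bitsandbytes.functional` API. It assumes the ops dispatch to the new CPU kernels when the input tensor lives on CPU; the shapes and the error check are illustrative.

```python
import torch
import bitsandbytes.functional as F

# Random weights on the CPU device.
A = torch.randn(4096, 4096, dtype=torch.float32, device="cpu")

# Quantize to NF4. Per the limitations above, quant_storage must be
# torch.uint8 and compress_statistics must stay False on the CPU backend.
packed, quant_state = F.quantize_4bit(
    A,
    blocksize=64,
    compress_statistics=False,
    quant_type="nf4",
    quant_storage=torch.uint8,
)

# Dequantize and inspect the round-trip error.
A_dq = F.dequantize_4bit(packed, quant_state)
print((A - A_dq).abs().mean())
```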
Differences from the CUDA implementation:
- On the CPU backend, `A` is not required to be a vector to go to the fused dequant-gemm kernel; CUDA requires that. The op is therefore still called `gemv_4bit`, but on the CPU backend it actually performs a GEMM (see the sketch after this list).
- Numerical accuracy differs due to different kernel implementations.
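To illustrate the naming note above, a hedged sketch: on CUDA this call would require the activation to be a single row, while the CPU backend accepts a full matrix. Transposing the packed weight follows the calling convention `matmul_4bit` uses on the CUDA path; treat the exact shapes as illustrative.

```python
import torch
import bitsandbytes.functional as F

x = torch.randn(8, 4096, dtype=torch.float32)      # batch of 8 rows, not a single vector
W = torch.randn(11008, 4096, dtype=torch.float32)  # linear-layer weight

W_q, state = F.quantize_4bit(W, quant_type="nf4", quant_storage=torch.uint8)

# On CUDA, gemv_4bit insists that x has a single row (hence "gemv").
# On the CPU backend, the same op performs a full fused dequant-GEMM.
out = F.gemv_4bit(x, W_q.t(), state=state)
print(out.shape)  # torch.Size([8, 11008])
```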
Here is an example code snippet that runs Hugging Face models with 4-bit on the CPU backend: https://gist.github.com/Xia-Weiwen/592d6e24e03f904a18692b3e27794c53. You will have to bypass the CUDA checks in transformers to run it.
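For orientation, a minimal sketch of what such a script looks like using the standard transformers quantization config. It assumes the CUDA checks mentioned above have already been bypassed (see the gist), and the model id is an illustrative placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 config matching the CPU-backend limitations: double quantization off,
# default torch.uint8 storage.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,  # compress_statistics is unsupported on CPU
    bnb_4bit_compute_dtype=torch.float32,
)

model_id = "facebook/opt-350m"  # illustrative placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

inputs = tokenizer("The CPU backend for 4-bit is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0]))
```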
cc @jiqing-feng @jgong5 @jianan-gu