FLUTE Integration for Fast Inference
Feature request
Hi, we are big fans of the library and the NF4 data type, so much so that we have been working on CUDA kernels to speed up inference for NF4-quantized models (and more). We'd love to explore ways to integrate FLUTE into the bitsandbytes library.
Motivation
I think FLUTE could make two contributions to bitsandbytes:
- An inference-time kernel for NF4-quantized LLMs (and more). For example, this could help extend the current GEMV kernel into GEMM (i.e., batched inference); a sketch of what an opt-in dispatch might look like follows this list.
- A new quantization algorithm that lightly extends bitsandbytes's NF4 into a learned variant (illustrated after the first sketch below).
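To make the opt-in idea concrete, here is a minimal sketch of how a batched path could hand off to FLUTE while leaving the existing GEMV path untouched. `flute.qgemm` is a placeholder name for FLUTE's fused kernel, not a confirmed API; the bitsandbytes calls are the existing public `quantize_4bit`/`dequantize_4bit` functions.

```python
import torch
import bitsandbytes.functional as F

def matmul_nf4(A, W_packed, quant_state, use_flute=False):
    # Hypothetical opt-in dispatch: batched inputs go to FLUTE's GEMM.
    if use_flute and A.shape[0] > 1:
        import flute  # hypothetical FLUTE Python package
        return flute.qgemm(A, W_packed, quant_state)  # placeholder API
    # Fallback: dequantize the NF4 weight and use a dense matmul
    # (bitsandbytes' fused GEMV covers the batch-size-1 case).
    W = F.dequantize_4bit(W_packed, quant_state)
    return A @ W.t()

# Usage: quantize a weight to NF4, then multiply a batch of activations.
W = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
W_packed, state = F.quantize_4bit(W, quant_type="nf4")
x = torch.randn(8, 4096, dtype=torch.float16, device="cuda")
y = matmul_nf4(x, W_packed, state, use_flute=False)
```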
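And here is one way to read "learned NF4": start from the fixed 16-value NF4 grid and refine the code values per tensor with a Lloyd/k-means-style update. This is only an illustration under that assumption, not FLUTE's actual algorithm.

```python
import torch

# The 16 NF4 code values (from the QLoRA reference implementation).
NF4_CODE = torch.tensor([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.4407098591327667, 0.5626170039176941,
    0.7229568362236023, 1.0,
])

def refine_codebook(w, code, steps=10):
    """Lloyd-style refinement of a 16-entry codebook on an
    absmax-normalized weight tensor. Illustration only."""
    code = code.clone()
    for _ in range(steps):
        # Assign every weight to its nearest code value ...
        idx = torch.argmin((w[:, None] - code[None, :]).abs(), dim=1)
        # ... then move each code value to the mean of its assignments.
        for k in range(code.numel()):
            mask = idx == k
            if mask.any():
                code[k] = w[mask].mean()
    return code

# Adapt the grid to one weight tensor, normalized the way blockwise
# quantization would normalize it.
w = torch.randn(4096 * 64)
w = w / w.abs().max()
learned_code = refine_codebook(w, NF4_CODE)
```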
The kernel and algorithm still have room for improvement. For example, the kernel currently supports a limited set of GPUs and is specialized for specific matrix shapes. These are hopefully not hard limits and could be relaxed. As such, we do not expect FLUTE to be the default option; that being said, we thought it would be great to have FLUTE as an opt-in feature.
Your contribution
We are happy to submit PRs, though I'm sure there will be some rough edges we'll need help with.