
FLUTE Integration for Fast Inference

Open HanGuo97 opened this issue 7 months ago • 12 comments

Feature request

Hi, we are big fans of the library and the NF4 data type, so much so that we have been working on CUDA kernels to speed up inference for NF4-quantized models (and more). We'd love to explore ways we can integrate FLUTE into the bitsandbytes library.

Motivation

I think FLUTE could make two contributions to bitsandbytes:

  • An inference-time kernel for NF4-quantized LLMs (and more). For example, this could help extend the current GEMV kernel into GEMM (i.e., batched inference) — see the sketch after this list.
  • A new quantization algorithm that lightly extends bitsandbytes's NF4 into a learned version.
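
To make the first point concrete, here is a minimal PyTorch sketch of the lookup-table (LUT) dequantization idea behind such a kernel. Everything below is illustrative pseudocode, not FLUTE's or bitsandbytes' actual API; a fused kernel would perform the unpack, table lookup, and matmul in a single pass without materializing the full-precision weights.

```python
import torch

# NF4 is a 16-entry codebook; a "learned" variant would calibrate or
# train this table per model instead of using the fixed NF4 values.
lut = torch.randn(16)  # stand-in for the 16 NF4 code values

# Quantized weights stored as 4-bit indices, two per byte.
packed = torch.randint(0, 256, (4096, 2048), dtype=torch.uint8)

def dequant_lut(packed: torch.Tensor, lut: torch.Tensor) -> torch.Tensor:
    # Unpack the high and low nibbles into 4-bit indices...
    hi = packed >> 4
    lo = packed & 0xF
    idx = torch.stack((hi, lo), dim=-1).reshape(packed.shape[0], -1)
    # ...and replace each index with its codebook value.
    return lut[idx.long()]

x = torch.randn(8, 4096)      # batch size > 1: this is a GEMM, not a GEMV
w = dequant_lut(packed, lut)  # (4096, 4096) dequantized weights
y = x @ w                     # a fused kernel would avoid materializing w
```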

The kernel and algorithm still have room for improvement: for example, FLUTE currently supports a limited set of GPUs and specializes in specific matrix shapes. These are hopefully not hard limits and can be relaxed over time. As such, we do not expect FLUTE to become the default option; rather, we think it would be great to offer it as an opt-in feature.
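
On the opt-in side, here is one hypothetical shape the integration could take. The `backend="flute"` argument is invented purely for illustration; the existing `bnb.nn.Linear4bit` layer has no such parameter today.

```python
import bitsandbytes as bnb

# Today's API: a 4-bit linear layer with the NF4 data type.
layer = bnb.nn.Linear4bit(4096, 4096, quant_type="nf4")

# Hypothetical opt-in (not a real argument): route inference through
# the FLUTE kernel when the GPU and weight shapes are supported.
# layer = bnb.nn.Linear4bit(4096, 4096, quant_type="nf4", backend="flute")
```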

Your contribution

We are happy to submit PRs, though I'm sure there will be some rough edges that we will need some help with.

HanGuo97 · Jul 25 '24 14:07