
Support for NF4?

Open hoagy-davis-digges opened this issue 1 year ago • 1 comments

Is there a plan to include support for the NF4 data type from the qlora paper?

hoagy-davis-digges avatar Aug 07 '23 09:08 hoagy-davis-digges

As far as I can tell it's very hard to use efficiently in CUDA, since you need to run every quantized element through a lookup table, or, as they've done in the bitsandbytes CUDA kernels, through a tree of conditional statements like:

  if((val & 0b1000) == 8)
    if((val & 0b0100) == 4) // 1
      if((val & 0b0010) == 2) // 11
        if((val & 0b0001) == 1) // 111
          return 1.0f; 
        else
          return 0.7229568362236023f;
      else
        if((val & 0b0001) == 1) // 110
          return 0.5626170039176941f; 
        else
          return 0.44070982933044434f; 
    else
      if((val & 0b0010) == 2) //10
        if((val & 0b0001) == 1) // 101
          return 0.33791524171829224f; 
        else
          return 0.24611230194568634f; 
      else 
        if((val & 0b0001) == 1) // 100
          return 0.16093020141124725f; 
        else
          return 0.07958029955625534f;

Now, I have to assume they tested this to be faster than a constant-memory lookup table, but it's still a bit of a thread divergence horror show. There are so many places where this can (and will) stall each warp, every time you load an element.
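For comparison, the lookup-table alternative mentioned above would be a single indexed load per element instead of four nested branches. A minimal CPU-side sketch (the positive half of the table matches the return values in the excerpt above; the negative half is copied from the QLoRA/bitsandbytes NF4 codebook, so treat those constants as an assumption):

    #include <assert.h>

    /* The 16 NF4 code values. Indices 8..15 match the returns in the
       branch tree quoted above; indices 0..7 are the negative half of
       the codebook (assumed copied correctly from bitsandbytes). */
    static const float nf4_codebook[16] = {
        -1.0f, -0.6961928009986877f, -0.5250730514526367f,
        -0.39491748809814453f, -0.28444138169288635f,
        -0.18477343022823334f, -0.09105003625154495f, 0.0f,
        0.07958029955625534f, 0.16093020141124725f,
        0.24611230194568634f, 0.33791524171829224f,
        0.44070982933044434f, 0.5626170039176941f,
        0.7229568362236023f, 1.0f
    };

    /* Branch-free dequantization: one indexed load, no divergence.
       In a CUDA kernel the table would live in __constant__ memory
       (or be broadcast from shared memory) so all lanes hit the
       constant cache. */
    static inline float dequant_nf4(unsigned char val)
    {
        return nf4_codebook[val & 0x0F];
    }

    int main(void)
    {
        /* 0b1111 -> 1.0f, same as the deepest branch in the tree above */
        assert(dequant_nf4(0x0F) == 1.0f);
        /* 0b1000 -> 0.07958..., same as the "else" leaf above */
        assert(dequant_nf4(0x08) == 0.07958029955625534f);
        return 0;
    }

Whether the indexed load actually beats the branch tree depends on constant-cache behavior when lanes index different entries, which is presumably what bitsandbytes benchmarked.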

I'm also not convinced there's any benefit for inference compared to linear quantization on a variable grid, as used in GPTQ, GGML etc. (Quantized training and finetuning is another matter, of course.)
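To illustrate the comparison: with linear quantization on a grid, dequantization needs no table at all, just one fused multiply-add per element. A rough sketch of a GPTQ-style group dequant (the layout and names here are hypothetical, for illustration only):

    #include <assert.h>

    /* GPTQ/GGML-style linear dequantization: each group of weights
       shares a scale and zero point, and each 4-bit element is
       recovered with a single multiply-subtract -- no lookup table,
       no branching. */
    static inline float dequant_linear4(unsigned char q, float scale,
                                        float zero)
    {
        return scale * (float)(q & 0x0F) - zero;
    }

    int main(void)
    {
        /* q=5, scale=0.5, zero=1.0 -> 0.5*5 - 1.0 = 1.5 */
        assert(dequant_linear4(5, 0.5f, 1.0f) == 1.5f);
        return 0;
    }

The per-group scale/zero gives the "variable grid": the grid spacing adapts to each group's range, which is the flexibility NF4's fixed nonlinear codebook is competing against.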

So the short answer would be no, no immediate plans to support NF4 tensors.

turboderp avatar Aug 07 '23 10:08 turboderp