exllama
Support for NF4?
Is there a plan to include support for the NF4 data type from the qlora paper?
As far as I can tell it's very hard to use efficiently in CUDA, since you need to run every quantized element through a lookup table, or, as they've done in the bitsandbytes CUDA kernels, through a tree of conditional statements like:
```c
if((val & 0b1000) == 8)
    if((val & 0b0100) == 4) // 1
        if((val & 0b0010) == 2) // 11
            if((val & 0b0001) == 1) // 111
                return 1.0f;
            else
                return 0.7229568362236023f;
        else
            if((val & 0b0001) == 1) // 110
                return 0.5626170039176941f;
            else
                return 0.44070982933044434f;
    else
        if((val & 0b0010) == 2) // 10
            if((val & 0b0001) == 1) // 101
                return 0.33791524171829224f;
            else
                return 0.24611230194568634f;
        else
            if((val & 0b0001) == 1) // 100
                return 0.16093020141124725f;
            else
                return 0.07958029955625534f;
```
Now, I have to assume they tested this to be faster than a constant-memory lookup table, but it's still a bit of a thread divergence horror show. There are so many places where this can (and will) stall each warp every time you load an element.
I'm also not convinced there's any benefit for inference compared to linear quantization on a variable grid, as used in GPTQ, GGML etc. (Quantized training and finetuning is another matter, of course.)
So the short answer would be no, no immediate plans to support NF4 tensors.