SqueezeLLM Support?
https://github.com/SqueezeAILab/SqueezeLLM

Is this something exllama will support out of the box? What would integrating support look like?
I've had a quick look at SqueezeLLM. From what I can tell it's another quantization scheme that makes big promises but isn't even fully published yet. There's just example code for inference, nothing for converting models, and we don't know how it will affect 33B and 65B, which are usually less sensitive to quantization anyway.
That said, it's essentially just GPTQ plus a lookup table, so it might be fairly quick to implement. It's probably going to be fairly slow because of that extra indirection, but maybe I can come up with some neat trick to use it efficiently. I expect many would want the 3-bit version, though, and that would take a bit more work to add.
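Just to illustrate what "GPTQ plus a lookup table" means for a kernel: instead of dequantizing with a scale and zero point, you'd index into a small per-channel table of values. A minimal CUDA sketch, assuming 4-bit indices and a 16-entry fp16 table per column (the layout and names here are my guesses, not exllama's API or the paper's reference code):

```cuda
#include <cuda_fp16.h>
#include <cstdint>

// Dequantize a packed 4-bit weight matrix through a lookup table.
// q_weight: [rows / 8, cols] uint32, eight 4-bit indices packed per word
// lut:      [cols, 16] fp16, one 16-entry table per column (assumed layout)
// out:      [rows, cols] fp16
__global__ void dequant_lut4(const uint32_t* q_weight, const half* lut,
                             half* out, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row8 = blockIdx.y;                      // which group of 8 rows
    if (col >= cols || row8 * 8 >= rows) return;

    // Pull this column's 16-entry table into a local array once
    half table[16];
    for (int i = 0; i < 16; i++) table[i] = lut[col * 16 + i];

    uint32_t packed = q_weight[row8 * cols + col];
    for (int i = 0; i < 8; i++)
    {
        int idx = (packed >> (i * 4)) & 0x0f;   // unpack one 4-bit index
        out[(row8 * 8 + i) * cols + col] = table[idx];
    }
}
```

Where GPTQ computes something like `scale * (q - zero)`, this has to chase an index instead, and the table either sits in shared/local memory or gets re-read per element. That indirection is the part I'd expect to be slow without some trick.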
But we'll have to see. They'll at least need to release their code for converting models, since otherwise all we've got are base Llama 7B and 13B. Oh, and Vicuna. But that's still not all that exciting.