Hope for additional backend inference support for ExLlama
I hope additional backend inference support can be added for ExLlama. I have tested it, and ExLlama is truly fast: Llama2-7b runs about twice as fast, and Llama2-70b is 50% to 80% faster.
Yes, ExLlama seems very impressive. Do you know what kind of low-level interface they offer? LMQL requires full distribution access for masking and logprobs.
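Concretely, "full distribution access" means the backend has to hand back the complete next-token logits so that a token mask can be applied and logprobs computed, rather than only returning a sampled token. Roughly, the interaction looks like this (simplified sketch for illustration, not the actual backend code):

import torch

def constrained_next_token(logits: torch.Tensor, allowed_token_ids: list[int]):
    # logits: the backend's full next-token distribution, shape (vocab_size,)
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0
    # Disallowed tokens get -inf, so they receive zero probability after softmax.
    logprobs = torch.log_softmax(logits + mask, dim=-1)
    next_token = int(torch.argmax(logprobs))  # greedy pick, for illustration only
    return next_token, logprobs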
Unfortunately, I couldn't find any official ExLlama documentation on how to use logit_bias, but I found this PR: https://github.com/turboderp/exllama/pull/104. In it, "zsmarty" seems to have found a solution. He wrote a function like the one below; hopefully it gives us some ideas.
def generate_token_with_bias(self, prefix, logit_bias, startNewRequest=True):
    # Method on the ExLlama generator: generates a single token while applying a logit bias.
    self.end_beam_search()
    if prefix and len(prefix) > 0:
        # Tokenize the prefix and either start a new sequence or append to the existing one.
        ids, mask = self.tokenizer.encode(prefix, return_mask=True,
                                          max_seq_len=self.model.config.max_seq_len)
        if startNewRequest:
            self.gen_begin(ids, mask=mask)
        else:
            self.gen_feed_tokens(ids, mask)
    # Sample one token with the supplied bias added to the logits, then decode it.
    token = self.gen_single_token(logit_bias=logit_bias)
    text = self.tokenizer.decode(token)
    return text[0]
I am not sure whether this is suitable for LMQL, though.
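If something like that logit_bias hook were exposed, I imagine LMQL's token masking could be emulated by passing a bias vector that sends every disallowed token to negative infinity. A rough sketch, assuming the method above has been added to an ExLlama generator instance and that the config exposes vocab_size (the variable names and token ids are just illustrative):

import torch

# generator is assumed to be an ExLlama generator patched with generate_token_with_bias.
vocab_size = generator.model.config.vocab_size
allowed_token_ids = [396, 708, 905]            # whatever the current constraint permits

logit_bias = torch.full((vocab_size,), float("-inf"))
logit_bias[allowed_token_ids] = 0.0            # leave allowed tokens untouched

next_text = generator.generate_token_with_bias("The answer is", logit_bias)
print(next_text)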
You can load GPTQ models using AutoGPTQ, which uses ExLlama v2 by default (if installed).
import lmql

modelpath = "TheBloke/vicuna-7B-v0-GPTQ"
m = lmql.model(modelpath, tokenizer=modelpath, loader="auto-gptq",
               disable_exllamav2=False, use_safetensors=True,
               inject_fused_attention=False, inject_fused_mlp=False)
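In case it helps others reproduce this, here is a rough usage sketch for pointing a query at that model handle, assuming the standard @lmql.query decorator and the `m` variable from above; the prompt and constraints are just placeholders:

@lmql.query(model=m)
def summarize(text):
    '''lmql
    "Summarize in one sentence: {text}\n"
    "Summary: [SUMMARY]" where STOPS_AT(SUMMARY, "\n") and len(TOKENS(SUMMARY)) < 60
    return SUMMARY
    '''

print(summarize("LMQL is a programming language for interacting with LLMs."))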
After some testing, it's pretty finicky and needs tight constraints, but it's hard to tell whether it's the backend or the quant itself.