
Hope for additional backend inference support for ExLlama.

Open zhangyuhanjc opened this issue 1 year ago • 3 comments

I hope you can add ExLlama as an additional inference backend. I have tested it, and ExLlama is truly fast: Llama2-7b runs twice as fast, and Llama2-70b gains 50% to 80% in speed.

zhangyuhanjc avatar Aug 25 '23 03:08 zhangyuhanjc

Yes, ExLlama seems very impressive. Do you know what kind of low-level interface they offer? LMQL requires full distribution access for masking and logprobs.
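For context, a minimal sketch of what "full distribution access" means (this uses a hypothetical HF-style model call, not ExLlama's actual API): per step, the backend must expose logits over the entire vocabulary, so a constraint mask can be applied before sampling and token logprobs can be read off.

import torch

def constrained_step(model, input_ids, allowed_mask):
    # full logits over the whole vocabulary for the last position --
    # this is the "full distribution access" LMQL needs
    logits = model(input_ids).logits[:, -1, :]
    # masking: tokens disallowed by the constraints get -inf
    logits = logits.masked_fill(~allowed_mask, float("-inf"))
    # logprobs of the masked distribution, used for scoring/decoding
    logprobs = torch.log_softmax(logits, dim=-1)
    return torch.argmax(logprobs, dim=-1), logprobs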

lbeurerkellner avatar Aug 25 '23 15:08 lbeurerkellner

> Yes, ExLlama seems very impressive. Do you know what kind of low-level interface they offer? LMQL requires full distribution access for masking and logprobs.

Unfortunately, I couldn't find any official ExLlama guide on how to use logit_bias. But in https://github.com/turboderp/exllama/pull/104, "zsmarty" seems to have found a solution. He wrote a function like this; hope it gives us some ideas:

def generate_token_with_bias(self, prefix, logit_bias, startNewRequest = True):
        self.end_beam_search()

        # feed the prompt (if any) into the generator's cache
        if prefix and len(prefix) > 0:
            ids, mask = self.tokenizer.encode(prefix, return_mask = True, max_seq_len = self.model.config.max_seq_len)

            if startNewRequest:
                self.gen_begin(ids, mask = mask)     # reset the cache and start fresh
            else:
                self.gen_feed_tokens(ids, mask)      # append to the existing context

        # sample one token, with the per-token bias applied to the logits
        token = self.gen_single_token(logit_bias=logit_bias)

        text = self.tokenizer.decode(token)
        return text[0]
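If it works, usage could look like the sketch below (the generator object, vocabulary size, bias shape, and token ids are my assumptions, not from the PR): a bias of -inf on every token except an allowed set turns logit_bias into exactly the kind of mask LMQL applies.

import torch

vocab_size = 32000  # Llama vocabulary size

logit_bias = torch.full((1, vocab_size), float("-inf"))  # block everything
allowed_ids = [3869, 1939]  # illustrative token ids for the allowed continuations
logit_bias[0, allowed_ids] = 0.0  # re-enable only the allowed tokens

# assumed: `generator` is an ExLlama generator exposing the method above
text = generator.generate_token_with_bias("Is ExLlama fast?", logit_bias)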

I am not sure whether it is suitable for LMQL, though.

zhangyuhanjc avatar Aug 25 '23 16:08 zhangyuhanjc

You can load GPTQ models using AutoGPTQ, which uses Exllama v2 by default (if installed).

import lmql

modelpath = "TheBloke/vicuna-7B-v0-GPTQ"
m = lmql.model(modelpath, tokenizer=modelpath, loader="auto-gptq",
               disable_exllamav2=False, use_safetensors=True,
               inject_fused_attention=False, inject_fused_mlp=False)
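For what it's worth, a minimal sketch of querying the model loaded above (the query itself is illustrative; it assumes the standard @lmql.query decorator and TOKENS constraint):

@lmql.query(model=m)
def capital():
    '''lmql
    "The capital of France is [ANSWER]" where len(TOKENS(ANSWER)) < 10
    return ANSWER
    '''

print(capital())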

After some testing, it's pretty finicky and needs tight constraints, but it's hard to tell whether it's the backend or the quant itself.

bleugreen avatar Oct 22 '23 02:10 bleugreen