
Add support for HF Inference endpoints via text-generation-inference / EasyLLM

Open ggbetz opened this issue 2 years ago • 2 comments

Hi Luca! It's excellent that one can use 🔥LMQL with locally served open models. Now, it would be handy if we could use open LLMs running in the cloud as easily as OpenAI's models. E.g.:

argmax 
    "Say 'this is a test':[RESPONSE]" 
from 
    "hf:meta-llama/Llama-2-70b-chat-hf" 
where 
    len(TOKENS(RESPONSE)) < 10

@philschmid has built EasyLLM, and I wonder whether it might be fairly straightforward to implement this feature with easyllm?

If you think it's worthwhile too, some pointers on where to start and a short plan for implementing the feature would be very welcome.

Cheers, Gregor

ggbetz avatar Aug 21 '23 09:08 ggbetz

Hi Gregor, thanks for the suggestion. Additional backends are always welcome. We recently added first support for replicate.com, which allows you to run open models in the cloud (@charles-dyfis-net is working on this).

After a brief look, EasyLLM seems promising and could be a way to swap OpenAI models for other models while mocking the same interface. However, looking at https://github.com/philschmid/easyllm/blob/main/easyllm/schema/openai.py#L60, EasyLLM seems to lack the most crucial feature LMQL requires to operate at its full feature set, i.e. the logit_bias parameter.

logit_bias is what allows LMQL to guide the model during text generation according to the query program and its constraints. Unfortunately, of the projects that implement OpenAI-like APIs, none I have found so far implements it with the full faithfulness (e.g. streaming, batching + logit bias) required for LMQL compatibility. I admit that keeping up with and achieving a good level of OpenAI API compliance is hard and a moving target, but it is unfortunately a requirement for use with LMQL.
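For context, logit_bias in the OpenAI completion API is a map from token id (as a string) to an additive bias applied to the model's logits before sampling; large biases effectively restrict decoding to a chosen token set, which is how constraint-guided generation works over such an API. A minimal sketch of such a request payload follows — the model name and token ids are purely illustrative, and the helper function is hypothetical, not part of any library:

```python
def build_constrained_request(prompt, allowed_token_ids, max_tokens=10):
    """Build an OpenAI-style completion payload that biases sampling
    toward a fixed set of token ids via logit_bias.
    Illustrative sketch only; token ids below are made up."""
    return {
        "model": "text-davinci-003",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": True,  # streaming lets the client apply constraints per token
        # a strong positive bias (+100) effectively masks out all other tokens
        "logit_bias": {str(t): 100 for t in allowed_token_ids},
    }

req = build_constrained_request("Say 'this is a test':", [1212, 318, 257, 1332])
print(req["logit_bias"])
```

A backend that only accepts this field on non-streaming, single-sequence requests is exactly the kind of partial support that falls short of what LMQL needs.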

Now, since the OpenAI APIs change a lot, deprecate quickly, and are increasingly proprietary and opaque (vendor lock-in), we decided to move away from relying on them too much and instead build our own open protocol for efficient language model streaming. For this, we implement the language model transport protocol (LMTP). All non-OpenAI backends in LMQL have since moved to this protocol, and I would suggest it as a good starting point for building new backends (consider the random model for a simple reference implementation). We could already add simple (non-constraint-enabled) HuggingFace text-generation-inference support, but are still waiting for logit_bias support on their end, cf. https://github.com/huggingface/text-generation-inference/pull/810.
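To make the shape of such a streaming protocol concrete, here is a hypothetical sketch of the message exchange: the client sends one generation request and the server answers with a stream of per-token messages, so the client can evaluate constraints between tokens. All function names and message fields below are illustrative assumptions, not LMQL's actual LMTP wire format:

```python
import json

def make_generate_request(request_id, prompt_ids, max_tokens, logit_bias=None):
    """Client -> server: request generation over token ids, with an
    optional per-token additive bias (illustrative field names)."""
    return json.dumps({
        "type": "generate",
        "id": request_id,
        "prompt": prompt_ids,           # token ids, not raw text
        "max_tokens": max_tokens,
        "logit_bias": logit_bias or {},
    })

def make_token_message(request_id, token_id, logprob, finished=False):
    """Server -> client: one streamed token with its logprob; the
    client can update constraints before the next token arrives."""
    return json.dumps({
        "type": "token",
        "id": request_id,
        "token": token_id,
        "logprob": logprob,
        "finish_reason": "length" if finished else None,
    })

req = json.loads(make_generate_request("r1", [15496, 995], 2, {"50256": -100}))
msg = json.loads(make_token_message("r1", 11, -0.42))
print(req["type"], msg["type"])
```

The key design point, compared to a request/response completion API, is that token-level streaming with logprobs keeps the constraint logic on the client while the server stays a thin decoding loop.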

lbeurerkellner avatar Aug 21 '23 14:08 lbeurerkellner

Thanks for the detailed explanation, Luca. I understand. So let's wait for logit_bias to be implemented in TGI (I just upvoted the PR).

ggbetz avatar Aug 21 '23 15:08 ggbetz