Integrate vLLM
Are you planning to integrate the vLLM package for fast LLM inference and serving?
https://vllm.readthedocs.io/en/latest/
Yes, we definitely want to add a corresponding LMTP backend. However, we will wait until vLLM adds logit_bias support, which is crucial to make LMQL's constraining work. See also the vLLM GH for progress on that:
- https://github.com/vllm-project/vllm/issues/244#Frontend%20Features
- https://github.com/vllm-project/vllm/issues/379
I know it's on their roadmap, but am I wrong in thinking that this (https://github.com/vllm-project/vllm/blob/acbed3ef40f015fcf64460e629813922fab90380/vllm/model_executor/layers/sampler.py#L94C33-L94C33) is logit bias?
It looks like it's already implemented in there.
Any news regarding this? It would really help with non-OpenAI models.
I became aware of LMFE today (from the llama-index newsletter). It seems similar to LMQL in some ways, though perhaps with a different approach in ways I can't determine from a cursory look. I did notice that they have a way of integrating with vLLM (although it doesn't look like the cleanest approach). Not sure if applicable here: https://github.com/noamgat/lm-format-enforcer/blob/main/samples/colab_vllm_integration.ipynb
vLLM added support for logits processors in 0.2.2, see https://github.com/vllm-project/vllm/pull/1469
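For anyone who wants to experiment before a proper backend lands: that hook is already enough to emulate an OpenAI-style logit_bias. A minimal sketch (the bias values and model name are placeholders, and this is not LMQL's actual constraint machinery):

```python
# Minimal sketch: emulate OpenAI-style logit_bias via vLLM's logits_processors
# hook (added in v0.2.2 via PR #1469). Bias values and model are placeholders.
from typing import List

import torch
from vllm import LLM, SamplingParams

logit_bias = {42: 5.0, 13: -100.0}  # token id -> additive bias

def bias_processor(token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
    # called once per decoding step; token_ids are the tokens generated so far
    for tid, bias in logit_bias.items():
        logits[tid] += bias
    return logits

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(max_tokens=16, logits_processors=[bias_processor])
outputs = llm.generate(["The answer is"], params)
print(outputs[0].outputs[0].text)
```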
Hi! vLLM could probably be integrated in different ways, e.g.:
1. a vLLM client that connects to a vLLM server set up independently of LMQL (like lmtp_replicate_client)
2. a vLLM backend that's part of LMTP model serving (like llama_cpp_model)
3. using the LMTP OpenAI interface to access vLLM's OpenAI-compatible server (#250; but logit_bias not ready yet?)
As a user, I'm leaning towards option 1, as it's clean and simple.
What are your preferences?
@lbeurerkellner , what are your thoughts?
(Also pinging: @reuank)
I can see the appeal of option 1 for users (no extra server-side setup and the possibility of using third-party infrastructure), and we can definitely support it. However, option 2 is the best option from our perspective: we have some deep server-side improvements coming up that specifically optimize constrained decoding and LMQL programs, something vLLM is of course not focusing on. Running vLLM in the same process will allow a deeper and better-performing implementation.
I only strongly oppose option 3, as I think the OpenAI API is the most limiting for us, and not something I would want to invest further in with respect to the protocol etc.
Thanks, that makes sense to me. Is it correct that implementing option 2 essentially involves the following steps?
- Create a vllm_model module in models/lmtp/backend
- Create class VllmModel(LMTPModel) (in analogy to LlamaCppModel)
- Implement the methods with vLLM offline inference
- Register the model class
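In case it helps, here's a very rough sketch of what that could look like. The LMTPModel side (import path, generate signature, registry name) is assumed by analogy with the llama.cpp backend and is not a verified interface; only the vLLM calls (LLM, SamplingParams, generate) are the documented offline API:

```python
# Rough sketch only: the LMTPModel import path, method signature and registry
# usage below are assumptions modeled on the llama.cpp backend.
from vllm import LLM, SamplingParams

from lmql.models.lmtp.backends.lmtp_model import LMTPModel  # assumed path

class VllmModel(LMTPModel):
    def __init__(self, model_identifier, **kwargs):
        # model_identifier with a hypothetical "vllm:" prefix already stripped
        self.llm = LLM(model=model_identifier, **kwargs)

    def generate(self, input_ids, attention_mask, temperature, max_new_tokens,
                 bias_tensor, streamer):
        # translate LMQL's token bias into a vLLM logits processor
        def apply_bias(token_ids, logits):
            return logits if bias_tensor is None else logits + bias_tensor

        params = SamplingParams(temperature=temperature,
                                max_tokens=max_new_tokens,
                                logits_processors=[apply_bias])
        outputs = self.llm.generate(prompt_token_ids=[list(input_ids)],
                                    sampling_params=params)
        # TODO: convert vLLM's RequestOutput objects into the streaming/result
        # format the LMTP server expects (omitted here)
        return outputs

# register under a hypothetical "vllm" backend name
LMTPModel.registry["vllm"] = VllmModel
```

One thing to verify is whether a per-request logits processor is flexible enough for LMQL's bias, which changes at every decoding step; the processor does receive the tokens generated so far, so a stateful callback might be needed.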
That makes sense to me and would provide a clean path forward. Do we get all the benefits of vLLM when it's used in offline inference? Paged attention, for example. I would just use the llama.cpp backend, but in a GPU-rich environment with many concurrent users vLLM outperforms it.
Do we get all the benefits of VLLM when its used in offline inference?
Good question. The vLLM docstring explains:
[LLMEngine] is the main class for the vLLM engine. It receives requests from clients and generates texts from the LLM. It includes a tokenizer, a language model (possibly distributed across multiple GPUs), and GPU memory space allocated for intermediate states (aka KV cache). This class utilizes iteration-level scheduling and efficient memory management to maximize the serving throughput.
The LLM class wraps this class for offline batched inference and the AsyncLLMEngine class wraps this class for online serving.
Accordingly, if I'm not mistaken, the answer is yes and we'd get all the benefits with offline inference. (Experts, please correct me if I'm wrong.)
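For reference, offline batched inference through the LLM wrapper looks like this (placeholder model name); since it drives the same LLMEngine underneath, PagedAttention and the KV-cache management should apply here too:

```python
from vllm import LLM, SamplingParams

# LLM wraps LLMEngine for offline batched inference, so the engine-level
# optimizations (PagedAttention, continuous batching) are in effect as well.
llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["What is LMQL?", "What is vLLM?"], params)
for out in outputs:
    print(out.outputs[0].text)
```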
As a user, I'd strongly prefer option 3, as that would allow me to seamlessly switch between OpenAI and vLLM in my application. It would also let me run a server for many uses, independent of LMQL, instead of having it all in one monolith.
Hey @lbeurerkellner, are you aware of anyone currently working on this? Otherwise, I will have a look at the approach @ggbetz described (adding a new vLLM backend, similar to llama_cpp_model).
I am not aware of anyone actively working on this, so feel free to go ahead :)