
Integrate vLLM

Open · ChezzPlaya opened this issue 1 year ago · 13 comments

Are you planning to integrate the vLLM package for fast LLM inference and serving?

https://vllm.readthedocs.io/en/latest/

ChezzPlaya avatar Aug 01 '23 08:08 ChezzPlaya

Yes, we definitely want to add a corresponding LMTP backend. However, we will wait until vLLM adds logit_bias support, which is crucial for LMQL's constraining to work (a conceptual sketch of what that means follows after the links below). See also the vLLM GitHub for progress on that:

  • https://github.com/vllm-project/vllm/issues/244#Frontend%20Features
  • https://github.com/vllm-project/vllm/issues/379
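
For reference, a logit bias here is an additive adjustment applied to the model's next-token scores before sampling; setting it to negative infinity for disallowed tokens is what allows LMQL to enforce its constraints during decoding. A minimal conceptual sketch (purely illustrative, not LMQL's actual implementation):

```python
# Purely illustrative: how an additive logit bias enforces a token-level
# constraint at a single decoding step (not LMQL's actual implementation).
import math

def apply_logit_bias(next_token_logits, allowed_token_ids):
    """Set the logit of every disallowed token to -inf so it cannot be sampled."""
    return [
        logit if token_id in allowed_token_ids else -math.inf
        for token_id, logit in enumerate(next_token_logits)
    ]

# Example: the constraint only permits token ids 1 and 4 at this step.
masked = apply_logit_bias([0.1, 2.3, -1.0, 0.7, 1.5], allowed_token_ids={1, 4})
# masked == [-inf, 2.3, -inf, -inf, 1.5]
```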

lbeurerkellner avatar Aug 01 '23 09:08 lbeurerkellner

I know it's on their roadmap, but am I wrong in thinking that this (https://github.com/vllm-project/vllm/blob/acbed3ef40f015fcf64460e629813922fab90380/vllm/model_executor/layers/sampler.py#L94C33-L94C33) is logit bias?

It looks like it's already implemented in there.

benbot avatar Oct 06 '23 03:10 benbot

Any news regarding this? It would really help with non-OpenAI models.

maximegmd avatar Nov 02 '23 19:11 maximegmd

I became aware of LMFE today (from a newsletter from llama-index). It seems similar to LMQL in some ways, although perhaps with a different approach in ways I can't determine from a cursory look. I did notice that they have a way of integrating with vLLM (although it doesn't look like the cleanest approach). Not sure if it is applicable here: https://github.com/noamgat/lm-format-enforcer/blob/main/samples/colab_vllm_integration.ipynb

jhallas avatar Nov 15 '23 00:11 jhallas

vLLM added support for logits processors in 0.2.2, see https://github.com/vllm-project/vllm/pull/1469
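
For illustration, a logits processor in vLLM is a callable that receives the token ids generated so far plus the logits for the next token and returns modified logits, and it is passed in via SamplingParams. A minimal sketch of how a logit bias could be routed through that hook (the bias values and model name below are placeholders):

```python
# Minimal sketch (assumes vLLM >= 0.2.2): applying a logit bias through
# vLLM's logits_processors hook. Bias values and model name are placeholders.
from vllm import LLM, SamplingParams

logit_bias = {42: -float("inf")}  # e.g. forbid token id 42

def bias_processor(generated_token_ids, logits):
    # logits holds the scores for the next token; adjust them in place
    for token_id, bias in logit_bias.items():
        logits[token_id] += bias
    return logits

llm = LLM(model="facebook/opt-125m")  # any vLLM-supported model
params = SamplingParams(max_tokens=16, logits_processors=[bias_processor])
outputs = llm.generate(["Hello, my name is"], params)
```

Note that LMQL would need to recompute the bias at each decoding step from its token masks, rather than fixing it up front as in this sketch.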

giorgiopiatti avatar Nov 25 '23 11:11 giorgiopiatti

Hi! vllm could probably be integrated in different ways, e.g.:

  1. vllm client that connects to a vllm server that's set up independently of lmql (like: lmtp_replicate_client)
  2. vllm backend that's part of lmtp model serving (like: llama_cpp_model)
  3. use lmtp openai interface to access vllm proxy openai server (#250; but logit_bias not ready yet?)

As a user, I'm leaning towards 1, as it seems clean and simple.

What are your preferences?

@lbeurerkellner, what are your thoughts?

(Also pinging: @reuank)

ggbetz avatar Dec 14 '23 07:12 ggbetz

I can see the appeal of 1 to the user (no extra server-side setup and the possibility of using third-party infrastructure), and we can definitely support it. However, 2 is the best option from our perspective, as we have some deep server-side improvements coming up that specifically optimize constrained decoding and LMQL programs, something vLLM is of course not focusing on. Running vLLM in the same process will allow a deeper and better-performing implementation.

The only option I strongly oppose is 3, as I think the OpenAI API is the most limiting for us, and not something I would want to invest further in with respect to the protocol etc.

lbeurerkellner avatar Dec 14 '23 16:12 lbeurerkellner

Thanks, that makes sense to me. Is it correct that implementing option 2 essentially involves the following steps (roughly sketched after the list)?

  • Create vllm_model module in models/lmtp/backend
  • Create class VllmModel(LMTPModel) (in analogy to LlamaCppModel)
  • Implement methods with vllm offline inference
  • Register model class
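
To make the steps above concrete, here is a rough sketch of what such a backend could look like. The class and method names, signatures, and the registration line are assumptions loosely modelled on the llama.cpp backend rather than the actual LMTPModel interface; only the vLLM offline API calls are taken from vLLM itself:

```python
# Hypothetical sketch: names and signatures are assumptions modelled loosely
# on the llama.cpp backend; the real LMTPModel interface may differ.
from vllm import LLM, SamplingParams


class VllmModel:  # a real implementation would subclass LMTPModel
    def __init__(self, model_identifier, **engine_kwargs):
        # vLLM offline engine; paged attention and continuous batching
        # are handled internally by the engine.
        self.llm = LLM(model=model_identifier, **engine_kwargs)

    def generate(self, prompts, max_tokens, temperature=0.0, logit_bias=None):
        processors = []
        if logit_bias:
            def apply_bias(generated_token_ids, logits):
                # route LMQL's token masks through vLLM's logits_processors hook
                for token_id, bias in logit_bias.items():
                    logits[token_id] += bias
                return logits
            processors.append(apply_bias)

        params = SamplingParams(
            temperature=temperature,
            max_tokens=max_tokens,
            logits_processors=processors,
        )
        return self.llm.generate(prompts, params)


# Registration would follow the pattern of the existing backends, e.g.
# something along the lines of: LMTPModel.registry["vllm"] = VllmModel
```

A real backend would operate on token ids and recompute the bias at every decoding step, but the overall structure should be similar.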

ggbetz avatar Dec 15 '23 07:12 ggbetz

That makes sense to me and would provide a clean path forward. Do we get all the benefits of vLLM when it's used in offline inference? Paged attention, for example. I would just use the llama.cpp backend, but in a more GPU-rich environment with many concurrent users, vLLM outperforms it.

wdhitchc avatar Dec 15 '23 18:12 wdhitchc

Do we get all the benefits of vLLM when it's used in offline inference?

Good question. The vLLM docstring explains:

[LLMEngine] is the main class for the vLLM engine. It receives requests from clients and generates texts from the LLM. It includes a tokenizer, a language model (possibly distributed across multiple GPUs), and GPU memory space allocated for intermediate states (aka KV cache). This class utilizes iteration-level scheduling and efficient memory management to maximize the serving throughput.

The LLM class wraps this class for offline batched inference and the AsyncLLMEngine class wraps this class for online serving.

Accordingly, if I'm not mistaken, the answer is yes and we'd get all the benefits with offline inference. (Experts, please correct me if I'm wrong.)
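
To illustrate what offline batched inference looks like, the LLM wrapper accepts a whole batch of prompts in a single call and the engine schedules them internally (the model name below is a placeholder):

```python
# Minimal offline batched inference sketch; the engine batches and schedules
# the prompts internally (model name is a placeholder).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
prompts = ["The capital of France is", "vLLM is", "Paged attention means"]
outputs = llm.generate(prompts, SamplingParams(max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)
```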

ggbetz avatar Dec 16 '23 09:12 ggbetz

As a user, I'd strongly prefer option 3, as that would allow me to seamlessly switch between OpenAI and vLLM in my application. It would also let me run a server for many uses, independent of LMQL, instead of having it all in one monolith.

jbohnslav avatar Dec 22 '23 14:12 jbohnslav

Hey @lbeurerkellner, are you aware of anyone currently working on this? Otherwise, I will have a look at the approach @ggbetz described (adding a new vLLM backend, similar to llama_cpp_model).

reuank avatar Jan 11 '24 13:01 reuank

I am not aware of anyone actively working on this, so feel free to go ahead :)

lbeurerkellner avatar Jan 16 '24 20:01 lbeurerkellner