
Support for Constrained decoding

ojus1 opened this issue on Jun 28 '23

For getting structured outputs from custom-finetuned LLMs, extensive use of constrained decoding is standard.

Is there a plan to add support for DisjunctiveConstraint (and others) to vLLM in the near future? How would one go about implementing this in vLLM (if I were to add a PR)?

ojus1 avatar Jun 28 '23 09:06 ojus1

Hi! We very much welcome contributions for this feature! I believe you can add this functionality by modifying the following places (a rough sketch follows the list):

  1. Add the related parameters to SamplingParams https://github.com/vllm-project/vllm/blob/bdd6b4c8bc3e5ac93553436514171ffad5926f0c/vllm/sampling_params.py#L5
  2. Implement the logic in Sampler https://github.com/vllm-project/vllm/blob/bdd6b4c8bc3e5ac93553436514171ffad5926f0c/vllm/model_executor/layers/sampler.py#L15
  3. To make our OpenAI frontend support this feature, add the related parameters to CompletionRequest https://github.com/vllm-project/vllm/blob/bdd6b4c8bc3e5ac93553436514171ffad5926f0c/vllm/entrypoints/openai/protocol.py#L68 and pass them through when initializing SamplingParams here: https://github.com/vllm-project/vllm/blob/bdd6b4c8bc3e5ac93553436514171ffad5926f0c/vllm/entrypoints/openai/api_server.py#L130-L142
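
A rough sketch of what steps 1 and 2 could look like (all names here are hypothetical, not the final vLLM API):

```python
# Sketch only: field and function names are hypothetical, not the final vLLM API.
from typing import Callable, List, Optional

import torch

# A constraint callback sees the token ids generated so far and the raw logits,
# and returns (possibly modified) logits.
ConstraintFn = Callable[[List[int], torch.Tensor], torch.Tensor]

class SamplingParams:
    def __init__(self, temperature: float = 1.0,
                 constraints: Optional[List[ConstraintFn]] = None):
        self.temperature = temperature
        self.constraints = constraints or []

# Inside the Sampler, the callbacks would be applied right before sampling.
def apply_constraints(params: SamplingParams,
                      generated_ids: List[int],
                      logits: torch.Tensor) -> torch.Tensor:
    for fn in params.constraints:
        logits = fn(generated_ids, logits)
    return logits
```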

zhuohan123 avatar Jun 28 '23 15:06 zhuohan123

Curious if there's been any progress with this. I've hooked up Microsoft/Guidance and vLLM but the most powerful features aren't yet available because of missing features in vLLM.

Thank you!

zacharyblank avatar Jul 16 '23 20:07 zacharyblank

Related to #535

viktor-ferenczi avatar Sep 29 '23 06:09 viktor-ferenczi

Related topics:

  • #1191: Reliable JSON (or other structured data) generation
  • Integration with any of these libraries:
    • https://github.com/outlines-dev/outlines
    • https://github.com/guidance-ai/guidance
    • https://github.com/1rgs/jsonformer

viktor-ferenczi avatar Sep 30 '23 07:09 viktor-ferenczi

I'm going to implement this.

viktor-ferenczi avatar Sep 30 '23 16:09 viktor-ferenczi

I'd like to help implement this as well.

MaxZabarka avatar Sep 30 '23 16:09 MaxZabarka

The constraint may change during generation. For example, in the case of #1191 it depends on what the JSON schema allows for the next token, which in turn depends on where in the schema the generation currently is. We cannot use the same constraint over the whole sequence in the general case. It must also work with beam search. How can we handle that efficiently via a REST API?

viktor-ferenczi avatar Sep 30 '23 16:09 viktor-ferenczi

I think for the REST API we could allow passing a formal description of the constraint in some generic, de facto standard format (if we can talk about one this soon), such as guidance. That would allow "compiling" the constraint inside the server and applying it to all generation for that sequence, including beam search.

For direct vLLM calls (from Python) we could let the user pass a callback that processes the logits before the token is chosen, so the probability of any unwanted token can be squashed to zero. That would be efficient and allow any algorithm. We could then provide adapters for the above-mentioned libraries.
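
A minimal sketch of such a callback, assuming it receives the generated token ids and the logits tensor:

```python
import torch

def allow_only(allowed_token_ids):
    """Build a logits callback that squashes every token outside
    allowed_token_ids to -inf, so it can never be chosen."""
    def callback(generated_token_ids, logits: torch.Tensor) -> torch.Tensor:
        mask = torch.full_like(logits, float("-inf"))
        mask[list(allowed_token_ids)] = 0.0
        return logits + mask
    return callback

# Example: restrict the next token to two (arbitrary) token ids.
process_logits = allow_only([42, 1337])
```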

viktor-ferenczi avatar Sep 30 '23 16:09 viktor-ferenczi

Supporting the outlines library seems to be the best approach, because:

Outlines is compatible with all models. It only interfaces with models via the next-token logits. It can be used with API-based models as well.

By contrast, jsonformer is limited to JSON only, and guidance does not have a clear way to integrate (it has spaghetti code).

viktor-ferenczi avatar Sep 30 '23 17:09 viktor-ferenczi

For direct vLLM calls (from Python) we could let the user pass a callback that processes the logits before the token is chosen, so the probability of any unwanted token can be squashed to zero. That would be efficient and allow any algorithm. We could then provide adapters for the above-mentioned libraries.

This might be inefficient when generating structured data, for example a format like JSON, where a significant portion of the output consists of predetermined fields and symbols. Running the model and only then manipulating the logits would be wasteful, because we would already know what the majority of tokens are before generation.

A feature of guidance is that it avoids running generation for tokens that are already known. Given that speed and efficiency are important to vLLM, how would we go about implementing something like this when integrating outlines or another framework?

MaxZabarka avatar Sep 30 '23 18:09 MaxZabarka

Let's separate the two features:

  1. Ability to constrain the token generated (manipulate logits before the token is chosen)
  2. Ability to skip ahead if there is no choice between tokens (next token is dictated by a schema)

Since these features are largely independent I suggest implementing them in the above order.
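
For feature 2, the idea would roughly be: whenever the constraint leaves exactly one legal next token, append it directly instead of paying for a forward pass. A toy sketch (both helpers passed in are hypothetical):

```python
def generate_with_skip_ahead(model_step, next_allowed_tokens, token_ids, max_new_tokens):
    """Toy loop illustrating "skip ahead": tokens fully dictated by the schema
    are appended without running the model.

    model_step(token_ids, allowed)   -> one sampled token id from `allowed`
    next_allowed_tokens(token_ids)   -> set of token ids the schema allows next
    """
    for _ in range(max_new_tokens):
        allowed = next_allowed_tokens(token_ids)
        if len(allowed) == 1:
            # No real choice: skip the forward pass entirely.
            token_ids.append(next(iter(allowed)))
        else:
            # Regular constrained generation step.
            token_ids.append(model_step(token_ids, allowed))
    return token_ids
```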

viktor-ferenczi avatar Oct 01 '23 09:10 viktor-ferenczi

Minimal prototype: #1243

viktor-ferenczi avatar Oct 01 '23 17:10 viktor-ferenczi

This could be implemented by finishing LMQL integration.

viktor-ferenczi avatar Oct 17 '23 01:10 viktor-ferenczi

As I understand it, guidance relies on the logit_bias parameter to work. Would this PR be enough? #535

I haven't tested it yet, but I think this is the way.
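
For reference, logit_bias in the OpenAI-style completion API is a map from token id to an additive bias, and large positive or negative values effectively force or ban tokens. A sketch of such a request (endpoint, model name and token ids are placeholders):

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "my-model",
        "prompt": "Answer yes or no:",
        "max_tokens": 1,
        # Token ids below are placeholders; biases around +/-100 roughly force/ban a token.
        "logit_bias": {"3763": 100, "645": 100},
    },
)
print(response.json())
```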

Vokturz avatar Oct 19 '23 15:10 Vokturz

+1 to supporting logit_bias and allowing libraries like guidance to use it. There is a workaround of using the vLLM API server to mock the ChatGPT API and calling it with guidance's OpenAI client, but performance degrades a lot compared with logit_bias-enabled output control.

nullpointer0xffff avatar Oct 27 '23 18:10 nullpointer0xffff

2. Ability to skip ahead if there is no choice between tokens (next token is dictated by a schema)

How would you think about building this, given that the sampler only runs after the forward pass? The logits processing support is already implemented and merged by @noamgat.

flexorRegev avatar Nov 07 '23 22:11 flexorRegev

LM Format Enforcer is a library that achieves JSON Schema decoding and supports vLLM. There is already a sample notebook showing vLLM integration. It currently uses monkeypatching, which will be removed when the next vLLM version with the logits processing API is released.
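
For anyone trying this once that release is out: recent vLLM versions accept logits processors on SamplingParams, so an enforcer can plug in roughly like this (the processor body below is a placeholder, not LM Format Enforcer's actual code):

```python
import torch
from vllm import LLM, SamplingParams

def format_enforcing_processor(token_ids, logits: torch.Tensor) -> torch.Tensor:
    # Placeholder: a real enforcer would consult its JSON-schema/FSM state here
    # and set the logits of disallowed tokens to -inf.
    return logits

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=64, logits_processors=[format_enforcing_processor])
outputs = llm.generate(["Generate a JSON object:"], params)
print(outputs[0].outputs[0].text)
```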

(Disclosure: I am the author of the library)

noamgat avatar Nov 08 '23 06:11 noamgat

@noamgat Thank you very much, it is very useful.

Support via the vLLM REST API would still be great, because it would save the model loading time by using a continuously running server.

See also #1279

viktor-ferenczi avatar Nov 09 '23 11:11 viktor-ferenczi

Outlines author here. The PR https://github.com/outlines-dev/outlines/pull/366 will allow easy integration into vLLM. Estimated time of completion is next week.

See https://github.com/outlines-dev/outlines/issues/163#issuecomment-1820441504 for a diagram that summarizes the new architecture. We can work together on the integration and finding the boundary that makes the most sense for both libraries.

rlouf avatar Nov 24 '23 18:11 rlouf

@rlouf did you manage to make much progress yet?

pj-ml avatar Jan 09 '24 13:01 pj-ml

Yes: https://outlines-dev.github.io/outlines/reference/vllm/

More is coming (soon)!

rlouf avatar Jan 09 '24 13:01 rlouf

Yes: https://outlines-dev.github.io/outlines/reference/vllm/

More is coming (soon)!

We need a similar solution integrated into vLLM by default.

I would suggest just porting over GBNF, since regex cannot be fully supported (and is too complex anyway) and JSON Schema is too restrictive for simple use cases.

viktor-ferenczi avatar Jan 11 '24 07:01 viktor-ferenczi

Outlines' reference implementation of the vLLM server (https://github.com/outlines-dev/outlines/blob/main/outlines/serve/serve.py) is a copy of vLLM's https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py with a few patches and add-ons.

I believe this code should live in vLLM rather than in Outlines, and there should be an analogous implementation for the OpenAI endpoint.

@viktor-ferenczi, do you think this is a promising path? I'd be willing to invest time into this.

br3no avatar Feb 05 '24 10:02 br3no

I think that is something to be decided by the maintainers of the Outlines and vLLM projects.

Currently both projects are changing rapidly and have quite a few bugs, so maybe this is something to decide later as they stabilize.

I'm just a small contributor / user, not a decision maker here.

viktor-ferenczi avatar Feb 06 '24 15:02 viktor-ferenczi

@viktor-ferenczi, fair enough.

@zhuohan123 and @rlouf, what is your assessment?

br3no avatar Feb 07 '24 08:02 br3no

I think it would make sense: vLLM benefits from structured generation, and Outlines can refocus on its main goals.

rlouf avatar Feb 07 '24 11:02 rlouf

It would be nice to have constrained decoding out of the box, because as things stand I have to fix bugs to get it working with Outlines after every single vLLM update, only to see those fixes invalidated by yet another round of changes.

viktor-ferenczi avatar Feb 08 '24 20:02 viktor-ferenczi

I just read about SGLang's approach to constrained decoding. Did you consider adding that to vLLM instead of Outlines? See for example this blog article: https://lmsys.org/blog/2024-02-05-compressed-fsm/

scriptator avatar Feb 09 '24 10:02 scriptator

SGLang's code was copied from Outlines'; they have since decided to import Outlines instead and implemented that change. See also this blog post, which was published prior to theirs and explains the limits of a character-based approach.

rlouf avatar Feb 09 '24 12:02 rlouf

We now support the full range of constrained/guided decoding, powered by Outlines. Closing this as completed.
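
For readers landing here later: the OpenAI-compatible server exposes this through extra request fields such as guided_json / guided_regex / guided_choice (a sketch only; check the current docs for exact parameter names and supported formats):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# guided_json constrains the completion to match a JSON schema.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

completion = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Give me a person as JSON."}],
    extra_body={"guided_json": schema},
)
print(completion.choices[0].message.content)
```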

simon-mo avatar Mar 19 '24 22:03 simon-mo