Support for Constrained decoding
Constrained decoding is standard practice for getting structured outputs from custom fine-tuned LLMs.
Is there a plan to add support for DisjunctiveConstraint (and others) to vLLM in the near future? How would one go about implementing this in vLLM (if I were to add a PR)?
Hi! We very much welcome contributions for this feature! I believe you can add this functionality by modifying the following places:
- Add the related parameters to `SamplingParams`: https://github.com/vllm-project/vllm/blob/bdd6b4c8bc3e5ac93553436514171ffad5926f0c/vllm/sampling_params.py#L5
- Implement the logic in `Sampler`: https://github.com/vllm-project/vllm/blob/bdd6b4c8bc3e5ac93553436514171ffad5926f0c/vllm/model_executor/layers/sampler.py#L15
- To make our OpenAI frontend support this feature, add the related parameters to `CompletionRequest` (https://github.com/vllm-project/vllm/blob/bdd6b4c8bc3e5ac93553436514171ffad5926f0c/vllm/entrypoints/openai/protocol.py#L68) and pass them along when initializing `SamplingParams` here: https://github.com/vllm-project/vllm/blob/bdd6b4c8bc3e5ac93553436514171ffad5926f0c/vllm/entrypoints/openai/api_server.py#L130-L142
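For orientation, here is a minimal sketch of the core masking step a constrained `Sampler` would perform. The function name and the idea of passing an `allowed_token_ids` list through `SamplingParams` are illustrative assumptions, not existing vLLM code:

```python
import torch
from typing import List

def apply_allowed_token_mask(logits: torch.Tensor, allowed_token_ids: List[int]) -> torch.Tensor:
    """Squash every token outside the allowed set to probability zero by
    setting its logit to -inf before sampling. A constrained Sampler step
    would do essentially this, with the ids derived from the constraint."""
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0
    return logits + mask

# Example: only tokens 5, 17 and 42 may be produced at this step.
logits = torch.randn(32_000)  # fake vocabulary-sized logits
constrained = apply_allowed_token_mask(logits, [5, 17, 42])
assert constrained.argmax().item() in (5, 17, 42)
```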
Curious if there's been any progress on this. I've hooked up Microsoft/Guidance and vLLM, but the most powerful features aren't available yet because of missing features in vLLM.
Thank you!
Related to #535
Related topics:
- #1191: Reliable JSON (or other structured data) generation
- Integration with any of these libraries:
- https://github.com/outlines-dev/outlines
- https://github.com/guidance-ai/guidance
- https://github.com/1rgs/jsonformer
I'm going to implement this.
I'd like to help implement this as well.
The constraint may change during generation. For example, in the case of #1191 it depends on what the JSON schema allows for the next token, which in turn depends on where the generation currently is within the schema. In the general case we cannot use the same constraint over the whole sequence. It must also work for beam search. How can we handle that efficiently via a REST API?
I think in the case of the REST API we could allow passing a formal description of the constraint in some generic, de facto standard format (if we can talk about such a thing this early), like guidance's. That would allow "compiling" the constraint inside the server and applying it to all generation for that sequence, including beam search.
In the case of direct vLLM calls (from Python) we could let the user pass a callback to process the logits before the token is chosen, so the probability of any unwanted token can be squashed to zero. That would be efficient and would allow any algorithm to be used. We could then provide adapters for the above-mentioned libraries.
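To make the per-step nature concrete, here is a small sketch of what such a callback could look like. The `(generated_token_ids, logits)` signature and the `TemplateConstraint` class are assumptions for illustration, not an existing vLLM API:

```python
import torch
from typing import List

class TemplateConstraint:
    """Toy per-step constraint: force the first len(prefix_token_ids) generated
    tokens to follow a fixed prefix, then allow anything. A schema-driven
    constraint would work the same way, deriving the allowed set from its
    current parser state instead of a fixed list."""

    def __init__(self, prefix_token_ids: List[int]):
        self.prefix_token_ids = prefix_token_ids

    def __call__(self, generated_token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
        step = len(generated_token_ids)
        if step < len(self.prefix_token_ids):
            # Exactly one token is admissible at this step: mask out the rest.
            mask = torch.full_like(logits, float("-inf"))
            mask[self.prefix_token_ids[step]] = 0.0
            return logits + mask
        return logits  # past the forced prefix: leave the logits untouched
```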
Supporting the outlines library seems to be the best approach, because:
> Outlines is compatible with all models. It only interfaces with models via the next-token logits. It can be used with API-based models as well.
jsonformer, by contrast, is limited to JSON only, and guidance does not have a clear way to integrate (it has spaghetti code).
> In the case of direct vLLM calls (from Python) we could let the user pass a callback to process the logits before the token is chosen, so the probability of any unwanted token can be squashed to zero. That would be efficient and would allow any algorithm to be used. We could then provide adapters for the above-mentioned libraries.
This might be inefficient when generating structured data, for example a format like JSON, where a significant portion of the output consists of predetermined fields and symbols. Manipulating logits at every step would be wasteful because we already know what the majority of tokens are before generation.
A feature of guidance is that it avoids running generation for tokens that are already known. Given that speed and efficiency are important to vLLM, how would we go about implementing something like this when integrating outlines or another framework?
Let's separate the two features:
1. Ability to constrain the generated token (manipulate the logits before the token is chosen)
2. Ability to skip ahead if there is no choice between tokens (the next token is dictated by a schema)
Since these features are largely independent, I suggest implementing them in the above order.
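For the second feature, here is a rough sketch of the skip-ahead logic, assuming a hypothetical `Constraint` interface that can report which token ids are admissible next (nothing here is existing vLLM code):

```python
from typing import List, Protocol

class Constraint(Protocol):
    """Hypothetical interface: report which token ids the schema admits next."""
    def allowed_tokens(self, generated: List[int]) -> List[int]: ...

def skip_ahead(generated: List[int], constraint: Constraint) -> List[int]:
    """While the constraint admits exactly one next token, append it directly
    instead of paying for a forward pass; stop as soon as there is a real choice."""
    forced: List[int] = []
    while True:
        allowed = constraint.allowed_tokens(generated + forced)
        if len(allowed) != 1:
            return forced
        forced.append(allowed[0])
```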
Minimal prototype: #1243
This could be implemented by finishing the LMQL integration.
As I understand it, guidance relies on the logit_bias parameter. Would that PR (#535) be enough?
I haven't tested it yet, but I think this is the way.
+1 for supporting logit_bias so that libraries like guidance can make use of it.
There is a workaround of running the vLLM API server as a mock ChatGPT API and calling it through guidance's OpenAI client, but performance degrades a lot compared with logit_bias-enabled output control.
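For reference, this is the kind of request a guidance-style client sends against an OpenAI-compatible endpoint. Whether vLLM's server actually honors the `logit_bias` field is exactly what is being asked for here, so treat this as the desired behaviour rather than a documented feature; the token ids are made-up, tokenizer-specific examples:

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",   # default vLLM OpenAI-compatible server address
    json={
        "model": "meta-llama/Llama-2-7b-hf",  # whatever model the server was started with
        "prompt": "Answer yes or no:",
        "max_tokens": 1,
        # OpenAI semantics: token id -> additive bias in [-100, 100];
        # +100 effectively forces a token, -100 effectively bans it.
        "logit_bias": {"3869": 100, "1939": -100},
    },
    timeout=30,
)
print(resp.json()["choices"][0]["text"])
```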
> 2. Ability to skip ahead if there is no choice between tokens (the next token is dictated by a schema)

How would you go about building this, given that the sampler only runs after the forward pass? The logits processor is already implemented and merged by @noamgat.
LM Format Enforcer is a library that achieves JSON Schema decoding and supports vLLM. There is already a sample notebook showing the vLLM integration. It currently uses monkeypatching, which will be removed once the next vLLM version with the logits-processing API is released.
(Disclosure: I am the author of the library)
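A rough usage sketch, based on my recollection of the sample notebook; the integration helper names (`build_vllm_token_enforcer_tokenizer_data`, `build_vllm_logits_processor`) and the `logits_processors` sampling parameter may differ between versions, so check the current lm-format-enforcer and vLLM docs:

```python
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.vllm import (
    build_vllm_logits_processor,
    build_vllm_token_enforcer_tokenizer_data,
)
from vllm import LLM, SamplingParams

# JSON schema the output must conform to.
schema = {"type": "object", "properties": {"name": {"type": "string"}}, "required": ["name"]}

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # any locally loadable model
tokenizer_data = build_vllm_token_enforcer_tokenizer_data(llm)
logits_processor = build_vllm_logits_processor(tokenizer_data, JsonSchemaParser(schema))

# The logits processor squashes tokens that would violate the schema.
params = SamplingParams(max_tokens=100, logits_processors=[logits_processor])
outputs = llm.generate("Produce a JSON object describing a person: ", params)
print(outputs[0].outputs[0].text)
```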
@noamgat Thank you very much, it is very useful.
Support via the vLLM REST API would still be great, because it would save the model loading time by using a continuously running server.
See also #1279
Outlines author here. The PR https://github.com/outlines-dev/outlines/pull/366 will allow easy integration into vLLM. Estimated time of completion is next week.
See https://github.com/outlines-dev/outlines/issues/163#issuecomment-1820441504 for a diagram that summarizes the new architecture. We can work together on the integration and finding the boundary that makes the most sense for both libraries.
@rlouf did you manage to make much progress yet?
Yes: https://outlines-dev.github.io/outlines/reference/vllm/
More is coming (soon)!
> Yes: https://outlines-dev.github.io/outlines/reference/vllm/
> More is coming (soon)!
We need a similar solution integrated into vLLM by default.
I would suggest just porting over GBNF, since regex cannot be fully supported (and is also too complex) and JSON Schema is too restrictive for simple use cases.
Outlines' reference implementation of the vLLM server (https://github.com/outlines-dev/outlines/blob/main/outlines/serve/serve.py) is a copy of vLLM's https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py with a few patches and add-ons.
I believe this code should live in vLLM rather than in outlines, and there should be an analogous implementation for the OpenAI endpoint.
@viktor-ferenczi, do you think this is a promising path? I'd be willing to invest time into this.
I think it is something to be decided by the maintainers of the outlines and the vLLM projects.
Currently both projects are changing rapidly and have quite a few bugs, so maybe this is something to decide later as they stabilize.
I'm just a small contributor / user, not a decision maker here.
@viktor-ferenczi, fair enough.
@zhuohan123 and @rlouf, what is your assessment?
I think it would make sense: vLLM benefits from structured generation, and Outlines can refocus on its main goals.
It would be nice to have constrained decoding out of the box, because as it stands I have to fix bugs to get it working with outlines after every single vLLM update, only to see those fixes deleted because of yet another round of changes.
I just read about SGLang's approach to constrained decoding. Did you consider adding that to vLLM instead of Outlines? See for example this blog article: https://lmsys.org/blog/2024-02-05-compressed-fsm/
SGLang's code was copied from Outlines'; they have since decided to import Outlines instead and implemented the change there. See also this blog post, which was published prior to theirs and explains the limits of a character-based approach.
We now support the full range of constrained/guided decoding, powered by Outlines. Closing this as completed.
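For anyone landing here later, a minimal sketch of how the guided decoding can be used through the OpenAI-compatible server. The `guided_json` field is a vLLM extension to the OpenAI schema passed via `extra_body`; the exact set of supported `guided_*` parameters depends on your vLLM version, so check its docs:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# JSON schema the completion must conform to.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

completion = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",   # whatever model the server is running
    messages=[{"role": "user", "content": "Describe a fictional person as JSON."}],
    extra_body={"guided_json": schema},      # constrain the output to the schema
)
print(completion.choices[0].message.content)
```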