Breno Faria
Let me summarize the issue raised by @saattrupdan:

* when `n > 1`, all batches share **one** logits processor, e.g. in a chat completion request https://github.com/vllm-project/vllm/blob/4238bc82f24d5887784b04a353ed93e2360623b4/vllm/entrypoints/openai/serving_chat.py#L168
* the logits processor...
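For intuition, here is a toy sketch (a hypothetical class, not vLLM's actual code) of why one shared stateful processor misbehaves when `n > 1`; giving each sequence its own instance avoids this:

```python
# Toy stand-in for a guided-decoding logits processor that tracks FSM state.
class StatefulProcessor:
    def __init__(self):
        self.fsm_state = 0  # a single state slot shared by every caller

    def __call__(self, token_id):
        self.fsm_state += 1  # stands in for advancing the FSM by one token
        return self.fsm_state

shared = StatefulProcessor()
seq_a = [shared(t) for t in (11, 12)]  # expects FSM states 1, 2 -> gets 1, 2
seq_b = [shared(t) for t in (21, 22)]  # also expects 1, 2 -> gets 3, 4
print(seq_a, seq_b)  # sequence B resumes from sequence A's state
```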
@njhill I like the direction of your proposal. This would allow us to invert control and get rid of the 10 lines setting up guided decoding in the `create_chat_completion` method....
PS: this PR is ready for review, @simon-mo. I'm just waiting for outlines to be released so that we can get rid of the regression in the tests.
@maxdebayser thanks for looking into it!

> we also found a problem with the FSM state being shared between sequences

I have just looked into Outlines and while [`RegexGuide`](https://github.com/outlines-dev/outlines/blob/95f108e0824b8135c270087d4d09e25290efe619/outlines/fsm/guide.py#L135) is...
I have pushed a commit that comes close to what was there before while no longer crashing when `n > 1`. It's still not...
Please also consider the use case of asymmetric embedding models (e.g. https://huggingface.co/intfloat/multilingual-e5-small). These models require the text being embedded to be prefixed with "signal strings" that give the model the...
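For illustration, a minimal sketch using `sentence-transformers` (the `query: ` / `passage: ` prefixes are the ones documented on the e5 model card):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")

# Asymmetric task: queries and passages get different signal-string prefixes.
query_emb = model.encode("query: how do I enable guided decoding in vLLM?")
passage_emb = model.encode(
    "passage: vLLM supports guided decoding via logits processors.")
```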
I strongly believe this is an issue with the state-machine cache that was fixed with this PR: https://github.com/outlines-dev/outlines/pull/911 @brandonwillard, what do you think?
@saattrupdan the logits processors provided in `outlines.integrations.vllm` can be replaced with those in https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/guided_decoding/outlines_logits_processors.py.
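A minimal sketch of the swap, assuming a recent vLLM build (constructor signatures may differ between versions, so treat this as illustrative):

```python
from vllm import LLM, SamplingParams
from vllm.model_executor.guided_decoding.outlines_logits_processors import (
    RegexLogitsProcessor)

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # any model works here
tokenizer = llm.get_tokenizer()

# Constrain the output to an ISO date, as the outlines integration used to do.
date_processor = RegexLogitsProcessor(r"\d{4}-\d{2}-\d{2}", tokenizer)
params = SamplingParams(max_tokens=16, logits_processors=[date_processor])
outputs = llm.generate(["Today's date is "], params)
```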
I think so, yes.
ff-tokens are fast-forward tokens. When you are generating guided output, e.g. a JSON object, there are moments when you don't really need the LLM to generate the next tokens, because...
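To make this concrete, here is a toy sketch (the `guide` API here is hypothetical, not Outlines' actual interface): whenever the grammar leaves exactly one legal next token, we can append it directly and skip the forward pass.

```python
# Hypothetical guide API: get_allowed_tokens(state) returns the set of token
# ids the grammar permits at the current FSM state; advance(state, token)
# returns the next state.
def fast_forward(guide, state, emitted):
    allowed = guide.get_allowed_tokens(state)
    # While only one token is legal, emit it without running the model.
    while len(allowed) == 1:
        token = next(iter(allowed))
        emitted.append(token)
        state = guide.advance(state, token)
        allowed = guide.get_allowed_tokens(state)
    return state  # more than one option left: let the LLM sample again
```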