text-generation-inference
Guidance acceleration
Feature request
Guidance can constrain the format of the generated output, which would be a nice feature to have built in.
- Add an extra parameter to the `/generate` and `/generate_stream` protocol to specify a template (see the request sketch below)
- Use a guidance-like mask to alternate between pre-filling known tokens from the template and generating the remaining tokens
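A rough sketch of what such a request could look like, assuming a new `guidance_template` field under `parameters`; the field name and template syntax here are purely illustrative, not part of the current API:

```python
# Hypothetical request against an extended /generate endpoint.
# "guidance_template" is an assumed parameter name for this proposal.
import requests

payload = {
    "inputs": "Generate a character profile.",
    "parameters": {
        "max_new_tokens": 64,
        # Fixed spans would be pre-filled into the KV cache without sampling;
        # only the {gen ...} / {select ...} slots would be decoded by the model.
        "guidance_template": (
            '{"name": "{gen name}", "age": {gen age, regex=[0-9]+}, '
            '"class": "{select class, options=[warrior, mage, rogue]}"}'
        ),
    },
}

resp = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
print(resp.json())
```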
Motivation
See https://github.com/microsoft/guidance#guidance-acceleration-notebook
Your contribution
Just an idea for now
For now, wouldn't it be better to add text-generation-inference as a supported LLM in Guidance, the same way it already supports OpenAI's models?
Right, that's possible, but built-in support would "significantly improve inference performance".
When multiple generation or LLM-directed control flow statements are used in a single Guidance program then we can significantly improve inference performance by optimally reusing the Key/Value caches as we progress through the prompt. This means Guidance only asks the LLM to generate the green text below, not the entire program. This cuts this prompt's runtime in half vs. a standard generation approach.
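A minimal sketch of that alternation, with a stub model standing in for the actual decoding loop (the real server would operate on token ids and carry the attention Key/Value cache across segments instead of a plain string):

```python
# Alternating prefill/generate loop: known template text is force-fed into the
# cache, and only the template's slots are actually sampled by the model.
from dataclasses import dataclass


@dataclass
class StubModel:
    # Stands in for the LLM; "cache" plays the role of the KV cache.
    cache: str = ""

    def prefill(self, text: str) -> None:
        # Known text is appended to the cache without sampling any tokens.
        self.cache += text

    def generate(self, stop: str) -> str:
        # Placeholder for constrained decoding up to a stop string.
        out = "<generated>"
        self.cache += out + stop
        return out


def run_template(segments):
    """segments: list of ("text", fixed_str) or ("gen", stop_str) pairs."""
    model = StubModel()
    filled = []
    for kind, value in segments:
        if kind == "text":
            model.prefill(value)                        # pre-fill known tokens
            filled.append(value)
        else:
            filled.append(model.generate(stop=value))   # sample only this slot
            filled.append(value)
    return "".join(filled)


print(run_template([
    ("text", '{"name": "'),
    ("gen", '", '),
    ("text", '"class": "'),
    ("gen", '"}'),
]))
```

Because the fixed spans are never sampled, the model only pays decode cost for the slots, which is where the claimed runtime savings come from.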
Similar feature in another text generation project: https://github.com/ggerganov/llama.cpp/pull/1773
See also https://github.com/go-skynet/LocalAI/issues/354
I wonder whether it could still stream responses while processing batches.
This is one of our priorities for the next release.
Do you happen to have any updates on this matter? Coming from LMQL.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.