
[Inference Backend] Enable attention Key/value caching

Open arnaudvl opened this issue 1 year ago • 6 comments

Is there a plan to incorporate key/value caching to significantly improve generation efficiency? See e.g. Guidance's acceleration.

arnaudvl avatar May 17 '23 16:05 arnaudvl

As far as I understand Guidance's approach, the key idea is to only call the LLM to complete template variables rather than re-generating the entire template. LMQL has implemented this form of acceleration since its very first release.

For instance, in the query shown below, only the concrete variable values are actually predicted by the LLM, whereas the surrounding template is inserted automatically by the runtime. The number of LLM calls/forward passes required to run the query corresponds exactly to the number of value tokens completed in the template, not to the length of the template itself.

[Image: example LMQL query with template variables such as ID and DESCRIPTION]
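
For readers without the screenshot, a rough sketch of what such a query could look like, using LMQL's Python integration (the variable names ID and DESCRIPTION follow the discussion here; the model name and stopping constraints are placeholders, not the exact query from the image):

```python
# Hypothetical reconstruction of a query of this shape (not the exact
# query from the screenshot); model and constraints are placeholders.
import lmql

@lmql.query
async def describe_item():
    '''
    argmax
        "ID: [ID]\n"
        "Description: [DESCRIPTION]\n"
    from
        "openai/text-davinci-003"
    where
        STOPS_AT(ID, "\n") and STOPS_AT(DESCRIPTION, "\n")
    '''
```

Only the tokens for ID and DESCRIPTION are generated by the model; the surrounding literals ("ID: ", "Description: ") are appended by the runtime without additional forward passes.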

What is currently not possible is providing default values to avoid re-generating already known values, e.g. if you already have a value for DESCRIPTION in the query above. However, we plan to add this in a future release, as it has been brought up repeatedly by the community.

Does this answer your question? What aspect of Guidance's key/value caching would you be looking for in LMQL? We are very interested in feedback and ideas surrounding this.

lbeurerkellner avatar May 18 '23 09:05 lbeurerkellner

Hi @lbeurerkellner, thanks for the quick response. I am referring to caching the LLM's key/value attention pairs for sequential variable value generation. For instance, in the example above, the LLM first populates ID. In the next call it generates DESCRIPTION. At that point the LLM's key/value pairs for the template up to and including ID have already been computed and should not have to be recomputed. The longer the template to be filled in, the more important (faster + cheaper) this caching becomes. For production use cases, this is a very significant benefit. Check this post for more detail. Of course (for now) this is only possible for self-hosted models, not for external APIs such as the OpenAI models.
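
To make the request concrete, here is a minimal sketch of KV-cache reuse at the transformer level with HuggingFace transformers (the model name and prompt strings are placeholders, and this is not LMQL's actual implementation):

```python
# Sketch of reusing the attention key/value cache across sequential
# variable completions (model and prompt strings are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# First pass: run the template prefix up to the first variable (ID) once
# and keep the key/value pairs computed for it.
prefix = tokenizer("ID: ", return_tensors="pt")
with torch.no_grad():
    out = model(input_ids=prefix.input_ids, use_cache=True)
past_key_values = out.past_key_values

# Second pass: only the newly appended tokens (the completed ID plus the
# template text before DESCRIPTION) are fed through the model; the cached
# key/value pairs for the earlier prefix are reused instead of recomputed.
continuation = tokenizer("42\nDescription:", return_tensors="pt")
with torch.no_grad():
    out = model(
        input_ids=continuation.input_ids,
        past_key_values=past_key_values,
        use_cache=True,
    )
past_key_values = out.past_key_values  # now covers everything seen so far
```

Without the `past_key_values` argument, the second call would re-encode the whole prompt from scratch; with it, only the new tokens are processed, which is why the saving grows with template length.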

arnaudvl avatar May 18 '23 11:05 arnaudvl

I see, thanks for the article, this clarifies it for me. I was primed on the wrong abstraction level, not thinking of transformer internals.

Yes, this is definitely on the list of things we plan to do. All in all, a much deeper integration on the inference side (beyond just token masking) is possible and something we are working on. HuggingFace already provides a lot of the infrastructure for this, but we are also exploring other options, especially Python-independent inference backends.

Multi-part prompts in general should probably be more transparent to the inference side, to enable cross-call optimizations. I am also happy to hear your thoughts on what other optimizations might be interesting from a production perspective.

lbeurerkellner avatar May 18 '23 13:05 lbeurerkellner

The PagedAttention mechanism from vLLM may help: https://github.com/vllm-project/vllm/blob/49b26e2cec8c56594668905e853fe4af34336b05/vllm/model_executor/layers/attention.py#L16
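
For reference, a minimal vLLM usage sketch (model name and prompt are placeholders): with PagedAttention, KV-cache blocks are allocated and reused by the engine internally, so the caller never handles past key/values explicitly.

```python
# Minimal vLLM sketch (placeholder model and prompt); PagedAttention
# manages the KV-cache blocks for all queued requests internally.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.0, max_tokens=32)

outputs = llm.generate(["ID: 42\nDescription:"], params)
print(outputs[0].outputs[0].text)
```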

doxav avatar Jul 02 '23 22:07 doxav

Any progress on this?

kongjiellx avatar Oct 30 '23 01:10 kongjiellx

There is a proof-of-concept implementation on a feature branch, but making it work with batching, padding, and multi-part prompting still requires some work. It may be worth factoring out support for non-batched KV caching for now.

lbeurerkellner avatar Nov 02 '23 11:11 lbeurerkellner