Feature Request: `past_key_values` landed in `transformers`, and could speed up generations
Transformers can now return past_key_values
which can be used to speed up future calls:
https://github.com/huggingface/transformers/pull/25086
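For reference, a minimal sketch of how the returned cache can be reused at the model level (model name and prompt strings are placeholders, nothing LMQL-specific):

```python
# Minimal sketch: run a shared prefix once, keep past_key_values, and reuse
# it for a later call so only the new tokens are computed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# First call: encode the common prefix and keep the KV cache.
prefix = tokenizer("A common prompt prefix", return_tensors="pt")
with torch.no_grad():
    out = model(**prefix, use_cache=True)
past = out.past_key_values

# Later call: feed only the new tokens, reusing the cached prefix.
suffix = tokenizer(" and a continuation", return_tensors="pt", add_special_tokens=False)
attention_mask = torch.cat([prefix.attention_mask, suffix.attention_mask], dim=-1)
with torch.no_grad():
    out = model(
        input_ids=suffix.input_ids,
        attention_mask=attention_mask,
        past_key_values=past,
        use_cache=True,
    )
```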
From the docs I had the impression that LMQL caches previous runs to speed up future runs when the query text shares a common prefix. However, when I timed it, performance was the same whether I built a query function once and reused it, or rebuilt it for each separate query.
With this change, transformers should make this speed-up fairly easy to add!
For the implementation, you can refer to Guidance's external KV-cache support in its generate method for the transformers backend: https://github.com/guidance-ai/guidance/blob/302a240b35b51a8626bfb7f8b9beb28fc6359bf4/guidance/llms/_transformers.py#L126. In fact, that code could be copied with minimal modifications.

However, note that LMQL supports asynchronous batched computation, which means the cached generation progress behind each query in a batch may differ. With a straightforward KV cache, batching is not possible: generation has to run serially with batch size 1. Supporting both batching and a KV cache in LMQL is possible, but very cumbersome: each generate call would need to reassemble the KV caches, reconcile each sequence's current progress, and handle padding for the batch, which involves both left and right shifts. This is error-prone and hard to abstract into a common solution that supports all backends. I suspect this is why the author has held off on KV-cache support so far.
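For concreteness, here is a hedged sketch of that guidance-style idea for batch size 1: keep the cache from the previous call and reuse it only up to the longest shared token prefix. Helper names are illustrative, not guidance or LMQL internals:

```python
# Hedged sketch of guidance-style KV-cache reuse for batch size 1.
# `cached_ids` / `past_key_values` come from the previous call; the helper
# names here are illustrative, not guidance or LMQL internals.
import torch

def shared_prefix_len(cached_ids: torch.Tensor, new_ids: torch.Tensor) -> int:
    """Length of the common token prefix (assumes batch size 1)."""
    n = min(cached_ids.shape[-1], new_ids.shape[-1])
    mismatch = (cached_ids[0, :n] != new_ids[0, :n]).nonzero()
    return int(mismatch[0]) if len(mismatch) > 0 else n

def forward_with_cache(model, new_ids, cached_ids, past_key_values):
    """Reuse the cache for the shared prefix, recompute only the tail."""
    keep = 0
    if past_key_values is not None and cached_ids is not None:
        # Keep at least one token to recompute so the forward pass is non-empty.
        keep = min(shared_prefix_len(cached_ids, new_ids), new_ids.shape[-1] - 1)
    if keep == 0:
        out = model(input_ids=new_ids, use_cache=True)
    else:
        # Trim the legacy tuple cache (per layer: (key, value) tensors of shape
        # [batch, heads, seq, head_dim]) down to the shared prefix.
        trimmed = tuple(
            (k[:, :, :keep, :], v[:, :, :keep, :]) for k, v in past_key_values
        )
        out = model(input_ids=new_ids[:, keep:], past_key_values=trimmed, use_cache=True)
    return out.logits, out.past_key_values
```

This per-sequence trimming is exactly the part that gets messy with batching: every sequence in the batch would need its own prefix length plus padding bookkeeping, which is the cumbersome case described above.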
Yes, this is precisely what is delaying KV caching support currently. We want to provide full batched support, but a simple non-batched variant may make it to main before then.
The caching mentioned in the docs refers to caching on different levels, e.g. during decoding, but also across multiple runs.
It's good to know that there is now official support in transformers; this will make the implementation easier and avoid the hacks they have in guidance.
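For completeness, a rough sketch of what the generate-level support could look like, assuming a recent transformers version where `generate` can return the cache when `return_dict_in_generate=True` (exact arguments and output fields may differ by version):

```python
# Rough sketch, assuming a transformers version where generate() returns the
# KV cache with return_dict_in_generate=True; details may differ by version.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=8,
    use_cache=True,
    return_dict_in_generate=True,
)

# The returned object can carry the KV cache for the generated sequence, which
# a follow-up query with a shared prefix could reuse instead of re-encoding it.
cached = out.past_key_values
print(tokenizer.decode(out.sequences[0]))
```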