Prefix Caching Does Not Require a Prompt Template, and Other Ambiguities
I'm having difficulty approaching prompt-related performance tuning because of ambiguity in how TensorRT's prefix caching works.
My current understanding of prefix caching as implemented in TensorRT, based on the description in Issue 620, is the following: blocks shared between prompts are reused in later inferences using those prompts. The question is, how does this differ from ordinary KV caching? Is there some special logic to identify frequently reused blocks? My understanding is that, unless a block is evicted from the cache, it automatically remains in memory to be reused, which would mean prefix caching functions by default when using PagedAttention. I suppose the change that "enables" prefix caching could simply have been disabling whatever automatic eviction policy was used before.
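To make my mental model concrete: the distinction I *think* exists is that plain KV caching keeps blocks alive only for the request that produced them, while prefix caching gives blocks a request-independent identity, typically a hash over the entire token prefix ending at that block, so a later request with the same prefix can pick up the same physical blocks. This is how I understand schemes like vLLM's automatic prefix caching to work; whether TensorRT does the same is exactly what I'm asking. A toy sketch of that assumed scheme (all names illustrative, not TensorRT's API):

```python
BLOCK_SIZE = 2  # tokens per KV-cache block (illustrative choice)


class BlockPool:
    """Toy global pool in which a block's cache key is the entire
    token prefix ending at that block, so identical prefixes across
    *different* requests map to the same physical block."""

    def __init__(self):
        self.key_to_block = {}  # prefix key -> physical block id
        self.next_id = 0

    def lookup_or_allocate(self, tokens):
        """Return (block_ids, num_reused) for one prompt.

        A trailing partial block (fewer than BLOCK_SIZE tokens) is
        ignored in this toy, as partial blocks aren't shareable."""
        block_ids, num_reused = [], 0
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            key = tuple(tokens[:end])  # key covers the whole prefix
            if key in self.key_to_block:
                num_reused += 1  # KV computed by an earlier request
            else:
                self.key_to_block[key] = self.next_id
                self.next_id += 1
            block_ids.append(self.key_to_block[key])
        return block_ids, num_reused


pool = BlockPool()
sys_prompt = ["You", "are", "a", "helpful", "assistant", "."]
_, reused_first = pool.lookup_or_allocate(sys_prompt + ["Hello"])
_, reused_second = pool.lookup_or_allocate(sys_prompt + ["Howdy"])
# First request computes all blocks; the second reuses the three
# blocks covering the shared 6-token system prompt.
```

Under plain per-request caching, the second request would recompute everything; under this scheme it only computes the block containing its final tokens.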
What makes this even more confusing is that in the PagedAttention paper, prefix sharing is described as follows:
"For this type of application, many user prompts share a prefix, thus the LLM service provider can store the KV cache of the prefix in advance to reduce the redundant computation spent on the prefix ... this can be conveniently achieved by reserving a set of physical blocks for a set of predefined shared prefixes by the LLM service provider"
This would indicate that, in order to receive the benefits of prefix caching, we need to register a prompt template in advance, yet nothing of the sort is mentioned in the documentation. So it would seem the system is automatic.
I'm asking for the broader reason of understanding how slight changes to tokens in the prompt will affect performance.
Consider a prompt structured as: "Please ask [user] how their day is going. Be sure to greet them by name." If [user] changes on every request, will the entire prompt's cache be thrown out and recomputed? Or, supposing for the sake of example that blocks hold 2 tokens and each word is one token, would everything except the block containing "[user] how" be kept, with only that block recomputed?
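My suspicion, assuming the hash-based scheme I described above (where a block's cache key covers *every* token before and inside it), is that neither happens: a changed token invalidates its own block and every block after it, but the blocks before the change survive. A toy check of the example, under that assumption:

```python
BLOCK_SIZE = 2  # tokens per block; each word treated as one token


def reusable_blocks(cached_tokens, new_tokens):
    """Count leading blocks of new_tokens that can reuse cached KV.

    Assumes a block is reusable only if every token up to and
    including that block is identical (prefix-hash scheme); this is
    my assumption, not confirmed TensorRT behavior."""
    count = 0
    limit = min(len(cached_tokens), len(new_tokens))
    for end in range(BLOCK_SIZE, limit + 1, BLOCK_SIZE):
        if cached_tokens[:end] != new_tokens[:end]:
            break  # first mismatch kills this and all later blocks
        count += 1
    return count


prompt_alice = "Please ask Alice how their day is going".split()
prompt_bob = "Please ask Bob how their day is going".split()
# Only the first block ("Please ask") is reusable: the differing
# third token invalidates the "Alice how" block and, because every
# later block's key also contains that token, all blocks after it.
```

If this assumption is right, the takeaway for prompt engineering is that variable fields like [user] should be pushed as late in the prompt as possible, since everything after the first differing token is recomputed.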