Prompt caching
This is another feature I'd be keen to see included, and would be happy to help with. I've noticed that caching is recorded as a capability, so figured I'd check if you have existing plans.
This is something I've thought about, but it's a bit tricky because there's pretty wide variation in how caching works among the big providers. If you have some ideas, let me know; otherwise I'll have to devote some time to thinking and experimenting with possible solutions. There's no guarantee that anything reasonable will come out of that, though.
What about providing low-level provider functions for every provider first, without a general abstraction? After that we can think about whether an abstraction can be built on top of them.
It's been on my TODO for a while to check this, but I think the non-standard-params argument is intended (or at least can be used) for this sort of experimentation?
I should have some time this weekend to experiment with this. It might be possible to use non-standard-params for this, but it could be that something else makes more sense.
I think the best way to go about this would be to add a cached-context slot in llm-chat-prompt. If filled, each provider can do the right thing with it (rough sketch below):
- OpenAI can simply prefix it to the prompt; no special change to the request is needed, and the prefix is cached automatically.
- Claude can insert it at the beginning of the system prompt and mark it as cacheable.
- Gemini / Vertex can upload it separately. This is likely the weirdest one; we don't currently ever do two requests in one llm call. It would also need to store a special value in the prompt, perhaps requiring another slot for miscellaneous key/value pairs.
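To make that concrete, here's a minimal sketch of what the Claude and OpenAI sides could look like. The `cached-context` slot and the `my-` functions are hypothetical, not the llm library's actual API; only the Anthropic `cache_control` block format comes from their docs.

```elisp
(require 'cl-lib)

;; Hypothetical prompt struct with the proposed cached-context slot; the
;; real `llm-chat-prompt' has different/more slots.
(cl-defstruct my-chat-prompt
  context        ; ordinary, possibly changing, system context
  cached-context ; stable prefix the provider should try to cache
  interactions)

(defun my-claude-system-blocks (prompt)
  "Build Claude-style system blocks from PROMPT.
The cached part gets a `cache_control' marker so Anthropic caches it."
  (append
   (let ((cached (my-chat-prompt-cached-context prompt)))
     (when cached
       `((("type" . "text")
          ("text" . ,cached)
          ("cache_control" . (("type" . "ephemeral")))))))
   (let ((ctx (my-chat-prompt-context prompt)))
     (when ctx
       `((("type" . "text")
          ("text" . ,ctx)))))))

(defun my-openai-system-string (prompt)
  "For OpenAI no request change is needed: the cached part is simply
prepended, and OpenAI's automatic prefix caching does the rest."
  (concat (or (my-chat-prompt-cached-context prompt) "")
          (or (my-chat-prompt-context prompt) "")))
```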
I like the cached-context idea. Two quick thoughts:
- For Claude, tools should be cached by default, since they come first in the order.
- For conversation threads, maybe add a simple "cache-after" parameter to specify message indexes where cache markers should be placed? (Rough sketch below.)

This would cover the common cases: cache tools by default, use cached-context for the system prompt, and cache-after to mark specific points in conversation history.
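For illustration, here's roughly what the cache-after idea could look like on the Claude side. `my-mark-cache-points` and the message representation are assumptions, not existing llm code:

```elisp
(defun my-mark-cache-points (messages cache-after)
  "Return MESSAGES with a `cache_control' marker at each index in CACHE-AFTER.
MESSAGES is a list of alists of the form ((\"role\" . R) (\"content\" . C))."
  (let ((i -1))
    (mapcar
     (lambda (msg)
       (setq i (1+ i))
       (if (memq i cache-after)
           ;; Turn the content into a block list so cache_control can be
           ;; attached to the final block, as the Anthropic API expects.
           `(("role" . ,(cdr (assoc "role" msg)))
             ("content"
              . ((("type" . "text")
                  ("text" . ,(cdr (assoc "content" msg)))
                  ("cache_control" . (("type" . "ephemeral")))))))
         msg))
     messages)))

;; Example: cache after the first and fourth message.
;; (my-mark-cache-points messages '(0 3))
```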
So I think if there is a cached-context, yes, tools should be cached in Claude. For caching messages, it is possible, but I think there's enough discrepancy in whether this is supported or not that I'd like to hold off on this for now.
If I had some indication when the client creates the prompt that this is probably going to be used for several rounds of conversation, we could just enable caching in a way appropriate to each provider, caching essentially everything in the initial request that we can. Maybe this is a more sensible way of doing things. I'm going to think on this a bit more.
So I think if there is a cached-context, yes, tools should be cached in Claude. For caching messages, it is possible, but I think there's enough discrepancy in whether this is supported or not that I'd like to hold off on this for now.
If I had some indication when the client creates the prompt that this is probably going to be used for several rounds of conversation, we could just enable caching in a way appropriate to each provider, caching essentially everything in the initial request that we can. Maybe this is a more sensible way of doing things. I'm going to think on this a bit more.
Given that my motivation is to reduce token usage with Sonnet, I'm inclined to modify https://github.com/ultronozm/ai-org-chat.el so that a :CACHE_CONTROL: property at any node has the expected effect. It seems like the simplest way to do this would be to roll my own variants of llm-chat-streaming and llm-provider-chat-request from llm-claude.el, with some way to specify at which conversation points to introduce cache_control markers (e.g., a list of t's and nil's corresponding to message parts). This might be a bit hacky, so let me know if you have other ideas.
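The front-end side of that is simple enough; the flag list could be derived from the org tree along these lines (just a sketch of the idea, assuming one message per top-level heading):

```elisp
(require 'org)

(defun my-org-cache-flags ()
  "Return a list of t/nil, one per top-level heading in the current buffer,
indicating whether that heading has a :CACHE_CONTROL: property set."
  (org-map-entries
   (lambda () (and (org-entry-get (point) "CACHE_CONTROL") t))
   "LEVEL=1"))
```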
For a more automatic solution, I guess the obvious thing to try would be adding a cache_control immediately following any large media item (PDF or image), or at the first point in the conversation where the token count exceeds some fraction of the maximum.
I think the question is: what parts don't you want to cache? If we had an indication that a client wanted caching, we can and maybe should just cache everything possible, which fits your use case as far as I understand it. You wouldn't want to cache things that will change, but the prompt only ever gets appended to, so after the first round with the LLM the initial prompt is effectively fixed.
I agree with caching as much as possible, but unless I've misunderstood the docs, "caching every time" is less efficient than not caching at all. The reason is that with each cache insertion, Claude reprocesses the whole prefix and charges a 25% premium.
Without caching, with N messages, each message gets processed about N/2 times, giving a cost of roughly N*N/2 (assuming messages of equal length, for simplicity).
With C caches evenly spaced among N messages: on average, each of the C caches processes about N/2 messages with a 25% premium, while each of the N exchanges processes about (N/C)/2 messages. The total cost is (1.25 * C*N + N*N/C)/2.
Caching every message means taking C=N, and so is more expensive than not caching at all.
A doubling strategy (like for vector reallocation) means C = log(N), which also doesn't save much: the N*N/C term still dominates, so the cost is on the order of N*N/log(N).
A good strategy for this toy model would be to take C = sqrt(N) on average: minimizing (1.25 * C*N + N*N/C)/2 over C gives C = sqrt(N/1.25), roughly sqrt(N). Concretely, if one cache comes after message n, then the next should come after message n + ceil(sqrt(n)).
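Here's the toy model as a throwaway calculator, just to put numbers on the comparison (nothing here is real token accounting):

```elisp
(defun my-toy-cost (n c)
  "Approximate cost of N equal messages with C evenly spaced cache writes.
C = 0 means no caching at all; cache writes carry a 25% premium."
  (/ (+ (* 1.25 c n)
        (/ (* n (float n)) (max c 1)))
     2.0))

;; For N = 100 messages:
;; (my-toy-cost 100 0)   => 5000.0  ; no caching
;; (my-toy-cost 100 100) => 6300.0  ; caching every message: worse
;; (my-toy-cost 100 10)  => 1125.0  ; C = sqrt(N): much better
```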
In practice, messages are of different sizes: we start with some large context (text or other media) and then want to discuss it. It then makes sense to cache again only when the cost of reprocessing the accumulated discussion exceeds that of recaching the context, e.g., as soon as
(tokens processed since last cache, with repetition) > c * (tokens in last cache)
where c is some parameter we could optimize (e.g., c=1).
Other points to remember are the minimum cacheable size of 1024 tokens and the 5-minute TTL. These suggest that we should always cache if we're above the minimum and at least 5 minutes have elapsed since the last cache use.
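Put together, the rule could look something like this (all names are hypothetical; the 1024-token minimum and 5-minute TTL are from Anthropic's docs, the rest is just the heuristic above):

```elisp
(defun my-should-recache-p (prompt-tokens tokens-since-cache
                            tokens-in-cache seconds-since-cache-use
                            &optional c)
  "Heuristic: return non-nil when adding a new cache breakpoint seems worthwhile.
PROMPT-TOKENS is the total prompt size; TOKENS-SINCE-CACHE counts tokens
reprocessed since the last cache point (with repetition); TOKENS-IN-CACHE is
the size of the cached prefix; C is the tunable factor (default 1)."
  (let ((c (or c 1)))
    (and (>= prompt-tokens 1024)      ; below the minimum, caching is a no-op
         (or
          ;; The old cache has likely expired, so refresh it.
          (>= seconds-since-cache-use (* 5 60))
          ;; Reprocessing the tail now costs more than the 25% write premium.
          (> tokens-since-cache (* c tokens-in-cache))))))
```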
My last comment might now be out of date: https://www.anthropic.com/news/token-saving-updates. I'll need to look more carefully at https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching at some point.
Added later: from my first reading, this update doesn't seem to affect the earlier discussion, but it'd be good to test things out to be sure.
Thanks for the useful analysis on this! After thinking about it some more, I'm coming back around to my original proposal, because really only the client knows which parts of the context are going to be common to multiple queries. Caching based on the conversation itself is nice, but as you show, it's not trivial to optimize, and probably not as useful as caching those common prefixes and tools across multiple API calls.
My previous proposal handled caching the context well, but not the tools. Perhaps what we can say is that if we cache the context at all, we'll also cache the tools where possible.
I agree that your original proposal is a good compromise between simplicity and practicality, but I still think I'd find it useful to be able to cache part of the conversation. The llm library doesn't need to solve the problem of determining precisely when to cache; that can be left to front-end packages. Happy to help with implementation on this.
Re. your original proposal, one thing that would make this more effective is if the context were multipart, so that one can, e.g., chat about an image or a PDF.