Agent should support prompt caching
Is your feature request related to a problem? Please describe. In Agents with many tool calls, input tokens can accumulate quickly. Prompt caching (when supported by the LLM) can significantly reduce costs for this use case. The agent should automatically designate chat messages to be cached if the user enables caching.
Describe the solution you'd like The AnthropicChatGenerator already supports reading cache control from the ChatMessage meta. If the agent set that meta entry on the messages, we could use caching. We would need to experiment with whether it makes sense to always set it on the latest message, to set it only on some messages, or to leave the decision to the user.
Additional context For context, here is a link to an example of prompt caching: https://github.com/deepset-ai/haystack-core-integrations/pull/1300/files#diff-c9173d1750430b1d796e26865cf0ca1fb91a9ec7016ff989aab26f3fd949eb62R62
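As a rough illustration of what the Agent would automate, this is roughly what opting into caching manually looks like today, assuming the cache_control meta key and ephemeral value used in the linked PR:

from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.anthropic import AnthropicChatGenerator

# Requires ANTHROPIC_API_KEY; depending on the integration version, an extra
# anthropic-beta header may be needed via generation_kwargs (see the linked PR).
generator = AnthropicChatGenerator()

# Mark the long, static system prompt as cacheable; the generator reads this
# meta entry and forwards it as an Anthropic cache_control block.
system_message = ChatMessage.from_system("Long system prompt with tool definitions ...")
system_message.meta["cache_control"] = {"type": "ephemeral"}

result = generator.run(messages=[system_message, ChatMessage.from_user("Summarize the latest release notes.")])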
I believe we should consider implementing caching at the tool level, not just at the message/prompt level. Since most agent interactions rely on repeated tool invocations with the same inputs (especially in deterministic environments), caching the tool call inputs and their corresponding outputs could significantly reduce both cost and latency.
This could start as a simple input-output cache per tool, and then evolve into a fully integrated caching layer that sits between the agent and its tools. The LLM prompt would then only be recomputed if the tool results changed or were uncached.
This approach complements the current prompt-level caching and helps address multi-tool agent workflows where the LLM acts more as an orchestrator than a pure reasoner.
Just to add a thought here — in addition to prompt-level caching, I believe there's strong value in implementing tool-level caching, especially since repeated tool calls (with the same arguments) can be common in multi-step agent workflows.
One possible approach is to extend the ToolInvoker with a lightweight caching layer that uses the tool name + arguments as a deterministic cache key. This could store results in the document store and serve cached responses transparently to the agent, avoiding unnecessary executions.
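To make the idea concrete, here is a hypothetical sketch of that cache key (not an actual ToolInvoker API; all names are made up, and an in-memory dict stands in for the document store):

import json
from typing import Any, Callable

# Hypothetical in-memory cache: tool name + canonicalized arguments -> result.
_tool_cache: dict[str, Any] = {}

def cached_invoke(tool_name: str, tool_fn: Callable[..., Any], **kwargs: Any) -> Any:
    key = f"{tool_name}:{json.dumps(kwargs, sort_keys=True)}"
    if key not in _tool_cache:
        _tool_cache[key] = tool_fn(**kwargs)
    return _tool_cache[key]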
What do you think?
I think tool-level caching might add unwanted complexity. Let's take an Agent that can call the GitHub API to view file contents by passing repo/owner and path. The API call costs nothing and should resolve in under 500 ms, so there would be little benefit in caching the result. On the other hand, how would we handle cache invalidation? Let's assume the agent can also edit files. If the agent views a file, then edits it, then views it again later, we should not serve the content from the cache. Granted, we could invalidate the cache upon edit, but what about cases where the change to the file does not originate from the Agent itself?
The same goes for many other tools. Imagine a weather API: the weather changes, so even with the same tool call arguments, the new result can differ from the cached result. How should we decide when to invalidate the cache?
@YassinNouh21 can you perhaps give some examples of expensive tool calls (cost and latency) that would warrant the complexity of building a caching layer and handling cache invalidation?
@mathislucka
Great points on cache invalidation — dynamic sources like GitHub or weather APIs definitely require careful handling. That said, tool-level caching can still be very useful when applied selectively.
For example, frameworks like CrewAI support this with custom cache_functions per tool, allowing fine-grained control over what gets cached and for how long. Tools like SerperDevTool (for search) or LLM-based summarization tools benefit a lot, since they cost money and introduce noticeable latency (1–5 s). In contrast, you'd simply disable caching for fast, dynamic tools like GitHub or weather.
So rather than caching everything, the idea is to make caching opt-in per tool, with strategies like TTL-based invalidation or result-aware logic.
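Roughly, going by the CrewAI docs, a per-tool cache_function looks like the snippet below (treat the exact import path and signature as an assumption; they may differ between versions):

from crewai.tools import tool

@tool("multiplication_tool")
def multiplication_tool(first_number: int, second_number: int) -> int:
    """Multiply two numbers."""
    return first_number * second_number

def cache_func(args, result):
    # Decide per call whether the result should be cached (True) or not (False).
    return result % 2 == 0

multiplication_tool.cache_function = cache_func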
Just had a quick glance at the crewai implementation and I got the impression that this does not handle production-grade caching.
The cache is a simple dict that is kept in memory. This might work for a single uninterrupted agent run, but it breaks as soon as you run an Agent that scales horizontally with multiple instances behind a load balancer.
It is also not possible to use TTL-based invalidation with this simple dict approach. Can you give an example of how to achieve time-based cache invalidation with crewai?
I think a better approach is to leave any caching decisions to the user who is implementing the Agent. This might be as simple as this:
from functools import lru_cache
from haystack.components.websearch import SerperDevWebSearch
from haystack.dataclasses import Document
from haystack.tools import create_tool_from_function

search_comp = SerperDevWebSearch()

@lru_cache
def search(query: str) -> list[Document]:
    """Search the web for the given query."""
    return search_comp.run(query=query)["documents"]  # repeated queries hit the lru_cache

search_tool = create_tool_from_function(search, name="search_tool")
This should work for basic use cases. If users have advanced caching needs, they could integrate a proper caching solution like Redis into their tools.
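For the TTL case, a rough sketch with Redis might look like this (assuming a local Redis instance; the key format and one-hour TTL are just illustrative):

import json
import redis
from haystack.components.websearch import SerperDevWebSearch
from haystack.dataclasses import Document

r = redis.Redis()  # assumes a Redis server on localhost:6379
search_comp = SerperDevWebSearch()

def cached_search(query: str) -> list[Document]:
    key = f"search:{query}"
    if (hit := r.get(key)) is not None:
        return [Document.from_dict(d) for d in json.loads(hit)]
    docs = search_comp.run(query=query)["documents"]
    r.setex(key, 3600, json.dumps([d.to_dict() for d in docs]))  # expire after one hour
    return docs

A shared Redis instance would also keep the cache consistent across horizontally scaled Agent instances.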
@mathislucka I agree