Anthropic's prompt caching in LangChain does not work with ChatPromptTemplate.
URL
https://python.langchain.com/docs/how_to/llm_caching/
Checklist
- [X] I added a very descriptive title to this issue.
- [X] I included a link to the documentation page I am referring to (if applicable).
Issue with current documentation:
I have not found anything about prompt caching in the LangChain documentation. There seems to be only one post on Twitter regarding prompt caching in LangChain. I am trying to implement prompt caching in my RAG system, which uses a history-aware retriever.
I have instantiated the model like this:
llm_claude = ChatAnthropic(
    model="claude-3-5-sonnet-20240620",
    temperature=0.1,
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)
And using the ChatPromptTemplate like this:
contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        ("human", "{input}"),
    ]
)
I am not able to find a way to include prompt caching with this. I tried making the prompt like this, but it still doesn't work:
prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessage(
            content=contextualize_q_system_prompt,
            additional_kwargs={"cache_control": {"type": "ephemeral"}},
        ),
        HumanMessage(content="{input}"),
    ]
)
Please help me understand how to enable prompt caching in LangChain.
Idea or request for content:
The LangChain documentation should be updated to explain how to use prompt caching with different prompt templates, and especially within a RAG system.
Hi, @raajChit. I'm Dosu, and I'm helping the LangChain team manage their backlog. I'm marking this issue as stale.
Issue Summary:
- You raised a concern about the lack of documentation for implementing prompt caching in LangChain.
- Specifically, you are looking for guidance on using ChatPromptTemplate with Anthropic's models.
- Your goal is to integrate prompt caching into a retrieval-augmented generation (RAG) system.
- You suggested updating the LangChain documentation to include instructions for different prompt templates.
- There have been no further comments or developments on this issue.
Next Steps:
- Please let us know if this issue is still relevant to the latest version of the LangChain repository by commenting here.
- If there is no further activity, this issue will be automatically closed in 7 days.
Thank you for your understanding and contribution!
This is a major issue. Without caching it's just too slow and too expensive. I tried:
- Adding additional_kwargs={"cache_control": {"type": "ephemeral"}}
- Adding {"cache_control": {"type": "ephemeral"}} to the message, and caching doesn't happen.
The only thing that does get cached is the system prompt.
@eyurtsev, the user has indicated that the issue regarding prompt caching is still a major concern, as it significantly impacts performance and cost. Could you please assist them with this matter?
@raajChit @drorm @eyurtsev; I'm struggling with this one too. I have an 80k-token prompt that has some placeholders in it. The moment I use SystemMessage, caching works, but I lose the templating that fills in the placeholders. If I replace that with SystemMessagePromptTemplate, the placeholders receive values but caching does not work.
Is it, by design, not possible to cache this because it is not a static prompt?
I've posted a discussion on #29747 to enable the feature mentioned in this issue.
After initial investigation, it appears the only method for passing arbitrary key-value pairs (e.g., cache_control) in chat or system messages to the Anthropic client is to structure the message content as a list[dict] when creating an instance of a BaseMessage subclass.
An example as a reference (imports and model instantiation added for completeness; TRANSCRIPT is assumed to be a long string defined elsewhere):
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage

model = ChatAnthropic(model="claude-3-5-sonnet-20240620")

messages = [
    HumanMessage(
        content=[
            {
                "type": "text",
                "text": TRANSCRIPT,  # the long text to cache
                "cache_control": {"type": "ephemeral"},
            }
        ],
    ),
    HumanMessage(content="Summarize the transcript in 2-3 sentences."),
]
response = model.invoke(messages)
response.usage_metadata
Now, when working with _StringImageMessagePromptTemplate subclasses (e.g., HumanMessagePromptTemplate), this doesn't work. I looked at the code and found the reason in the following code blocks:
in _StringImageMessagePromptTemplate.from_template method
https://github.com/langchain-ai/langchain/blob/33354f984fba660e71ca0039cfbd3cf37643cfab/libs/core/langchain_core/prompts/chat.py#L523-L538
in _StringImageMessagePromptTemplate.format method
https://github.com/langchain-ai/langchain/blob/33354f984fba660e71ca0039cfbd3cf37643cfab/libs/core/langchain_core/prompts/chat.py#L647-L658
So you can see that when it receives a list of dicts with a text key, it converts each to a StringPromptTemplate and drops all the additional properties/kwargs. In the format method it creates a new dict, but it does not carry over the additional properties/kwargs present in the original message.
One solution I can think of with minimal change is to store these additional properties while creating the PromptTemplate instance, and to access them in the format method before returning the result.
@baskaryan @ccurme If this sounds good then I can potentially open a PR for this
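A rough sketch of that idea (hypothetical names throughout; this is not the actual LangChain patch, just an illustration of preserving extra block keys across formatting):
from langchain_core.prompts import PromptTemplate

class _TextBlockTemplate:
    """Illustrative wrapper: a string template plus the extra block keys
    (e.g. cache_control) that the current code drops."""

    def __init__(self, block: dict):
        self.prompt = PromptTemplate.from_template(block["text"])
        # Preserve everything except the templated fields.
        self.extra = {k: v for k, v in block.items() if k not in ("type", "text")}

    def format(self, **kwargs) -> dict:
        # Re-attach the preserved keys to the rendered block.
        return {"type": "text", "text": self.prompt.format(**kwargs), **self.extra}

tpl = _TextBlockTemplate(
    {"type": "text", "text": "{context}", "cache_control": {"type": "ephemeral"}}
)
print(tpl.format(context="long document..."))
# {'type': 'text', 'text': 'long document...', 'cache_control': {'type': 'ephemeral'}}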
I have a partial solution. It reduced my cost by two thirds, but there's still some work to do here. I created my own version of https://github.com/langchain-ai/langchain/blob/master/libs/partners/anthropic/langchain_anthropic/chat_models.py at https://github.com/drorm/vmpilot/blob/main/src/vmpilot/caching/chat_models.py, and in https://github.com/drorm/vmpilot/blob/main/src/vmpilot/ you can see how I mark blocks as ephemeral:
agent.py:79: block["cache_control"] = {"type": "ephemeral"}
agent.py:83: message.additional_kwargs["cache_control"] = {"type": "ephemeral"}
agent.py:175: system_content["cache_control"] = {"type": "ephemeral"}
vmpilot.py:261: "type": "ephemeral"
This results in:
INFO: 157.131.22.45:0 - "POST /chat/completions HTTP/1.1" 200 OK
{'cache_creation_input_tokens': 1694, 'cache_read_input_tokens': 2521, 'input_tokens': 4, 'output_tokens': 172}
{'cache_creation_input_tokens': 208, 'cache_read_input_tokens': 4215, 'input_tokens': 6, 'output_tokens': 317}
{'cache_creation_input_tokens': 331, 'cache_read_input_tokens': 4423, 'input_tokens': 6, 'output_tokens': 98}
{'cache_creation_input_tokens': 352, 'cache_read_input_tokens': 4754, 'input_tokens': 6, 'output_tokens': 475}
{'cache_creation_input_tokens': 514, 'cache_read_input_tokens': 5106, 'input_tokens': 6, 'output_tokens': 157}
{'cache_creation_input_tokens': 195, 'cache_read_input_tokens': 5620, 'input_tokens': 6, 'output_tokens': 309}
{'cache_creation_input_tokens': 347, 'cache_read_input_tokens': 5815, 'input_tokens': 6, 'output_tokens': 155}
{'cache_creation_input_tokens': 193, 'cache_read_input_tokens': 6162, 'input_tokens': 6, 'output_tokens': 330}
{'cache_creation_input_tokens': 368, 'cache_read_input_tokens': 6355, 'input_tokens': 6, 'output_tokens': 118}
{'cache_creation_input_tokens': 174, 'cache_read_input_tokens': 6723, 'input_tokens': 6, 'output_tokens': 102}
{'cache_creation_input_tokens': 298, 'cache_read_input_tokens': 6897, 'input_tokens': 6, 'output_tokens': 107}
{'cache_creation_input_tokens': 332, 'cache_read_input_tokens': 7195, 'input_tokens': 6, 'output_tokens': 211}
ne 25 steps in a row. Let me know if you'd like me to continue.
INFO: 157.131.22.45:0 - "GET /models HTTP/1.1" 200 OK
INFO: 157.131.22.45:0 - "GET /models HTTP/1.1" 200 OK
vmpilot.anthropic
vmpilot.anthropic
INFO: 157.131.22.45:0 - "POST /chat/completions HTTP/1.1" 200 OK
{'cache_creation_input_tokens': 980, 'cache_read_input_tokens': 4215, 'input_tokens': 4, 'output_tokens': 732}
{'cache_creation_input_tokens': 743, 'cache_read_input_tokens': 5195, 'input_tokens': 6, 'output_tokens': 99}
{'cache_creation_input_tokens': 324, 'cache_read_input_tokens': 5938, 'input_tokens': 6, 'output_tokens': 488}
{'cache_creation_input_tokens': 525, 'cache_read_input_tokens': 6262, 'input_tokens': 6, 'output_tokens': 107}
{'cache_creation_input_tokens': 345, 'cache_read_input_tokens': 6787, 'input_tokens': 6, 'output_tokens': 86}
{'cache_creation_input_tokens': 498, 'cache_read_input_tokens': 7132, 'input_tokens': 6, 'output_tokens': 86}
{'cache_creation_input_tokens': 329, 'cache_read_input_tokens': 7630, 'input_tokens': 6, 'output_tokens': 85}
{'cache_creation_input_tokens': 472, 'cache_read_input_tokens': 7959, 'input_tokens': 6, 'output_tokens': 85}
{'cache_creation_input_tokens': 577, 'cache_read_input_tokens': 8431, 'input_tokens': 6, 'output_tokens': 84}
{'cache_creation_input_tokens': 289, 'cache_read_input_tokens': 9008, 'input_tokens': 6, 'output_tokens': 435}
{'cache_creation_input_tokens': 472, 'cache_read_input_tokens': 9297, 'input_tokens': 6, 'output_tokens': 401}
I'm planning on fixing this in https://github.com/drorm/vmpilot/issues/27. Subscribe if you want to keep track. I expect an improvement of another 20%-40%, but won't know for sure till I've implemented it.
I didn't submit this as a pull request because this needs much more work and testing for a generalized solution. For my purpose, this works fine.
This is working now, see: https://python.langchain.com/docs/integrations/chat/anthropic/#incremental-caching-in-conversational-applications
@0ca this was working before as well, but it's great to see it added to the official docs as an example.
This issue is actually focused on making it work with ChatPromptTemplate. As I mentioned above, ChatPromptTemplate currently only looks at the type and text fields, ignores additional fields like cache_control, and recreates the message.
Right now, to work around this, we manually call format_messages first and then add cache_control back to the output messages, roughly as in the sketch below.
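A minimal sketch of that workaround (assuming a ChatAnthropic model; long_context and question are placeholder variables, and the template names are illustrative):
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate

llm = ChatAnthropic(model="claude-3-5-sonnet-20240620")
prompt = ChatPromptTemplate.from_messages(
    [("system", "{context}"), ("human", "{input}")]
)

# Render the template first; any cache_control on the template is dropped here.
messages = prompt.format_messages(context=long_context, input=question)

# Then re-attach cache_control by rewrapping the rendered system message
# as a content-block list, which the Anthropic client forwards to the API.
system_msg = messages[0]
system_msg.content = [
    {
        "type": "text",
        "text": system_msg.content,
        "cache_control": {"type": "ephemeral"},
    }
]

response = llm.invoke(messages)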
This is all working correctly for me now, but I had to hack the code. Here's a log demonstrating it:
INFO: 157.131.22.45:0 - "POST /chat/completions HTTP/1.1" 200 OK
2025-03-27 19:04:35,385 - vmpilot.chat - INFO - Changed to project directory: /home/dror/vmpilot
2025-03-27 19:04:35,388 - vmpilot.exchange - INFO - New exchange started for chat_id: 5384alGfieQq
2025-03-27 19:04:35,402 - vmpilot.agent - INFO - Started new chat session with thread_id: 5384alGfieQq
2025-03-27 19:04:38,571 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 2634, 'cache_read_input_tokens': 0, 'input_tokens': 4, 'output_tokens': 107}
2025-03-27 19:04:40,969 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 398, 'cache_read_input_tokens': 2634, 'input_tokens': 6, 'output_tokens': 144}
2025-03-27 19:04:43,338 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 861, 'cache_read_input_tokens': 3032, 'input_tokens': 6, 'output_tokens': 108}
2025-03-27 19:04:45,100 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 534, 'cache_read_input_tokens': 3893, 'input_tokens': 6, 'output_tokens': 103}
2025-03-27 19:04:47,716 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 2335, 'cache_read_input_tokens': 4427, 'input_tokens': 6, 'output_tokens': 135}
2025-03-27 19:04:50,531 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 419, 'cache_read_input_tokens': 6762, 'input_tokens': 6, 'output_tokens': 122}
2025-03-27 19:04:53,134 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 435, 'cache_read_input_tokens': 7181, 'input_tokens': 6, 'output_tokens': 133}
2025-03-27 19:04:55,857 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 1018, 'cache_read_input_tokens': 7616, 'input_tokens': 6, 'output_tokens': 138}
2025-03-27 19:05:07,889 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 325, 'cache_read_input_tokens': 8634, 'input_tokens': 6, 'output_tokens': 1086}
2025-03-27 19:05:14,317 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 1126, 'cache_read_input_tokens': 8959, 'input_tokens': 6, 'output_tokens': 431}
2025-03-27 19:05:18,895 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 472, 'cache_read_input_tokens': 10085, 'input_tokens': 6, 'output_tokens': 347}
2025-03-27 19:05:22,480 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 388, 'cache_read_input_tokens': 10557, 'input_tokens': 6, 'output_tokens': 123}
2025-03-27 19:05:26,969 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 344, 'cache_read_input_tokens': 10945, 'input_tokens': 6, 'output_tokens': 122}
2025-03-27 19:05:32,816 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 249, 'cache_read_input_tokens': 11289, 'input_tokens': 6, 'output_tokens': 300}
INFO: 157.131.22.45:0 - "GET /models HTTP/1.1" 200 OK
INFO: 157.131.22.45:0 - "GET /models HTTP/1.1" 200 OK
INFO: 157.131.22.45:0 - "GET /models HTTP/1.1" 200 OK
INFO: 157.131.22.45:0 - "GET /models HTTP/1.1" 200 OK
INFO: 157.131.22.45:0 - "GET /models HTTP/1.1" 200 OK
INFO: 157.131.22.45:0 - "GET /models HTTP/1.1" 200 OK
vmpilot2.anthropic
vmpilot2.anthropic
INFO: 157.131.22.45:0 - "POST /chat/completions HTTP/1.1" 200 OK
2025-03-27 19:07:03,426 - vmpilot.exchange - INFO - New exchange started for chat_id: 5384alGfieQq
2025-03-27 19:07:03,438 - vmpilot.agent - INFO - Retrieved previous conversation state with 28 messages for thread_id: 5384alGfieQq
2025-03-27 19:07:07,353 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 325, 'cache_read_input_tokens': 11538, 'input_tokens': 4, 'output_tokens': 172}
2025-03-27 19:07:10,519 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 293, 'cache_read_input_tokens': 11863, 'input_tokens': 6, 'output_tokens': 131}
2025-03-27 19:07:13,566 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 1010, 'cache_read_input_tokens': 12156, 'input_tokens': 6, 'output_tokens': 132}
2025-03-27 19:07:16,565 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 653, 'cache_read_input_tokens': 13166, 'input_tokens': 6, 'output_tokens': 128}
2025-03-27 19:07:19,667 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 441, 'cache_read_input_tokens': 13819, 'input_tokens': 6, 'output_tokens': 134}
2025-03-27 19:07:22,972 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 548, 'cache_read_input_tokens': 14260, 'input_tokens': 6, 'output_tokens': 118}
2025-03-27 19:07:25,680 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 171, 'cache_read_input_tokens': 14808, 'input_tokens': 6, 'output_tokens': 114}
2025-03-27 19:07:29,306 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 347, 'cache_read_input_tokens': 14979, 'input_tokens': 6, 'output_tokens': 152}
2025-03-27 19:07:38,938 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 434, 'cache_read_input_tokens': 15326, 'input_tokens': 6, 'output_tokens': 574}
2025-03-27 19:07:42,775 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 615, 'cache_read_input_tokens': 15760, 'input_tokens': 6, 'output_tokens': 179}
2025-03-27 19:07:45,667 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 193, 'cache_read_input_tokens': 16375, 'input_tokens': 6, 'output_tokens': 110}
2025-03-27 19:07:49,393 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 288, 'cache_read_input_tokens': 16568, 'input_tokens': 6, 'output_tokens': 170}
2025-03-27 19:07:54,977 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 211, 'cache_read_input_tokens': 16856, 'input_tokens': 6, 'output_tokens': 261}
2025-03-27 19:08:02,024 - vmpilot.agent_logging - INFO - TOKEN_USAGE: {'cache_creation_input_tokens': 302, 'cache_read_input_tokens': 17067, 'input_tokens': 6, 'output_tokens': 440}
INFO: 157.131.22.45:0 - "GET /models HTTP/1.1" 200 OK
Notice how the second exchange continues with 'cache_read_input_tokens': 11538, which matches the cumulative total from the last message in the previous exchange.
What I did: I created my own version of https://github.com/langchain-ai/langchain/blob/master/libs/partners/anthropic/langchain_anthropic/chat_models.py in https://github.com/drorm/vmpilot/blob/main/src/vmpilot/caching/chat_models.py, and then in https://github.com/drorm/vmpilot/blob/main/src/vmpilot/ look at agent.py and vmpilot.py to see how I set ephemeral. (The code is a bit messy right now; Claude and I are refactoring it :-)).
I've gone up to 50K cached tokens, but around 20K-30K the quality starts degrading, the speed becomes painful, and I start seeing timeouts.
This is working now, see: https://python.langchain.com/docs/integrations/chat/anthropic/#incremental-caching-in-conversational-applications
Still doesn't work for me.
This should be closed by https://github.com/langchain-ai/langchain/pull/30967 upon release. Here's an example:
Define prompt:
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate(
[
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are a technology expert.",
},
{
"type": "text",
"text": "{context}",
"cache_control": {"type": "ephemeral"},
},
],
},
{
"role": "user",
"content": "{query}",
},
]
)
Usage:
import requests
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-3-7-sonnet-20250219")
# Pull LangChain readme
get_response = requests.get(
"https://raw.githubusercontent.com/langchain-ai/langchain/master/README.md"
)
readme = get_response.text
chain = prompt | llm
response_1 = chain.invoke(
{
"context": readme,
"query": "What's LangChain, according to its README?",
}
)
response_2 = chain.invoke(
{
"context": readme,
"query": "Extract a link to the LangChain tutorials.",
}
)
usage_1 = response_1.usage_metadata["input_token_details"]
usage_2 = response_2.usage_metadata["input_token_details"]
print(f"First invocation:\n{usage_1}")
print(f"\nSecond:\n{usage_2}")
# First invocation:
# {'cache_read': 0, 'cache_creation': 1519}
# Second:
# {'cache_read': 1519, 'cache_creation': 0}
How do I apply caching to the system prompt and tools in the scenario below, where the prompt is created using from_messages?
prompt = ChatPromptTemplate.from_messages(
[
(
"system",
instruction_tone_combined_prompt_value,
),
MessagesPlaceholder(variable_name="chat_history"),
MessagesPlaceholder(variable_name="user_profile"),
("user", "{input}"),
MessagesPlaceholder(variable_name="agent_scratchpad"),
]
)
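Following the prompt-definition syntax from the comment above, here is a hedged sketch of how the same prompt might carry cache_control on the system block. The asker's names (instruction_tone_combined_prompt_value, chat_history, user_profile, agent_scratchpad) are assumed to be defined elsewhere, and mixing dict entries with MessagesPlaceholder is assumed to be supported once the linked PR is released:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

prompt = ChatPromptTemplate(
    [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": instruction_tone_combined_prompt_value,
                    # Mark the (long, static) system text for caching.
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        MessagesPlaceholder(variable_name="chat_history"),
        MessagesPlaceholder(variable_name="user_profile"),
        ("user", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)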
I ran into the same issue – the system prompt was not being cached when using Claude – and I found out it was because of the minimum cacheable prompt length required by the Claude models:
The minimum cacheable prompt length is:
- 1024 tokens for Claude Opus 4.1, Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.7, Claude Sonnet 3.5 [deprecated] and Claude Opus 3 [deprecated]
- 2048 tokens for Claude Haiku 3.5 and Claude Haiku 3
Shorter prompts cannot be cached, even if marked with cache_control...
Source: Prompt caching | Cache limitations
Fulfilling the prompt length condition and using the same prompt definition syntax as mentioned by ccurme in the comment above worked for me.
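If caching silently fails, a rough sanity check along these lines can help (a sketch only: system_prompt_text is a placeholder for your actual prompt, the count is approximate, and the exact threshold depends on the model per the limits quoted above):
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import SystemMessage

llm = ChatAnthropic(model="claude-3-5-sonnet-20240620")
# Estimate the token count of the prompt you are trying to cache.
n = llm.get_num_tokens_from_messages([SystemMessage(content=system_prompt_text)])
if n < 1024:
    print(f"~{n} tokens: below the minimum cacheable length, so cache_control will be ignored")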