Gemini Agent run from cached context
I have a corpus that I cache using Gemini context caching.
```python
from google.generativeai import caching

cache = caching.CachedContent.create(
    model='models/gemini-1.5-flash-001',
    display_name='book_123_abc',  # used to identify the cache
    system_instruction=(
        "Your job is to answer the user's query based on the book you have access to."
    ),
    contents=[md_book_123_abc],
)
```
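For context, outside Pydantic AI the `google-generativeai` SDK can build a model bound to an existing cache. A minimal sketch (the lookup requires a configured API key, so it is wrapped in a function; `agent_from_cache` is a hypothetical helper name):

```python
def agent_from_cache(cache_resource_name: str):
    """Build a GenerativeModel bound to an existing cache.

    `cache_resource_name` is the server-assigned resource name
    (e.g. "cachedContents/..."), not the display_name.
    """
    import google.generativeai as genai
    from google.generativeai import caching

    cache = caching.CachedContent.get(cache_resource_name)  # needs an API key
    return genai.GenerativeModel.from_cached_content(cached_content=cache)
```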
I want to set up a Gemini agent dedicated to this cached corpus. Does Pydantic AI support this workflow?
@tranhoangnguyen03,
Could you please provide a bit more info here? This seems like a RAG sort of situation?
Not necessarily a RAG situation, though I imagine it could be. I have three scenarios where context caching might be useful:
- Data processing:
  - I have 10,000 text chunks that I need to process.
  - I write an instruction prompt with lots of in-context examples to cover many edge cases. The instruction might be thousands of tokens long.
  - Since every request will contain this long instruction as input, I might as well cache it and cut input processing costs significantly.
- Resampling:
  - I need to do, say, sentiment analysis on customer feedback, but I also need to produce a certainty metric.
  - I can sample the sentiment for the same piece of feedback n times and use the normalized frequency of the "positive" label as my proxy for certainty.
  - Since I'm running the same prompt over and over, caching it will reduce input processing costs significantly.
- Document-based RAG:
  - I have a retrieval system that matches the user query to the most relevant document.
  - The entire document is used to answer the question.
  - Since each document will be reused by many queries, caching it reduces input processing costs.
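The resampling scenario's certainty proxy can be sketched in plain Python; `samples` below stands in for the labels returned by n hypothetical model calls on the same feedback:

```python
from collections import Counter

def certainty(labels: list[str], target: str = "positive") -> float:
    """Normalized frequency of `target` across n samples -> certainty proxy."""
    counts = Counter(labels)
    return counts[target] / len(labels)

# e.g. the model labels the same feedback "positive" in 8 of 10 samples
samples = ["positive"] * 8 + ["negative"] * 2
print(certainty(samples))  # 0.8
```

With context caching, only the short sampled completion differs between the n calls; the long shared prompt is billed at the cached rate.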
Hope that clears things up. @sydney-runkle
This issue is stale, and will be closed in 3 days if no reply is received.
@sydney-runkle any idea if this is feasible?
+1 for this. It would help optimize use cases that run multiple completions over the same cached document. The implementation could be simple: just add a `cachedContent: "$CACHE_NAME"` field to the completion payload:

```shell
curl -X POST "${BASE_URL}/${MODEL}:generateContent?key=$GOOGLE_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "contents": [
      {"parts": [{"text": "'"$PROMPT"'"}], "role": "user"}
    ],
    "cachedContent": "'"$CACHE_NAME"'"
  }'
```
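In Python terms, the proposed change amounts to attaching one extra field to the `generateContent` request body. A sketch of the payload shape (field names per the Gemini REST API; the cache name below is illustrative):

```python
def build_payload(prompt: str, cache_name: str) -> dict:
    """generateContent request body with a cachedContent reference attached."""
    return {
        "contents": [{"parts": [{"text": prompt}], "role": "user"}],
        "cachedContent": cache_name,  # e.g. "cachedContents/abc123"
    }

payload = build_payload("Summarize chapter 3.", "cachedContents/abc123")
```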
Closing this issue as it has been inactive for 10 days.
+1, agreed. This is a showstopper for using Pydantic AI, because cost is an issue.
+1. I am running agentic systems where the planner and deep-research agents need to access the cached data multiple times: every time a tool fires, the data is read again.
Frankly, I now just access the OpenAI and Anthropic APIs directly because I need to save costs. I'm +1'ing to keep this open and show that Pydantic AI needs to consider this scenario.
+1 for this; maybe add a way to pass the cache name to the model parameters at runtime. This would enable huge cost optimization on my end.
+1 ! Would be a game changer in our projects
+1. Same for us.
And here. With the massive context window that Gemini offers, context caching is a huge cost saver.
Now that Gemini does implicit caching for Gemini 2.5 models, is it possible for us to tell whether we hit or missed the cache?
That said, the lion's share of Gemini usage is still on Gemini 2 Flash and Gemini 2 Flash Lite. Does Pydantic AI have plans to support explicit caching for non-2.5 models?
Supposedly you can see whether it was a hit in `result.usage()`, but I'm unable to hit the cache after several tries.
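For reference, the raw Gemini response reports cached tokens in `usageMetadata.cachedContentTokenCount`. A minimal check over a parsed response dict (the sample metadata below is illustrative, not from a real call):

```python
def cache_hit(usage_metadata: dict) -> bool:
    """True when part of the prompt was served from the cache."""
    return usage_metadata.get("cachedContentTokenCount", 0) > 0

# Illustrative usageMetadata as returned by generateContent
meta = {
    "promptTokenCount": 5120,
    "cachedContentTokenCount": 4096,
    "candidatesTokenCount": 120,
}
print(cache_hit(meta))  # True
```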
> Now that Gemini does implicit caching for Gemini 2.5 models, is it possible for us to tell whether we hit or missed the cache?
>
> That said, the lion's share of Gemini usage is still on Gemini 2 Flash and Gemini 2 Flash Lite. Does Pydantic AI have plans to support explicit caching for non-2.5 models?
For OpenRouter, you can set `extra_body` in the `Agent` definition and see the results in the responses (in Logfire or elsewhere):
https://openrouter.ai/docs/features/prompt-caching

```python
Agent(
    ...,
    model_settings={
        "extra_body": {"usage": {"include": True}},
    },
)
```
I'm also looking for this! My use case is pretty similar to the ones described here already: I have a Markdown file with a lot of context, and it's loaded on a tool call by every agent.

```python
from pathlib import Path

@agent.tool_plain
def load_context() -> str:
    """Load context information.

    Returns:
        Comprehensive information about products, target customers, and
        industry focus.
    """
    return (Path(__file__).parent / "context.md").read_text(encoding="utf-8")
```
I am, however, using OpenAI, but the idea is the same.
I would also love to see this feature implemented! I'm firing a high number of requests over a short period of time and I would love to cache my system prompt to reduce costs while still being able to use pydantic-ai.