
Gemini Agent run from cached context

tranhoangnguyen03 opened this issue 1 year ago • 25 comments

I have a corpus that I cache using Gemini context caching.

from google.generativeai import caching

cache = caching.CachedContent.create(
    model='models/gemini-1.5-flash-001',
    display_name='book_123_abc', # used to identify the cache
    system_instruction=(
        'Your job is to answer the user\'s query based on the book you have access to.'
    ),
    contents=[md_book_123_abc]
)

I want to set up a gemini agent dedicated to this cached corpus. Does Pydantic AI support this workflow?

tranhoangnguyen03 avatar Jan 13 '25 06:01 tranhoangnguyen03

@tranhoangnguyen03,

Could you please provide a bit more info here? This seems like a RAG sort of situation?

sydney-runkle avatar Jan 24 '25 15:01 sydney-runkle

Not necessarily a RAG situation, though I imagine it could be the case. I have 3 scenarios where context caching might be useful.

  1. Data processing:
  • I have 10_000 text chunks that I need to process.
  • I write an instruction prompt with lots of in-context examples to cover many edge cases. The instruction might be thousands of tokens long.
  • Since every request will contain this long instruction as input, I might as well cache it and cut input processing costs significantly.
  2. Resampling:
  • I need to do, say, sentiment analysis on customer feedback, but I also need to produce a metric for certainty.
  • I can sample sentiment for the same customer feedback n times and observe the normalized frequency of the "positive" label -> my proxy for certainty.
  • Since I'm running the same prompt over and over, caching the prompt will reduce the input processing cost significantly.
  3. Document-based RAG:
  • I have a retrieval system to match the user query to the most relevant document.
  • The entire document will be used to answer the question.
  • I can cache the document to reduce the input processing cost, since each document will be reused by many queries.
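The resampling scenario can be sketched locally; the `certainty` helper below is just an illustration of the normalized-frequency proxy (the actual model sampling is elided, and the sample labels are made up):

```python
from collections import Counter

def certainty(labels: list[str], target: str = "positive") -> float:
    """Normalized frequency of `target` across n sampled labels,
    used as a proxy for the model's certainty."""
    counts = Counter(labels)
    return counts[target] / len(labels)

# e.g. labels collected from n=5 runs of the same cached prompt
samples = ["positive", "positive", "negative", "positive", "positive"]
print(certainty(samples))  # 0.8
```

With caching, only the short feedback text changes between the n runs, so the long instruction prefix is billed at the cached rate.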

Hope that clears things up. @sydney-runkle

tranhoangnguyen03 avatar Feb 01 '25 08:02 tranhoangnguyen03

This issue is stale, and will be closed in 3 days if no reply is received.

github-actions[bot] avatar Feb 11 '25 14:02 github-actions[bot]

@sydney-runkle any idea if this is feasible?

tranhoangnguyen03 avatar Feb 12 '25 02:02 tranhoangnguyen03

+1 for this. It would help optimize the use case of running multiple completions over the same cached document. The implementation could be simple: just add a cachedContent: "$CACHE_NAME" parameter to the completion payload:

curl -X POST "${BASE_URL}/${MODEL}:generateContent?key=$GOOGLE_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "contents": [
      { "parts": [{ "text": "'$PROMPT'" }], "role": "user" }
    ],
    "cachedContent": "'$CACHE_NAME'"
  }'

aubinmazet avatar Feb 12 '25 14:02 aubinmazet

Closing this issue as it has been inactive for 10 days.

github-actions[bot] avatar Feb 24 '25 14:02 github-actions[bot]

+1, agreed. This is a showstopper for using Pydantic AI, because cost is an issue.

heyjohnlim avatar Mar 10 '25 04:03 heyjohnlim

+1. I am running agentic systems where the planner and deep-research agents need to access the cached data multiple times: every time a tool fires, the data is reread.

Frankly, I now just access the OpenAI and Anthropic APIs directly because I need to save costs. I'm +1'ing to keep this open and show that Pydantic AI needs to consider this scenario.

heyjohnlim avatar Apr 08 '25 03:04 heyjohnlim

+1 for this. Maybe add a way to pass the cache_name to the model parameters at runtime? This would enable a huge cost optimization on my end.

brunorpinho avatar May 02 '25 16:05 brunorpinho

+1 ! Would be a game changer in our projects

PierreFaraut avatar May 05 '25 10:05 PierreFaraut

+1. Same for us.

ethanabrooks avatar May 05 '25 16:05 ethanabrooks

And here. With the massive context window that Gemini offers, context caching is a huge cost saver.

richard-oscaridp avatar May 09 '25 15:05 richard-oscaridp

Now that Gemini does implicit caching for Gemini 2.5 models, is it possible for us to tell whether we hit or missed the cache?

That said, the lion's share of Gemini usage is still on Gemini 2 Flash and Gemini 2 Flash Lite. Does Pydantic AI plan to support explicit caching for non-2.5 models?

tranhoangnguyen03 avatar May 12 '25 04:05 tranhoangnguyen03

Supposedly you can see whether it was a hit in result.usage(), but I'm unable to hit the cache after several tries.

Kludex avatar May 26 '25 07:05 Kludex

> Now that Gemini does implicit caching for Gemini 2.5 models, is it possible for us to tell whether we hit or missed the cache?
>
> That said, the lion's share of Gemini usage is still on Gemini 2 Flash and Gemini 2 Flash Lite. Does Pydantic AI plan to support explicit caching for non-2.5 models?

For OpenRouter, you can use extra_body in the Agent definition and see the response results (in Logfire or elsewhere).

https://openrouter.ai/docs/features/prompt-caching

Agent(
    ...,
    model_settings={
        "extra_body": {"usage": {"include": True}},
    },
)

bbkgh avatar Jun 01 '25 10:06 bbkgh

I'm also looking for this! My use case is pretty similar to the ones already described here: I have a Markdown file with a lot of context, and it's loaded via a tool call by every agent.

@agent.tool_plain
def load_context() -> str:
    """Load context information.

    Returns:
        Comprehensive information about products, target customers, and
        industry focus.
    """
    return (Path(__file__).parent / "context.md").read_text(encoding="utf-8")

I am, however, using OpenAI, but the idea is the same.

MicaelJarniac avatar Jul 24 '25 15:07 MicaelJarniac

I would also love to see this feature implemented! I'm firing a high number of requests over a short period of time and I would love to cache my system prompt to reduce costs while still being able to use pydantic-ai.

herostavl avatar Jul 30 '25 09:07 herostavl