
Gemini Agent run from cached context

tranhoangnguyen03 opened this issue 1 year ago • 25 comments

I have a corpus that I cache using Gemini context caching.

from google.generativeai import caching

cache = caching.CachedContent.create(
    model='models/gemini-1.5-flash-001',
    display_name='book_123_abc', # used to identify the cache
    system_instruction=(
        'Your job is to answer the user\'s query based on the book you have access to.'
    ),
    contents=[md_book_123_abc]
)

I want to set up a gemini agent dedicated to this cached corpus. Does Pydantic AI support this workflow?

tranhoangnguyen03 avatar Jan 13 '25 06:01 tranhoangnguyen03

@tranhoangnguyen03,

Could you please provide a bit more info here? This seems like a RAG sort of situation?

sydney-runkle avatar Jan 24 '25 15:01 sydney-runkle

Not necessarily a RAG situation, though I imagine it could be the case. I have 3 scenarios where context caching might be useful.

  1. Data processing:
  • I have 10_000 text chunks that I need to process.
  • I write an instruction prompt with lots of in-context examples to cover many edge cases. The instruction might be thousands of tokens long.
  • Since every request will contain this long instruction as input, I might as well cache it and cut input processing costs significantly.
  2. Resampling:
  • I need to do, say, sentiment analysis on customer feedback, but I also need to produce a metric for certainty.
  • I can sample sentiment for the same customer feedback n times and observe the normalized frequency of the "positive" label -> my proxy for certainty.
  • Since I'm running the same prompt over and over, caching the prompt will reduce the input processing cost significantly.
  3. Document-based RAG:
  • I have a retrieval system to match the user query to the most relevant document.
  • The entire document will be used to answer the question.
  • I can cache the document to reduce the input processing cost, since each document will be reused by many queries.
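The resampling scenario can be sketched locally; the `certainty` helper below is just an illustration of the normalized-frequency proxy (the actual model sampling is elided, and the sample labels are made up):

```python
from collections import Counter

def certainty(labels: list[str], target: str = "positive") -> float:
    """Normalized frequency of `target` across n sampled labels,
    used as a proxy for the model's certainty."""
    counts = Counter(labels)
    return counts[target] / len(labels)

# e.g. labels collected from n=5 runs of the same cached prompt
samples = ["positive", "positive", "negative", "positive", "positive"]
print(certainty(samples))  # 0.8
```

With caching, only the short feedback text changes between the n runs, so the long instruction prefix is billed at the cached rate.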

Hope that clears things up. @sydney-runkle

tranhoangnguyen03 avatar Feb 01 '25 08:02 tranhoangnguyen03

This issue is stale, and will be closed in 3 days if no reply is received.

github-actions[bot] avatar Feb 11 '25 14:02 github-actions[bot]

@sydney-runkle any idea if this is feasible?

tranhoangnguyen03 avatar Feb 12 '25 02:02 tranhoangnguyen03

+1 for this. It would help optimize the use case of running multiple completions over the same cached document. The implementation could be simple: just add a cachedContent: "$CACHE_NAME" parameter to the completion payload:

curl -X POST "${BASE_URL}/${MODEL}:generateContent?key=$GOOGLE_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "contents": [
      { "parts": [{ "text": "'$PROMPT'" }], "role": "user" }
    ],
    "cachedContent": "'$CACHE_NAME'"
  }'

aubinmazet avatar Feb 12 '25 14:02 aubinmazet

Closing this issue as it has been inactive for 10 days.

github-actions[bot] avatar Feb 24 '25 14:02 github-actions[bot]

+1, agreed. This is a showstopper for using Pydantic AI, because cost is an issue.

heyjohnlim avatar Mar 10 '25 04:03 heyjohnlim

+1. I am running agentic systems where the planner and deep-research agents need to access the cached data multiple times: every time a tool fires, the data is reread.

Frankly, I now just access the OpenAI and Anthropic APIs directly because I need to save costs. I'm +1'ing to keep this open and show that Pydantic AI needs to consider this scenario.

heyjohnlim avatar Apr 08 '25 03:04 heyjohnlim

+1 for this. Maybe add a way to pass the cache_name to the model parameters at runtime? This would enable a huge cost optimization on my end.

brunorpinho avatar May 02 '25 16:05 brunorpinho

+1 ! Would be a game changer in our projects

PierreFaraut avatar May 05 '25 10:05 PierreFaraut

+1. Same for us.

ethanabrooks avatar May 05 '25 16:05 ethanabrooks

And here. With the massive context window that Gemini offers, context caching is a huge cost saver.

richard-oscaridp avatar May 09 '25 15:05 richard-oscaridp

Now that Gemini does implicit caching for Gemini 2.5 models, is it possible for us to tell whether we hit or missed the cache?

That said, the lion's share of Gemini usage is still on Gemini 2 Flash and Gemini 2 Flash Lite. Does Pydantic AI plan to support explicit caching for non-2.5 models?

tranhoangnguyen03 avatar May 12 '25 04:05 tranhoangnguyen03

Supposedly you can see whether it was a hit in result.usage(), but I'm unable to hit the cache after several tries.

Kludex avatar May 26 '25 07:05 Kludex

> Now that Gemini does implicit caching for Gemini 2.5 models, is it possible for us to tell whether we hit or missed the cache?
>
> That said, the lion's share of Gemini usage is still on Gemini 2 Flash and Gemini 2 Flash Lite. Does Pydantic AI plan to support explicit caching for non-2.5 models?

For OpenRouter, you can use extra_body in the Agent definition and see the response results (in Logfire or elsewhere).

https://openrouter.ai/docs/features/prompt-caching

Agent(
    ...,
    model_settings={
        "extra_body": {"usage": {"include": True}},
    },
)

bbkgh avatar Jun 01 '25 10:06 bbkgh

I'm also looking for this! My use case is pretty similar to the ones already described here: I have a Markdown file with a lot of context, and it's loaded via a tool call by every agent.

@agent.tool_plain
def load_context() -> str:
    """Load context information.

    Returns:
        Comprehensive information about products, target customers, and
        industry focus.
    """
    return (Path(__file__).parent / "context.md").read_text(encoding="utf-8")

I am, however, using OpenAI, but the idea is the same.

MicaelJarniac avatar Jul 24 '25 15:07 MicaelJarniac

I would also love to see this feature implemented! I'm firing a high number of requests over a short period of time and I would love to cache my system prompt to reduce costs while still being able to use pydantic-ai.

herostavl avatar Jul 30 '25 09:07 herostavl