Gemini API: Context Caching
First: We ❤️ LiteLLM
I wish it supported the new Gemini context caching: https://ai.google.dev/gemini-api/docs/caching?lang=python
I admit I haven't thought the API through well, since this is a feature that only one provider offers at this point (but it likely won't be the last).
Originally posted by @Taytay in https://github.com/BerriAI/litellm/issues/361#issuecomment-2177657893
Hi @Taytay, happy to add this - can you help us out with:
- the ideal interface to use this with LiteLLM?
@krrishdholakia +1 for this feature. Waiting for this to get implemented. Any timeframe please? @ishaan-jaff
Hi @MervinPraison - do you want to use this with the LiteLLM SDK or the proxy server?
Hey, I explored this briefly - open question: how does storing to the cache work?
So would this be like a callback?
@ishaan-jaff Using it with the LiteLLM SDK, so that it uses the cache each time the API call is made.
@krrishdholakia It's like an extra parameter when defining the model (`genai.GenerativeModel.from_cached_content(cached_content=cache)`).
3 steps:
- Upload the file (`genai.upload_file`)
- Create the cache (`caching.CachedContent.create`)
- Use the cache when defining the model (`genai.GenerativeModel.from_cached_content(cached_content=cache)`)
Note:
Steps 1 and 2 can be handled separately.
The key is step 3, with the extra parameter: `model = genai.GenerativeModel.from_cached_content(cached_content=cache)`
```python
import datetime
import time

import google.generativeai as genai
from google.generativeai import caching

# Download video file
# curl -O https://storage.googleapis.com/generativeai-downloads/data/Sherlock_Jr_FullMovie.mp4
path_to_video_file = 'Sherlock_Jr_FullMovie.mp4'

# Upload the video using the Files API
video_file = genai.upload_file(path=path_to_video_file)

# Wait for the file to finish processing
while video_file.state.name == 'PROCESSING':
    print('Waiting for video to be processed.')
    time.sleep(2)
    video_file = genai.get_file(video_file.name)

print(f'Video processing complete: {video_file.uri}')

# Create a cache with a 5 minute TTL
cache = caching.CachedContent.create(
    model='models/gemini-1.5-flash-001',
    display_name='sherlock jr movie',  # used to identify the cache
    system_instruction=(
        'You are an expert video analyzer, and your job is to answer '
        'the user\'s query based on the video file you have access to.'
    ),
    contents=[video_file],
    ttl=datetime.timedelta(minutes=5),
)

# Construct a GenerativeModel which uses the created cache.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)

# Query the model
response = model.generate_content([(
    'Introduce different characters in the movie by describing '
    'their personality, looks, and names. Also list the timestamps '
    'they were introduced for the first time.')])

print(response.usage_metadata)
```
This could be one approach: the LiteLLM completion function would accept an extra param, cached_content (implementing only step 3 as mentioned above).
```python
response = completion(
    model="gemini/gemini-1.5-pro",
    cached_content=cache,
    messages=[{"role": "user", "content": "Introduce different characters in the uploaded movie"}],
)
```
We already support this for Vertex AI (not Google AI Studio though).
https://github.com/BerriAI/litellm/pull/4492
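For reference, a rough sketch of what that could look like from the SDK side. The `cached_content` kwarg here just mirrors the proposal above and is an assumption, not LiteLLM's confirmed Vertex AI interface - see the PR for the actual implementation - and the cache resource name is a placeholder.

```python
# Hypothetical sketch only: reusing a pre-created Vertex AI cache from litellm.completion.
# The cached_content kwarg mirrors the proposal above and is an assumption, not a
# confirmed LiteLLM parameter; the cachedContents resource name below is a placeholder.
import litellm

response = litellm.completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Introduce the characters in the cached movie."}],
    cached_content="projects/my-project/locations/us-central1/cachedContents/1234",
)
print(response.choices[0].message.content)
```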
@Manouchehri Thanks for that. It would also be a more seamless experience if this were implemented for Google AI Studio,
because Vertex AI requires creating a dedicated GCP project for each individual user, and it is also location-specific, which complicates things.
@Manouchehri @ishaan-jaff FYI, DeepSeek has now also introduced context caching. I believe other APIs will soon include this feature as well:
https://platform.deepseek.com/api-docs/news/news0802/
Hey @MervinPraison, DeepSeek caching is a bit different (it runs automatically on every chat completion request), and it's already supported by LiteLLM.
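For illustration, a minimal sketch of how that automatic caching behaves through LiteLLM: the second call with the same long prefix should be served (partly) from DeepSeek's server-side cache. Which cache-hit counters show up in the usage object (e.g. prompt_cache_hit_tokens) depends on the LiteLLM version, so treat that as an assumption.

```python
# Minimal sketch: DeepSeek caching is automatic and server-side, so no extra
# parameters are needed. Repeating the same long prefix should hit the cache
# on the second call. The exact cache-hit fields surfaced in response.usage
# (e.g. prompt_cache_hit_tokens) depend on the LiteLLM version -- assumption.
import litellm

long_context = "<a long, reused system prompt or document goes here>"

for attempt in range(2):
    response = litellm.completion(
        model="deepseek/deepseek-chat",
        messages=[
            {"role": "system", "content": long_context},
            {"role": "user", "content": "Summarize the document above."},
        ],
    )
    print(attempt, response.usage)
```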
Prompt caching has now been added on Anthropic as well:
https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
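For comparison, a short sketch of the Anthropic flavour through LiteLLM, based on Anthropic's cache_control content-block format; treat the exact pass-through behaviour as an assumption for your installed LiteLLM version.

```python
# Sketch of Anthropic prompt caching through LiteLLM: mark a large, reusable
# content block with cache_control so Anthropic caches it across requests.
# Based on Anthropic's cache_control content-block format; the exact LiteLLM
# pass-through behaviour for your installed version is an assumption here.
import litellm

response = litellm.completion(
    model="anthropic/claude-3-5-sonnet-20240620",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are a legal assistant. Here is the full contract text: ...",
                    "cache_control": {"type": "ephemeral"},  # ask Anthropic to cache this block
                }
            ],
        },
        {"role": "user", "content": "List the termination clauses."},
    ],
)
print(response.usage)  # cache creation / read token counts appear here on supported versions
```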
The "Prompt Caching | liteLLM" docs page does not mention supporting Gemini. Is this supported or not? I am confused.
@NightMachinery According to `litellm.utils.supports_prompt_caching`, none of the Gemini models support it as of now:
```python
from litellm.utils import supports_prompt_caching

gemini_keys = [
    'google/gemini-1-5-pro-latest',
    'google/gemini-2-0-flash',
    'google/gemini-2-0-pro',
    'google/gemini-2-5-flash',
    'google/gemini-2-5-pro',
]

for key in gemini_keys:
    print(f"Model: {key}, Supports Prompt Caching: {supports_prompt_caching(model=key)}")
```
```
Provider List: https://docs.litellm.ai/docs/providers
Model: google/gemini-1-5-pro-latest, Supports Prompt Caching: False
Provider List: https://docs.litellm.ai/docs/providers
Model: google/gemini-2-0-flash, Supports Prompt Caching: False
Provider List: https://docs.litellm.ai/docs/providers
Model: google/gemini-2-0-pro, Supports Prompt Caching: False
Provider List: https://docs.litellm.ai/docs/providers
Model: google/gemini-2-5-flash, Supports Prompt Caching: False
Provider List: https://docs.litellm.ai/docs/providers
Model: google/gemini-2-5-pro, Supports Prompt Caching: False
```