Gemini API: Context Caching
First: We ❤️ LiteLLM
I wish it supported the new Gemini context caching: https://ai.google.dev/gemini-api/docs/caching?lang=python
I admit I haven't thought the API through well, since this is a feature that only one provider offers at this point (but it likely won't be the last).
Originally posted by @Taytay in https://github.com/BerriAI/litellm/issues/361#issuecomment-2177657893
Hi @Taytay, happy to add this - can you help us out with:
- the ideal interface to use this with LiteLLM?
@krrishdholakia +1 for this feature. Waiting for this to get implemented. Any timeframe please? @ishaan-jaff
Hi @MervinPraison - do you want to use this with the LiteLLM SDK or the proxy server?
Hey, I explored this briefly - open question: how does storing to the cache work?
So would this be like a callback?
@ishaan-jaff Using it with the LiteLLM SDK, so that it uses the cache each time the API call is made.
@krrishdholakia It's like an extra parameter when defining the model (`genai.GenerativeModel.from_cached_content(cached_content=cache)`).
3 steps:
- Upload the file (`genai.upload_file`)
- Create the cache (`caching.CachedContent.create`)
- Use the cache when defining the model (`genai.GenerativeModel.from_cached_content(cached_content=cache)`)
Note:
Steps 1 and 2 can be handled separately.
The key is step 3, with the extra parameter: `model = genai.GenerativeModel.from_cached_content(cached_content=cache)`
```python
import datetime
import time

import google.generativeai as genai
from google.generativeai import caching

# Download video file
# curl -O https://storage.googleapis.com/generativeai-downloads/data/Sherlock_Jr_FullMovie.mp4
path_to_video_file = 'Sherlock_Jr_FullMovie.mp4'

# Upload the video using the Files API
video_file = genai.upload_file(path=path_to_video_file)

# Wait for the file to finish processing
while video_file.state.name == 'PROCESSING':
    print('Waiting for video to be processed.')
    time.sleep(2)
    video_file = genai.get_file(video_file.name)

print(f'Video processing complete: {video_file.uri}')

# Create a cache with a 5 minute TTL
cache = caching.CachedContent.create(
    model='models/gemini-1.5-flash-001',
    display_name='sherlock jr movie',  # used to identify the cache
    system_instruction=(
        'You are an expert video analyzer, and your job is to answer '
        'the user\'s query based on the video file you have access to.'
    ),
    contents=[video_file],
    ttl=datetime.timedelta(minutes=5),
)

# Construct a GenerativeModel which uses the created cache.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)

# Query the model
response = model.generate_content([(
    'Introduce different characters in the movie by describing '
    'their personality, looks, and names. Also list the timestamps '
    'they were introduced for the first time.')])

print(response.usage_metadata)
```
This could be one approach: the LiteLLM completion function would accept an extra param, cached_content (implementing only step 3 as mentioned above).
```python
response = completion(
    model="gemini/gemini-1.5-pro",
    cached_content=cache,
    messages=[{"role": "user", "content": "Introduce different characters in the uploaded movie"}],
)
```
We already support this for Vertex AI (not Google AI Studio though).
https://github.com/BerriAI/litellm/pull/4492
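For reference, a rough sketch of what that could look like from the SDK side. The `cached_content` kwarg here just mirrors the proposal above and is an assumption, not LiteLLM's confirmed Vertex AI interface - see the PR for the actual implementation - and the cache resource name is a placeholder.

```python
# Hypothetical sketch only: reusing a pre-created Vertex AI cache from litellm.completion.
# The cached_content kwarg mirrors the proposal above and is an assumption, not a
# confirmed LiteLLM parameter; the cachedContents resource name below is a placeholder.
import litellm

response = litellm.completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Introduce the characters in the cached movie."}],
    cached_content="projects/my-project/locations/us-central1/cachedContents/1234",
)
print(response.choices[0].message.content)
```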
@Manouchehri Thanks for that. It would also be a more seamless experience if this were implemented for Google AI Studio,
because Vertex AI requires creating a dedicated GCP project for each individual user, and it is also location-specific, which complicates things.
@Manouchehri @ishaan-jaff FYI, DeepSeek has now also introduced context caching. I believe other APIs will soon include this feature as well:
https://platform.deepseek.com/api-docs/news/news0802/
Hey @MervinPraison, DeepSeek caching is a bit different (it runs automatically on every chat completion request), and it's already supported by LiteLLM.
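For illustration, a minimal sketch of how that automatic caching behaves through LiteLLM: the second call with the same long prefix should be served (partly) from DeepSeek's server-side cache. Which cache-hit counters show up in the usage object (e.g. prompt_cache_hit_tokens) depends on the LiteLLM version, so treat that as an assumption.

```python
# Minimal sketch: DeepSeek caching is automatic and server-side, so no extra
# parameters are needed. Repeating the same long prefix should hit the cache
# on the second call. The exact cache-hit fields surfaced in response.usage
# (e.g. prompt_cache_hit_tokens) depend on the LiteLLM version -- assumption.
import litellm

long_context = "<a long, reused system prompt or document goes here>"

for attempt in range(2):
    response = litellm.completion(
        model="deepseek/deepseek-chat",
        messages=[
            {"role": "system", "content": long_context},
            {"role": "user", "content": "Summarize the document above."},
        ],
    )
    print(attempt, response.usage)
```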
Prompt caching has now been added on Anthropic as well:
https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
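For comparison, a short sketch of the Anthropic flavour through LiteLLM, based on Anthropic's cache_control content-block format; treat the exact pass-through behaviour as an assumption for your installed LiteLLM version.

```python
# Sketch of Anthropic prompt caching through LiteLLM: mark a large, reusable
# content block with cache_control so Anthropic caches it across requests.
# Based on Anthropic's cache_control content-block format; the exact LiteLLM
# pass-through behaviour for your installed version is an assumption here.
import litellm

response = litellm.completion(
    model="anthropic/claude-3-5-sonnet-20240620",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are a legal assistant. Here is the full contract text: ...",
                    "cache_control": {"type": "ephemeral"},  # ask Anthropic to cache this block
                }
            ],
        },
        {"role": "user", "content": "List the termination clauses."},
    ],
)
print(response.usage)  # cache creation / read token counts appear here on supported versions
```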
The "Prompt Caching | liteLLM" docs page does not mention supporting Gemini. Is this supported or not? I am confused.
@NightMachinery According to `litellm.utils.supports_prompt_caching`, none of the Gemini models support it as of now:
```python
from litellm.utils import supports_prompt_caching

gemini_keys = [
    'google/gemini-1-5-pro-latest',
    'google/gemini-2-0-flash',
    'google/gemini-2-0-pro',
    'google/gemini-2-5-flash',
    'google/gemini-2-5-pro',
]

for key in gemini_keys:
    print(f"Model: {key}, Supports Prompt Caching: {supports_prompt_caching(model=key)}")
```
```
Provider List: https://docs.litellm.ai/docs/providers
Model: google/gemini-1-5-pro-latest, Supports Prompt Caching: False
Provider List: https://docs.litellm.ai/docs/providers
Model: google/gemini-2-0-flash, Supports Prompt Caching: False
Provider List: https://docs.litellm.ai/docs/providers
Model: google/gemini-2-0-pro, Supports Prompt Caching: False
Provider List: https://docs.litellm.ai/docs/providers
Model: google/gemini-2-5-flash, Supports Prompt Caching: False
Provider List: https://docs.litellm.ai/docs/providers
Model: google/gemini-2-5-pro, Supports Prompt Caching: False
```