
[Bug]: Cache control injection points for Anthropic/Bedrock

Open kresimirfijacko opened this issue 7 months ago • 6 comments

What happened?

I followed the recent changes adding support for cache control injection points for the Anthropic API: https://github.com/BerriAI/litellm/pull/9996

So far it works well, but I stumbled on some things that I don't fully understand, possibly because of the documentation: https://docs.litellm.ai/docs/tutorials/prompt_caching and, maybe even more important: https://docs.litellm.ai/docs/completion/prompt_caching#anthropic-example

The conversation history (previous messages) is included in the messages array. The final turn is marked with cache-control, for continuing in followups. The second-to-last user message is marked for caching with the cache_control parameter, so that this checkpoint can read from the previous cache.

I would like to be able to configure everything in the LiteLLM proxy, mainly for long chat conversations. If I understood the documentation correctly, I would need cache_control set on the last and second-to-last messages, but that is not possible in the proxy configuration today. If only role is set (e.g. user), Bedrock eventually returns an exception, since a maximum of 4 messages can be marked for caching. If index is set, it is only applied when 0 <= targetted_index < len(messages), so it is not possible to pass -1 or similar to target the last messages, per the code in: https://github.com/BerriAI/litellm/blob/f5996b2f6ba45ec3859a716e28f6e6eff0f7a0b3/litellm/integrations/anthropic_cache_control_hook.py#L42
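
For reference, these are roughly the two injection-point shapes I can express in the proxy config today (model name is a placeholder; the keys follow the PR linked above), and neither lets me target only the last two messages:

```yaml
model_list:
  - model_name: bedrock-claude-long-chat
    litellm_params:
      model: bedrock/anthropic.claude-3-7-sonnet-20250219-v1:0
      cache_control_injection_points:
        # role-based: every matching message gets cache_control, which quickly
        # exceeds Bedrock's limit of 4 cached blocks in a long conversation
        - location: message
          role: user
        # index-based: only 0 <= index < len(messages) is honoured,
        # so there is no way to say "last" or "second-to-last"
        - location: message
          index: 2
```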

I believe there should be options in the LiteLLM proxy config that make this possible out of the box.

Relevant log output


Are you a ML Ops Team?

No

What LiteLLM version are you on ?

v1.67.0-stable

Twitter / LinkedIn details

No response

kresimirfijacko avatar Apr 23 '25 10:04 kresimirfijacko

Is the ask here to support -1 as an index? So we'll always insert the control on the last message?

ishaan-jaff avatar Apr 23 '25 14:04 ishaan-jaff

Is the ask here to support -1 as an index? So we'll always insert the control on the last message?

I am not quite sure. From this documentation: https://docs.litellm.ai/docs/providers/anthropic#caching---continuing-multi-turn-convo

it seems cache_control is necessary on both the last and second-to-last messages.

kresimirfijacko avatar Apr 23 '25 15:04 kresimirfijacko

Let's take a step back: what is your goal with using cache control injection, @kresimirfijacko?

ishaan-jaff avatar Apr 23 '25 15:04 ishaan-jaff

Let's take a step back: what is your goal with using cache control injection, @kresimirfijacko?

Good question... Goals:

  1. 'Static prompt caching' - for prompts that always have the same (long) system message, I want to cache that system prompt. This is easily achievable through the proxy configuration (see the config sketch below).
  2. 'Multi-turn conversation' - for example, a chat where there is a long document and the user has multiple interactions with the LLM. If I understood correctly, with a cache on these messages the user should have a better experience (reduced latency and overall reduced costs for longer conversations). Am I missing something?
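
For goal 1, something along these lines in the proxy config works for me (model name is a placeholder; the injection-point shape follows the PR above):

```yaml
model_list:
  - model_name: claude-cached
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      cache_control_injection_points:
        # mark the long, static system message for caching
        - location: message
          role: system
```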

kresimirfijacko avatar Apr 23 '25 15:04 kresimirfijacko

  1. Can you confirm you can achieve this now?
  2. What do you want to cache here? The user message? The assistant message? (Can you show me where you'd want litellm to insert cache controls?)

ishaan-jaff avatar Apr 24 '25 04:04 ishaan-jaff

  1. Can you confirm you can achieve this now?
  2. What do you want to cache here? The user message? The assistant message? (Can you show me where you'd want litellm to insert cache controls?)

  1. I confirm it works.
  2. I will copy-paste an example of what I want in a multi-turn conversation, and that is that the conversation is constantly cached...

I made several requests simulating a chat conversation. The messages are multiplied by a factor of 200 to trigger caching on the Anthropic side (smaller messages don't get cached). Below every request I wrote the output of response.usage. Is my example clear? Is this even possible? This is the behaviour I get if I mark the last and second-to-last messages for caching.

from openai import OpenAI

client = OpenAI(
    api_key='',
    base_url='',
)


response = client.chat.completions.create(
    model="unified",
    messages = [
        {
            "role": "system",
            "content": "You are a helpful weather assistant."
        },
        {
            "role": "user",
            "content": "Can you give me the weather forecast for London today?" * 200,
            "cache_control": {
                "type": "ephemeral",
            }
        },
    ]
)
print(response.usage.model_extra)
# {'cache_creation_input_tokens': 2210, 'cache_read_input_tokens': 0}


response = client.chat.completions.create(
    model="unified",
    messages = [
        {
            "role": "system",
            "content": "You are a helpful weather assistant."
        },
        {
            "role": "user",
            "content": "Can you give me the weather forecast for London today?" * 200,
            "cache_control": {
                "type": "ephemeral",
            }
        },
        {
            "role": "assistant",
            "content": "Certainly! For London today, the forecast is partly cloudy with a high of 18 degrees Celsius and a low of 9 degrees Celsius. There's a 20% chance of rain in the afternoon."
        },
        {
            "role": "user",
            "content": "Will I need a jacket?" * 200,
            "cache_control": {
                "type": "ephemeral",
            }
        },
    ]
)
print(response.usage.model_extra)
# {'cache_creation_input_tokens': 1253, 'cache_read_input_tokens': 2210}



response = client.chat.completions.create(
    model="unified",
    messages = [
        {
            "role": "system",
            "content": "You are a helpful weather assistant."
        },
        {
            "role": "user",
            "content": "Can you give me the weather forecast for London today?" * 200
        },
        {
            "role": "assistant",
            "content": "Certainly! For London today, the forecast is partly cloudy with a high of 18 degrees Celsius and a low of 9 degrees Celsius. There's a 20% chance of rain in the afternoon."
        },
        {
            "role": "user",
            "content": "Will I need a jacket?" * 200,
            "cache_control": {
                "type": "ephemeral",
            }
        },
        {
            "role": "assistant",
            "content": "Given the low of 9 degrees Celsius, especially in the evening, I would recommend bringing a light jacket or sweater. It could get a bit chilly once the sun goes down."
        },
        {
            "role": "user",
            "content": "Anything unusual about the weather pattern today, like strong winds or high UV index?" * 200,
            "cache_control": {
                "type": "ephemeral",
            }
        },
    ]
)
print(response.usage.model_extra)
# {'cache_creation_input_tokens': 3446, 'cache_read_input_tokens': 3463}
# cache_read_input_tokens is 3463, which is from the previous request: 1253+2210=3463

kresimirfijacko avatar Apr 24 '25 11:04 kresimirfijacko

I am also encountering this issue with supported AWS Bedrock Models.

I tried to follow Anthropic’s example at the very end of their cookbook: https://github.com/anthropics/anthropic-cookbook/blob/main/misc/prompt_caching.ipynb

However, this does not currently work on Bedrock. As @kresimirfijacko stated, the cache_control point needs to be placed at the last and second-to-last message for the caching to work on Bedrock. Just marking the last message with cache_control will always write the whole message history to the cache again, rather than just the latest message(s).

At the moment, the cache_control_injection_points do not support a configuration that makes this possible.

@ishaan-jaff Maybe we could allow passing a list of indices, such as {"role": "user", "location": "message", "index": [-1, -2]}, to inject the points into the last and second-to-last messages of the specified role.
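
In litellm_config.yaml form, that hypothetical option could look something like this (to be clear, the list-of-indices / negative-index form is the proposal, not something that works today):

```yaml
cache_control_injection_points:
  # proposed, not currently supported: negative indices counted from the end,
  # and a list so both the last and second-to-last user messages are cached
  - location: message
    role: user
    index: [-1, -2]
```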

Does this sound reasonable to you?

makefinks avatar May 14 '25 10:05 makefinks

I'm also blocked from using prompt caching. My setup is fairly simple: an agent using the OpenAI Agents SDK:

        super().__init__(
            name="writer",
            instructions="""You are an expert novelist. 
...
""",
            model=LitellmModel(
                model="openai/claude-sonnet-4-cached",
                api_key="loomscape-dev-key",
                base_url=get_proxy_url()
            ),
            tools=[...]
        )

litellm_config.yaml:

model_list:
  - model_name: claude-sonnet-4-cached
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      cache_control_injection_points:
        - location: message
          role: assistant

general_settings:
  master_key: loomscape-dev-key

Even with a single cache injection point, a multi-turn conversation soon runs out of cache blocks and fails:

Error running agent: litellm.BadRequestError: OpenAIException - litellm.BadRequestError: AnthropicException - {"type":"error","error":{"type":"invalid_request_error","message":"A maximum of 4 blocks with cache_control may be provided. Found 5."}}. Received Model Group=claude-sonnet-4-cached
Available Model Group Fallbacks=None

@makefinks, what you're trying to do with indices would be helpful for my use case.


As a side note, I couldn't find a way to set up prompt caching injection via the LiteLLM SDK, only via the proxy, which is not convenient for a local agentic app.

Aivean avatar Jun 16 '25 02:06 Aivean

I got prompt caching working for Bedrock Claude Sonnet 4 via the LiteLLM Python SDK. Make sure your prompt is larger than Bedrock's minimum cache token limit of 1024 tokens. All you need is the following parameter in the request:

        if self.system_prompt_support:
            params["cache_control_injection_points"] = [
                {"location": "message", "role": "system"}
            ]

        generator = await litellm.acompletion(
            model=self.model_id,
            messages=built_prompt.messages,
            stream=True,
            api_base=self.api_base,
            api_key=self.api_key,
            stream_options={"include_usage": True},
            **params,
        )

Also note that I passed stream_options={"include_usage": True}, which adds the cached-token usage data to the response. You can then see the cached token totals, typically at the end of the stream.
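
For anyone wiring this up from scratch, here is a rough end-to-end sketch of the same idea (the Bedrock model ID is a placeholder, and the usage handling assumes the OpenAI-style include_usage behaviour described above):

```python
import asyncio

import litellm


async def main():
    # Placeholder Bedrock model ID; use whatever your account has enabled.
    response = await litellm.acompletion(
        model="bedrock/anthropic.claude-sonnet-4-20250514-v1:0",
        messages=[
            # long, static system prompt (must exceed Bedrock's 1024-token
            # minimum for the cache checkpoint to be created)
            {"role": "system", "content": "You are a helpful assistant. " * 300},
            {"role": "user", "content": "Summarize our conversation so far."},
        ],
        stream=True,
        stream_options={"include_usage": True},
        # ask litellm to inject cache_control into the system message
        cache_control_injection_points=[
            {"location": "message", "role": "system"},
        ],
    )

    async for chunk in response:
        # with include_usage set, the usage (including cache_creation_input_tokens
        # and cache_read_input_tokens) arrives on the final chunk(s)
        usage = getattr(chunk, "usage", None)
        if usage:
            print(usage)


asyncio.run(main())
```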

porteron avatar Jun 18 '25 00:06 porteron

@porteron

{"location": "message", "role": "system"}

This would cache only the system prompt, though. If there is a back-and-forth with user/tool messages after that, those are not cached.

Aivean avatar Jun 18 '25 03:06 Aivean

Would appreciate this enhancement too. I'm new to this lib and haven't signed the CLA, so if someone has contributed before, feel free to use this working monkeypatch for inspiration:

from litellm.integrations.anthropic_cache_control_hook import (
    AnthropicCacheControlHook,
)
from litellm.types.llms.openai import ChatCompletionCachedContent


def _allow_negative_index(point, messages):
    """
    Accept index = -1, -2 … and default to the last message
    when neither index nor role is given.
    """
    control = point.get("control") or ChatCompletionCachedContent(type="ephemeral")

    idx = point.get("index")
    role = point.get("role")

    if idx is None and role is None:
        idx = -1

    if idx is not None:
        if idx < 0:
            idx = len(messages) + idx          # -1 → last, -2 → second-last
        if 0 <= idx < len(messages):
            AnthropicCacheControlHook._safe_insert_cache_control_in_message(
                messages[idx], control
            )
        return messages

    for msg in messages:
        if msg.get("role") == role:
            AnthropicCacheControlHook._safe_insert_cache_control_in_message(
                msg, control
            )
    return messages


AnthropicCacheControlHook._process_message_injection = staticmethod(
    _allow_negative_index
)
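
With the patch in place, negative indices can then be passed straight through, e.g. (untested sketch, same parameter shape as the SDK examples above; the monkeypatch module must be imported before the call):

```python
import litellm

messages = [
    {"role": "system", "content": "You are a helpful weather assistant."},
    {"role": "user", "content": "Can you give me the weather forecast for London today?" * 200},
    {"role": "assistant", "content": "Partly cloudy, high of 18 C, low of 9 C."},
    {"role": "user", "content": "Will I need a jacket?" * 200},
]

response = litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=messages,
    cache_control_injection_points=[
        # negative indices are only honoured with the patch applied
        {"location": "message", "index": -1},
        {"location": "message", "index": -2},
    ],
)
print(response.usage)
```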

jamding avatar Jul 09 '25 23:07 jamding

I am working towards a fix for this (part of Amazon).

AnandKhinvasara avatar Jul 26 '25 02:07 AnandKhinvasara