
perf(condensation): Condenser that uses cache from agent

Open happyherp opened this issue 9 months ago • 14 comments

This is a work in progress.

  • [ ] This change is worth documenting at https://docs.all-hands.dev/
  • [x] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

End-user friendly description of the problem this fixes or functionality that this introduces.

Greatly reduces the cost of a condensation by reusing the LLM's prompt cache


Give a summary of what the PR does, explaining any non-trivial design decisions.


Link of any specific issues this addresses. https://github.com/happyherp/OpenHands/issues/14

happyherp avatar Mar 30 '25 11:03 happyherp

LLM Context Condensation and Cache

Context

I have started to get real work done with OpenHands + Claude 3.7. But one thing keeps happening: as soon as the conversation gets longer,

  • I run into the rate-limit. This slows me down, obviously.
  • I spend a lot of money on tokens.

But what really hits me is when an LLM condensation happens.

Cache enables long conversations

Cost of Using Cache with Anthropic API

According to Anthropic's pricing:

  • Prompt caching write: $3.75 / MTok (plain input is $3 / MTok, but I will treat the two as the same here)
  • Prompt caching read: $0.30 / MTok
  • Output: $15 / MTok

This means that cached input tokens cost one-tenth as much as fresh input tokens.

Example

Imagine we start with a 3k-token initial prompt (each symbol in the diagrams below stands for roughly 1k tokens). The conversation then continues for seven more turns: most add about 1k of new input each, and one big response adds 4k.

Legend:

+   1k input cache write   ~0.3 cents
-   1k output              ~1.5 cents
#   1k input cache read    ~0.03 cents

A regular conversation might go like this

+++-  initial prompt and response
####+- follow up
######+- regular response
########+- regular response
##########++++- big response
###############+- follow up
#################+- follow up
###################+- follow up

13 * + * 0.3 = 3.9 cents

79 * # * 0.03 = 2.37 cents

8 * - * 1.5 = 12 cents

Total: 18.27 cents

Now imagine we did not have caching:

(92 # or +) * 0.3 = 27.6 cents

8 * - * 1.5 = 12 cents

Total: 39.6 cents

That's more than twice the price, so in this example caching really pays off. This also matches my real-life experience.
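As a sanity check, here is a small Python sketch that recomputes both totals from the diagram above, using the per-1k-token rates from the legend (plain input and cache write treated as the same price, as assumed earlier):

# Per-1k-token prices in USD, treating plain input and cache write as equal.
CACHE_WRITE = 0.003   # '+' : new input / cache write
CACHE_READ = 0.0003   # '#' : cached input read
OUTPUT = 0.015        # '-' : output

conversation = [
    "+++-",                   # initial prompt and response
    "####+-",                 # follow up
    "######+-",               # regular response
    "########+-",             # regular response
    "##########++++-",        # big response
    "###############+-",      # follow up
    "#################+-",    # follow up
    "###################+-",  # follow up
]

plus = sum(turn.count("+") for turn in conversation)    # 13
hashes = sum(turn.count("#") for turn in conversation)  # 79
dashes = sum(turn.count("-") for turn in conversation)  # 8

with_cache = plus * CACHE_WRITE + hashes * CACHE_READ + dashes * OUTPUT
without_cache = (plus + hashes) * CACHE_WRITE + dashes * OUTPUT
print(f"with cache:    ${with_cache:.4f}")     # $0.1827
print(f"without cache: ${without_cache:.4f}")  # $0.3960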

Current condensation does not use cache

Our current condensation method creates a completely new prompt, which does not take advantage of caching.

Continuing the example from above

+++-  initial prompt and response
####+- follow up
######+- regular response
########+- regular response
##########++++- big response
###############+- follow up
#################+- follow up
###################+- follow up
+++++++++++++++++++++-- condensation
###++-  conversation continues

Here a condensation reduces the context window from 21k down to 5k (of which 3k are the initial prompt).

But we pay a lot for it.

Cost of Condensation

For the condensation operation:

21 * + * 0.3 = 6.3 cents

2 * - * 1.5 = 3 cents

Total: 9.3 cents

That is because we now essentially pay for every input token twice, when we could have just paid for the cached version instead (at 10% of the cost).

Condenser uses cache

We could greatly reduce the number of new input tokens in a condensation if we could reuse the cache of the conversation. This can be easily achieved by

  • using the exact same prompt as before, except
  • with the condensation prompt added as the last message.

That way we only pay the full price for the condensation prompt, which is currently about 1k tokens. Everything else should be cached.

+++-  
####+- 
######+-
########+- 
##########++++- 
###############+- 
#################+-
###################+- 
#####################+-- condensation cached with prompt last
###++-  conversation continues

1 * + * 0.3 = 0.3 cents

21 * # * 0.03 = 0.63 cents

2 * - * 1.5 = 3 cents

Total: 3.93 cents

That is less than half the price without cache (9.3 cents).
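To make the prefix reuse concrete, here is a schematic of the two request payloads (message structures simplified for illustration, not the actual OpenHands format). The condensation request is identical to the previous agent request except for one appended message, so the provider can match the entire existing prefix against the cache.

# Schematic only: simplified message dicts, not real OpenHands structures.
previous_request_messages = [
    {"role": "system", "content": "You are a coding agent ..."},
    {"role": "user", "content": "Initial task description"},
    {"role": "assistant", "content": "..."},
    # ... the rest of the conversation so far, byte-for-byte unchanged ...
]

condensation_request_messages = previous_request_messages + [
    {"role": "user", "content": "I need you to condense our conversation history ..."},
]
# Everything in previous_request_messages is a cache read; only the final
# condensation instruction is new (cache-write) input.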

Implementation

I have started to implement this in this PR. Right now the focus is on seeing whether this works as intended, rather than on making the code perfect.

I created a new agent, LLMCacheCodeAgent, that inherits from CodeActAgent but configures its own LLMAgentCacheCondenser. This was necessary because the condenser needs to build the exact same prompt, with the same LLM, as the agent. I had to add build_llm_completion_params to CodeActAgent so the condenser can reuse the agent's prompt-generation code.
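A stripped-down sketch of that wiring (only the names LLMCacheCodeAgent, CodeActAgent, LLMAgentCacheCondenser, and build_llm_completion_params come from the description above; the constructor signatures and the condense interface are simplified placeholders, not the real OpenHands APIs):

# Sketch only: CodeActAgent is the existing OpenHands agent class (import omitted);
# interfaces are simplified for illustration.
class LLMCacheCodeAgent(CodeActAgent):
    """CodeActAgent variant whose condenser shares the agent's LLM and prompt builder."""

    def __init__(self, llm, config):
        super().__init__(llm, config)
        # Hand the condenser a reference to the agent so it can rebuild the
        # exact same prompt with the exact same LLM (and therefore its cache).
        self.condenser = LLMAgentCacheCondenser(agent=self)


class LLMAgentCacheCondenser:
    CONDENSE_INSTRUCTION = (
        "I need you to condense our conversation history to make it more efficient."
    )

    def __init__(self, agent):
        self.agent = agent

    def condense(self, events):
        # Reuse the agent's own prompt construction so the prefix is identical ...
        params = self.agent.build_llm_completion_params(events)
        # ... and only append the condensation request as the final message.
        params["messages"].append(
            {"role": "user", "content": self.CONDENSE_INSTRUCTION}
        )
        return self.agent.llm.completion(**params)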

The prompt for the condensation asks the AI to use this format:

KEEP: 1
KEEP: 2
KEEP: 3
REWRITE 4 TO 15 WITH:
User asked about database schema and agent explained the tables and relationships.
END-REWRITE
KEEP: 18

I chose this format because, by referencing the messages we want to keep by number, we avoid having the LLM quote them, which would produce a lot of output.
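For illustration, a parser for that response format might look roughly like this (a sketch; the action tuples it returns are made up for this example, not the PR's actual code):

import re

def parse_condensation_response(text):
    """Parse KEEP / REWRITE ... END-REWRITE directives into a list of actions."""
    actions = []
    lines = iter(text.splitlines())
    for line in lines:
        line = line.strip()
        if match := re.fullmatch(r"KEEP:\s*(\d+)", line):
            actions.append(("keep", int(match.group(1))))
        elif match := re.fullmatch(r"REWRITE\s+(\d+)\s+TO\s+(\d+)\s+WITH:", line):
            start, end = int(match.group(1)), int(match.group(2))
            summary = []
            for body_line in lines:
                if body_line.strip() == "END-REWRITE":
                    break
                summary.append(body_line)
            actions.append(("rewrite", start, end, "\n".join(summary)))
    return actions

Messages marked KEEP are preserved as-is; each REWRITE range is replaced by its summary text.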

Run instructions

  • Build the backend.
  • Set DEBUG=1 if you want to see details.
  • Run it.
  • Select the LLM claude3.7 (or another one that supports caching) and the agent LLMCacheCodeAgent.
  • Start a conversation.
  • Condensation happens at 100 events (set in llm_cache_code_agent.py) or when you put the word "CONDENSE!" in your query.
  • Search for "I need you to condense our conversation history to make it more efficient." in your logs directory to find the prompt_xx.log where the condensation happens.

Evaluation

It would be great if we could run this against some kind of benchmark where context condensation makes sense, while recording the cost. I would love to know how much this saves in practice compared to the current condensers.

happyherp avatar Mar 30 '25 11:03 happyherp

@enyst @csmith49 I would love your feedback on this, as I have seen you are familiar with the codebase.

happyherp avatar Mar 30 '25 12:03 happyherp

This is great, I've been wanting to test this idea for a while and your description of the problem/solution is spot-on. I'll spend some more time digging into this in the upcoming week, but a few thoughts I can leave you with now:

  1. We know the condensation wipes the cache, and that we have to pay for it when we do the condensation and after, when the cache is being rebuilt with the new summary. Since the number of tokens would grow unbounded if there was no condensation, there's a break-even point where the condensation strategy pulls ahead in terms of cost. The earlier in the conversation we can push that point, the better -- this looks like it'll do exactly that.

  2. In practice I expect we'll want to avoid a new agent and just modify the condenser interface so that we get the desired behavior, but we can worry about that after evaluating this approach.

  3. In terms of evaluating the performance of this condenser, there are a few metrics we like to look for: cost (in dollars, tokens, and time) and performance impact (both qualitative and quantitative). I've got some notebooks from when I was testing the original implementations that look at everything but the qualitative performance impact, so I'll run an evaluation and get back to you with the results!

csmith49 avatar Mar 30 '25 14:03 csmith49

Oh, very interesting! It's definitely worth looking into, maybe we can improve this. 🤔

Just some quick thoughts: we still cache the system prompt separately, but not the first message, I believe. The system prompt is here:

https://github.com/All-Hands-AI/OpenHands/blob/6d90e80c51ad876539f4f72a8ff3030e63fa6e25/openhands/memory/conversation_memory.py#L139

Just curious, does this still happen? That is, running in debug mode, after the first condensation, do we still see in the logs that some caching was applied or is it all a cache write?

From what I understand, you propose to also cache the first keep_first messages. That seems absolutely correct... they will be sent all the time. We used to have the caching marker set explicitly on the initial user message too, but we shuffled it around at some point. Now we set it every step, on the latest user/tool message. I'm a bit confused though: doesn't that mean they were sent to Anthropic with the cache marker the first time the agent went through them?

enyst avatar Mar 30 '25 14:03 enyst

Just curious, does this still happen? That is, running in debug mode, after the first condensation, do we still see in the logs that some caching was applied or is it all a cache write?

@enyst My understanding is that Anthropic requires a cache flag to be set on each message for it to be added to the cache. The need to specify it comes from the fact that when the flag is off, input tokens are slightly cheaper: $3 / MTok vs $3.75. So you could send a prompt where the first few messages have caching on, followed by others that have it off, when you do not expect them to be useful for caching. OpenAI does not care: it just does caching for you, albeit at a worse discount of only 50%. Because that is kind of an edge case that only applies to Anthropic, and the difference between plain input and cache-write input is only 25%, I decided not to think about it and keep the flag always on. We could get some savings out of it in some situations, but that was not my focus.
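For reference, this is roughly what a raw Anthropic Messages API call with cache breakpoints looks like (a minimal illustration based on Anthropic's prompt-caching docs, not the OpenHands/LiteLLM code path; the model name and message contents are placeholders):

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a coding agent ...",
            # Marks the system prompt as part of the cacheable prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "Earlier message ..."},
        {"role": "assistant", "content": "Earlier answer ..."},
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Latest message ...",
                    # Breakpoint on the last message: everything up to here
                    # (the longest matching prefix) can be read from cache.
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
    ],
)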

I'm a bit confused though: doesn't that mean that they were sent to Anthropic with the cache marker, the first time the agent went through them?

From what I understand, yes: when caching is on, all messages get the caching flag, even the ones used during condensation. But the condensation could neither

  • use the previous cached entries
  • nor create cache entries that would be useful later, because the beginning of the condensation prompt and the regular prompt (which includes tool-use instructions) are different. They have a different prefix (that's what Anthropic calls it), so caching does not happen. That is what I changed by moving the condensation prompt to the end while keeping everything else the same.

@csmith49

We know the condensation wipes the cache, and that we have to pay for it when we do the condensation and after, when the cache is being rebuilt with the new summary. Since the number of tokens would grow unbounded if there was no condensation, there's a break-even point where the condensation strategy pulls ahead in terms of cost. The earlier in the conversation we can push that point, the better -- this looks like it'll do exactly that.

Yes. I think there is a whole art to how and when you create a summary, which must be balanced against caching. I agree that we probably want to do it a lot earlier than after 100 messages. I believe it might even be a good idea to request a condensation of just the last observation if it is above a certain size. That way, we could still reuse the cache of the conversation so far. Doing this consistently would keep all the huge observations out of the context window. I am looking at you, translation.json and poetry run pytest.
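A minimal sketch of that idea (the threshold and the count_tokens/summarize helpers are hypothetical): summarizing only the newest observation leaves everything before it untouched, so the earlier prefix can still be served from cache.

# Illustrative sketch; token counting and summarization are passed in as helpers.
MAX_OBSERVATION_TOKENS = 2_000  # made-up threshold

def maybe_condense_last_observation(events, count_tokens, summarize):
    """Replace only the newest observation with a summary if it is very large.

    Because everything before the last event is untouched, the provider can
    still serve the whole earlier prefix from its cache.
    """
    if not events:
        return events
    last = events[-1]
    if count_tokens(last) <= MAX_OBSERVATION_TOKENS:
        return events
    return events[:-1] + [summarize(last)]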

In practice I expect we'll want to avoid a new agent and just modify the condenser interface so that we get the desired behavior, but we can worry about that after evaluating this approach.

Yes. Otherwise it is impossible to take advantage of caching.

In terms of evaluating the performance of this condenser, there are a few metrics we like to look for: cost (in dollars, tokens, and time) and performance impact (both qualitative and quantitative). I've got some notebooks from when I was testing the original implementations that looks at everything but the qualitative performance impact, so I'll run an evaluation and get back to you with the results!

👍

happyherp avatar Mar 30 '25 15:03 happyherp

Just curious, does this still happen? That is, running in debug mode, after the first condensation, do we still see in the logs that some caching was applied or is it all a cache write?

@enyst My understanding is that Anthropic requires a cache flag to be set on each message for it to be added to the cache.

Just to clarify what I meant here: it needs it on the last message that we want cached. It will then cache the whole prompt up to that point, which includes all the previous messages, from the beginning.

This is my understanding from Anthropic's documentation. For example, https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching#prompt-caching-examples

The cache_control parameter is placed on the system message to designate it as part of the static prefix.

During each turn, we mark the final message with cache_control so the conversation can be incrementally cached. The system will automatically lookup and use the longest previously cached prefix for follow-up messages.

This is what we are doing. Every step, we cache the system prompt, and the last message.

According to Anthropic though, if the last message suddenly didn't have the marker, or wouldn't be found because it's the first time we sent it, they would look up and use "the longest previously cached prefix".

Edited to add: That's why I was asking, doesn't it find the system message at least? If the answer is no... I'm curious why. I see, the PR changed the order... that seems smart! 🤔 How about tools?

enyst avatar Mar 30 '25 16:03 enyst

Hold off on your evaluation, @csmith49. The current condensation is buggy: the wrong events are being removed, because I naively assumed the list of events matches the list of messages.

happyherp avatar Mar 30 '25 19:03 happyherp

Hold off on your evaluation, @csmith49. The current condensation is buggy: the wrong events are being removed, because I naively assumed the list of events matches the list of messages.

No worries, just tag me here when you're ready for me to give it a spin.

csmith49 avatar Mar 30 '25 20:03 csmith49

@csmith49 I think it's now worth trying to get it to run.

Put this in the config.toml

[agent.CodeActAgent.condenser]
type = "agentcache"
trigger_word = "!CONDENSE!"
max_size = 50

I tested it with claude3.7. It sometimes makes bad choices about what it remembers/forgets, but the main goal here was to avoid cache writes, so let's see if it does that properly.

happyherp avatar Apr 07 '25 18:04 happyherp

@happyherp I finished taking a look at the data.

To test this condenser, I ran three runs over a subset of 50 SWE-bench instances with a max of 150 iterations and an automatic condensation at 40 iterations. Here's what I found:

Pre-condensation, the cache-reuse condenser shows similar performance to the baseline and the current best performing condenser. Post-condensation...things go off the rails.

[visualization: smoothed average cost per iteration]

I'm not 100% sure what is happening in this figure -- I'm plotting a smoothed average cost per iteration, and it looks like after the condensation triggers the cache reuse agent starts spending like crazy. The token consumption is about the same as the current condenser, but those tokens seem to cost a lot more.

Could just be noise: the agent usually decides to stop immediately after a condensation. I'm only seeing four resolved tasks after the condensation trigger over all three runs.

This does result in a lower overall average cost per iteration, but that may be because iterations aren't equal (the earlier ones are way cheaper than the later ones) and the cache-reuse agent is very "front-loaded":

strategy    resolved   avg. iteration   avg. cost per iteration
baseline    53%        49               $0.025
cache       47%        35               $0.020
condenser   54%        56               $0.022

Note the hit to resolution rate as well. There's definitely room to trade cost with performance, but probably not at this scale.

I think this mirrors your observation that the agent does not produce quality summaries. With so few trajectories making it past the first condensation I don't think we can trust these numbers much -- that big spike in cost might be an artifact that doesn't impact the average.

csmith49 avatar Apr 14 '25 15:04 csmith49

Yeah, that definitely looks like it is not working. It also seems you already fixed this anyway with the changes to the condenser you made over the last few weeks. I have not looked into it too deeply and don't understand why the current Condenser implementation no longer causes a cache miss.

But since I am running the latest version, I do not see the artifacts I used to have.

Before

[image]

After the condensation update

[image]

That was what I was trying to fix here. It seems to be working. It would be nice to understand how, but good job @csmith49 👏

happyherp avatar Apr 15 '25 08:04 happyherp

I think it might have been #7781 -- there's still a big token consumption spike when the summaries are produced because we can't use the cache (if you look at prompt_tokens in your graph you'll likely see it), but we're not trying to use the cache so we save ~20%.

csmith49 avatar Apr 15 '25 13:04 csmith49

I looked into that. It turns out there are two LLM calls happening during a condensation by LLMSummarizingCondenser, and their token metrics are not included in trajectory.json. I made an issue for that: https://github.com/All-Hands-AI/OpenHands/issues/7879

So that is why I saw no spikes in my graph.

happyherp avatar Apr 16 '25 10:04 happyherp

I looked into that. It turns out there are two LLM calls happening during a condensation by LLMSummarizingCondenser, and their token metrics are not included in trajectory.json. I made an issue for that: #7879

Nice work on getting that issue resolved. You're right that those metrics aren't propagated like the rest. For the graphs I generated above I'm computing the cost from the logged LLM completions -- these are automatically produced by our evaluation scripts, and I've confirmed that they do contain all the LLM calls, including the ones used for summarization.

csmith49 avatar Apr 17 '25 16:04 csmith49

I am closing this in favour of https://github.com/All-Hands-AI/OpenHands/pull/7893 which looks more promising.

happyherp avatar Apr 21 '25 13:04 happyherp