perf(condensation): Condenser that uses cache from agent - prompt like LLMSummaryCondenser
This was an experiment that branched off from https://github.com/All-Hands-AI/OpenHands/pull/7588
Instead of trying to come up with a new type of condensation prompt that uses message indices, I use the exact same prompt as LLMSummaryCondenser.
- [ ] This change is worth documenting at https://docs.all-hands.dev/
- [x] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below
End-user friendly description of the problem this fixes or functionality that this introduces.
Greatly reduces the cost of a condensation by reusing the LLM's prompt cache.
Give a summary of what the PR does, explaining any non-trivial design decisions.
Link of any specific issues this addresses. https://github.com/happyherp/OpenHands/issues/14
@csmith49 Hey. Can you re-run your test with this? This should be just the same as LLMSummaryCondenser, except that it uses the cache.
Test: Early Condensation by Keyword
Setup
Rev: 0c55c9acec8cd257bd75a563b0cca1520cdfd782
in config.toml:
[agent.codeactagent]
condenser = "LLMCacheAgentCondenser"
max_size = 50
Env:
DEBUG=1
Started as Web-Application from VSCode
Test
- Start new conversation "Hi."
- Agent answers greeting
- User: "CONDENSE!"
- Agent responds somewhat awkwardly to the condensation summary (which is hidden from the user)
Result
from the trajectory: trajectory-a8c2c67ffb48427f9e31602f30e22fc3.json
Agent Greeting
This is the last LLM call before the condensation.
{
"id": 7,
"timestamp": "2025-04-23T13:21:05.093152",
"source": "agent",
"message": "Hello! I'm OpenHands, your AI assistant. I'm here to help you with various tasks like executing commands, modifying code, solving technical problems, and more.\n\nIs there something specific you'd like me to help you with today? I can:\n\n- Explore and navigate file systems\n- Execute bash commands\n- Edit and create files\n- Run Python code\n- Help with web development tasks\n- Assist with troubleshooting\n- And much more!\n\nJust let me know what you need, and I'll be happy to assist you.",
"action": "message",
"llm_metrics": {
"accumulated_cost": 0.022965750000000004,
"accumulated_token_usage": {
"model": "anthropic/claude-3-7-sonnet-20250219",
"prompt_tokens": 4,
"completion_tokens": 121,
"cache_read_tokens": 0,
"cache_write_tokens": 5637,
"response_id": ""
},
"costs": [],
"response_latencies": [],
"token_usages": []
},
"args": "..."
},
This is the first LLM interaction, which is why there are no cache_read_tokens.
User-Message
{
"id": 9,
"timestamp": "2025-04-23T13:21:23.338160",
"source": "user",
"message": "CONDENSE!",
"action": "message",
"args": {
"content": "CONDENSE!",
"image_urls": [],
"wait_for_response": false
}
},
Condensation-Event
{
"id": 13,
"timestamp": "2025-04-23T13:21:27.247753",
"source": "agent",
"message": "Summary: I'll create a concise state summary for our conversation:\n\nUSER_CONTEXT: No specific task or requirements provided yet. Initial greeting only.\n\nCOMPLETED: None\n\nPENDING: Awaiting user's specific request or task\n\nCURRENT_STATE: \n- Available ports: 52274, 55560\n- Web server configuration requirements: Allow iframes, CORS requests, and access from any host (0.0.0.0)\n- Current date: 2025-04-23 (UTC)",
"action": "condensation",
"llm_metrics": {
"accumulated_cost": 0.028268850000000005,
"accumulated_token_usage": {
"model": "anthropic/claude-3-7-sonnet-20250219",
"prompt_tokens": 6110,
"completion_tokens": 236,
"cache_read_tokens": 5637,
"cache_write_tokens": 5765,
"response_id": ""
},
"costs": [],
"response_latencies": [],
"token_usages": []
},
"args": {
"forgotten_event_ids": [2, 4, 6, 7, 9],
"forgotten_events_start_id": null,
"forgotten_events_end_id": null,
"summary": "I'll create ... <same as above>",
"summary_offset": null
}
},
Note: cache_read_tokens is 5637, which matches cache_write_tokens in the previous LLM interaction (id=7). So the cache was used. ✔️
But we also have "cache_write_tokens": 5765, which creates a new cache for the condensation prompt that we are never going to use, because we are just doing the condensation and the next LLM call will not match this prefix.
So I should fix that and deactivate caching for the condensation call itself. [ ]
Agent Response after Condensation
{
"id": 14,
"timestamp": "2025-04-23T13:21:31.049208",
"source": "agent",
"message": "Thank you for providing the state summary. I understand that you're waiting to provide a specific task or request. I'm ready to assist you when you have a task in mind. \n\nThe information you've shared about available ports (52274, 55560), web server configuration requirements (allowing iframes, CORS requests, and access from any host), and the current date (2025-04-23 UTC) will be helpful when we start working on your task.\n\nPlease let me know what you'd like me to help you with, and I'll use the appropriate tools to assist you.",
"action": "message",
"llm_metrics": {
"accumulated_cost": 0.03271860000000001,
"accumulated_token_usage": {
"model": "anthropic/claude-3-7-sonnet-20250219",
"prompt_tokens": 11444,
"completion_tokens": 363,
"cache_read_tokens": 10967,
"cache_write_tokens": 6014,
"response_id": ""
},
"costs": [],
"response_latencies": [],
"token_usages": []
},
"args": {
"content": "Thank you for providing the state summary. I understand that you're waiting to provide a specific task or request. I'm ready to assist you when you have a task in mind. \n\nThe information you've shared about available ports (52274, 55560), web server configuration requirements (allowing iframes, CORS requests, and access from any host), and the current date (2025-04-23 UTC) will be helpful when we start working on your task.\n\nPlease let me know what you'd like me to help you with, and I'll use the appropriate tools to assist you.",
"image_urls": null,
"wait_for_response": true
}
},
accumulated_token_usage(n) - accumulated_token_usage(n-1) :
prompt_tokens: 11444 - 6110 = 5334
completion_tokens: 363 - 236 = 127
cache_read_tokens: 10967 - 5637 = 5330 (system prompt still in cache)
cache_write_tokens: 6014 - 5765 = 249 (this should be the summary)
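The same delta computation as a tiny script, with the numbers copied from the two trajectory events above (id=13 and id=14):

```python
# Per-call usage is the difference between consecutive accumulated_token_usage
# entries; the values below come from the condensation event (id=13) and the
# following agent response (id=14).
before = {"prompt_tokens": 6110, "completion_tokens": 236,
          "cache_read_tokens": 5637, "cache_write_tokens": 5765}
after = {"prompt_tokens": 11444, "completion_tokens": 363,
         "cache_read_tokens": 10967, "cache_write_tokens": 6014}

delta = {key: after[key] - before[key] for key in before}
print(delta)
# {'prompt_tokens': 5334, 'completion_tokens': 127,
#  'cache_read_tokens': 5330, 'cache_write_tokens': 249}
```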
Condensation-Prompt
This is the prompt of the condensation.
prompt003.log
You are OpenHands agent, a helpful AI assistant that can interact with a computer to solve tasks.
<ROLE>
..... SNIP ....
* When you run into any major issue while executing a plan from the user, please don't try to directly work around it. Instead, propose a new plan and confirm with the user before proceeding.
</TROUBLESHOOTING>
----------
hi
----------
<RUNTIME_INFORMATION>
... SNIP ....
</RUNTIME_INFORMATION>
----------
Hello! I'm OpenHands, your AI assistant. I'm here to help you with various tasks like executing commands, modifying code, solving technical problems, and more.
Is there something specific you'd like me to help you with today? I can:
- Explore and navigate file systems
- Execute bash commands
- Edit and create files
- Run Python code
- Help with web development tasks
- Assist with troubleshooting
- And much more!
Just let me know what you need, and I'll be happy to assist you.
----------
CONDENSE!
----------
You are maintaining a context-aware state summary for an interactive agent.
The whole conversation above will be removed from the context window. Therefore you need to track:
USER_CONTEXT: (Preserve essential user requirements, goals, and clarifications in concise form)
.... SNIP ....
For other tasks:
USER_CONTEXT: Write 20 haikus based on coin flip results
COMPLETED: 15 haikus written for results [T,H,T,H,T,H,T,T,H,T,H,T,H,T,H]
PENDING: 5 more haikus needed
CURRENT_STATE: Last flip: Heads, Haiku count: 15/20
The LLM is sent the whole conversation, including the message that triggered the condensation.
Condensation-Completion
response_003.log
I'll create a concise state summary for our conversation:
USER_CONTEXT: No specific task or requirements provided yet. Initial greeting only.
COMPLETED: None
PENDING: Awaiting user's specific request or task
CURRENT_STATE:
- Available ports: 52274, 55560
- Web server configuration requirements: Allow iframes, CORS requests, and access from any host (0.0.0.0)
- Current date: 2025-04-23 (UTC)
This is a decent summary, I would say.
Continuation of Conversation
The next llm call after the condensation
prompt_004.log
You are OpenHands agent, a helpful AI assistant that can interact with a computer to solve tasks.
<ROLE>
Your primary role is to assist users by executing commands, modifying code, and solving technical problems effectively. You should be thorough, methodical, and prioritize quality over speed.
* If the user asks a question, like "why is X happening", don't try to fix the problem. Just give an answer to the question.
</ROLE>
.... SNIP ...
3. Methodically address the most likely causes, starting with the highest probability
4. Document your reasoning process
* When you run into any major issue while executing a plan from the user, please don't try to directly work around it. Instead, propose a new plan and confirm with the user before proceeding.
</TROUBLESHOOTING>
----------
I'll create a concise state summary for our conversation:
USER_CONTEXT: No specific task or requirements provided yet. Initial greeting only.
COMPLETED: None
PENDING: Awaiting user's specific request or task
CURRENT_STATE:
- Available ports: 52274, 55560
- Web server configuration requirements: Allow iframes, CORS requests, and access from any host (0.0.0.0)
- Current date: 2025-04-23 (UTC)
Note: the original user message is gone. I intended that at least the first user message would always be kept, so that is wrong ❌
Summary
It basically worked, but I noted some todos:
- make sure the original, first user message is always kept in the conversation
- do not enable caching for the condensation completion
That looks awesome, thank you! ❤️
Just a quick thought:
make sure the original, first user message is always kept in the conversation
You are correct, and I would suggest that maybe this could be keep_first messages, so that this condenser works the same way as the LLM summarizing condenser:
https://github.com/All-Hands-AI/OpenHands/blob/5de62d85fdcde61080b23c979e872674a792c31a/openhands/memory/condenser/impl/llm_summarizing_condenser.py#L52
Re-Test: Disable Cache and Keep user message
Same setup as before, but on rev 6c456570b9d529cdd7348dcb3090630c287ec2c1
21:44:40 - openhands:DEBUG: logger.py:445 - Logging to /home/ubuntu/projects/MyOpenHands/logs/llm/25-04-23_21-38/response_003.log
21:44:40 - openhands:DEBUG: llm.py:619 - Cost: 0.02 USD | Accumulated Cost: 0.04 USD
Response Latency: 5.220 seconds
Input tokens: 6226 | Output tokens: 156
21:44:40 - openhands:INFO: caching_condenser.py:90 - Summarized 7 events. Usage:ModelResponse(id='chatcmpl-3e8cc8aa-909d-4737-84df-8b9a7b9c9b48', created=1745437480, model='claude-3-7-sonnet-20250219', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='tool_calls', index=0, message=Message(content=None, role='assistant', tool_calls=[ChatCompletionMessageToolCall(index=0, function=Function(arguments='{"thought": "I need to create a concise state summary for the conversation so far. Let me analyze what has happened:\\n\\n1. The user just said \\"hi\\" as an initial greeting\\n2. I responded with a general introduction and offered to help with various tasks\\n3. No specific tasks have been started or completed yet\\n4. No code has been examined or modified\\n5. No specific user context or requirements have been established\\n\\nThis is just the beginning of the conversation, so the state summary will be minimal."}', name='think'), id='toolu_01T7dZ5WAHYsUjxoHgK3HniN', type='function')], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=156, prompt_tokens=6226, total_tokens=6382, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))
cache_creation_input_tokens=0 is great; that's what I wanted to change. But there is also cache_read_input_tokens=0, which is what this whole ticket is trying to avoid...
Ah, caching is disabled on that prompt. Maybe what we could do there is enable it, but make sure we set the cache marker not on the last message (as in usual prompts) but on the last message before the condensation prompt...
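A minimal sketch of that idea (message shapes follow the JSON excerpt further down in this thread; this is not the actual OpenHands helper code): the cache breakpoint goes on the last message of the agent's conversation, and the appended condensation prompt is left unmarked, since its prefix will never be reused by a later call.

```python
# The conversation exactly as the agent last sent it, with the cache marker on
# its final message: this prefix matches the agent's previous request, so the
# condensation call gets a cache read instead of a fresh cache write.
conversation = [
    {"role": "system", "content": [{"type": "text", "text": "You are OpenHands agent, ..."}]},
    {"role": "user", "content": [{"type": "text", "text": "Hi."}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Hello! I'm OpenHands, ..."}]},
    {
        "role": "user",
        "content": [{
            "type": "text",
            "text": "CONDENSE!",
            "cache_control": {"type": "ephemeral"},  # breakpoint on the last *conversation* message
        }],
    },
]

condensation_request = conversation + [
    # The summary request is appended without cache_control: its prefix will
    # never be reused, so writing a cache here would be wasted.
    {"role": "user", "content": [{"type": "text", "text": "You are maintaining a context-aware state summary ..."}]},
]
```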
Re-Test#2
Same setup, but on rev d00a99fbdbb681992069bd32685cfe9ff0048c13, which only disables the cache marker on the last message.
Condenser uses Cache, does not write Cache
22:31:56 - openhands:INFO: caching_condenser.py:90 - Summarized 7 events. Usage:ModelResponse(id='chatcmpl-7ad90a8e-1473-46e8-8be3-0e9f0d36c937', created=1745440316, model='claude-3-7-sonnet-20250219', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content="I'll create a concise state summary for our conversation:\n\nUSER_CONTEXT: No specific task or requirements provided yet. User has just initiated the conversation.\n\nCOMPLETED: None\n\nPENDING: Awaiting user's specific request or task\n\nCURRENT_STATE: \n- User has access to web application hosts:\n * http://localhost:51826 (port 51826)\n * http://localhost:57648 (port 57648)\n- Today's date is 2025-04-23 (UTC)\n- When starting a web server, should use corresponding ports with options to allow iframes, CORS requests, and access from any host (0.0.0.0)\n\nVERSION_CONTROL_STATUS: Not established yet", role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=168, prompt_tokens=6107, total_tokens=6275, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=5638, text_tokens=None, image_tokens=None), cache_creation_input_tokens=91, cache_read_input_tokens=5638))
prompt_tokens 6107
- cache_creation_input_tokens 91
- cache_read_input_tokens 5638
= 378
So we got a cache read (good!), a small cache creation (unexpected), and 378 input tokens that appear neither as cache_creation_input_tokens nor as cache_read_input_tokens. That would be the condensation prompt.
My guess is the 91 cache_creation_input_tokens come from my message ("CONDENSE!"), which is not cached yet. We should not cache it either. But now it is not so clear how far back I should go disabling cache markers before I run into things that should have the cache flag set.
Keep user message
Prompt after condensation:
... standard OpenHands prompt ...
<TROUBLESHOOTING>
* If you've made repeated attempts to solve a problem but tests still fail or the user reports it's still broken:
1. Step back and reflect on 5-7 different possible sources of the problem
2. Assess the likelihood of each possible cause
3. Methodically address the most likely causes, starting with the highest probability
4. Document your reasoning process
* When you run into any major issue while executing a plan from the user, please don't try to directly work around it. Instead, propose a new plan and confirm with the user before proceeding.
</TROUBLESHOOTING>
----------
Hi.
----------
I'll create a concise state summary for our conversation:
USER_CONTEXT: No specific task or requirements provided yet. User has just initiated the conversation.
COMPLETED: None
So here we are missing the message with "CONDENSE!".
Test: Keep first messages
Setup
commit: cdbbc1e: Merge branch 'upstream-main' into condenser_experiment
Web app, launched from VSCode, Docker runtime
Test
After starting, I set a breakpoint in CodeActAgent.step, so that I can change the settings of the Condenser to keep_first=20, max_size=50.
I set another breakpoint in LLMAgentCacheCondenser.processResponse so I can watch the condensation.
I give the agent a task and wait for the condensation breakpoint to be hit.
Log:
13:20:59 - openhands:DEBUG: logger.py:445 - Logging to /home/ubuntu/projects/MyOpenHands/logs/llm/25-04-24_13-08/response_020.log
13:20:59 - openhands:DEBUG: llm.py:619 - Cost: 0.02 USD | Accumulated Cost: 0.34 USD
Response Latency: 7.599 seconds
Input tokens: 44579 | Output tokens: 275
Input tokens (cache hit): 44104
Input tokens (cache write): 192
trajectory
{
"id": 52,
"timestamp": "2025-04-24T13:24:37.185593",
"source": "agent",
"message": "Summary: I'll create a concise state summary for our current task:\n\nUSER_CONTEXT: Add tests to the codebase which contains prime number utilities and a vocabulary trainer\n\nCOMPLETED: \n- Explored the workspace to understand the codebase\n- Created a tests directory\n\nPENDING: \n- Create test files for prime_numbers.py\n- Create test files for vocab_trainer.py\n- Create test files for app.py (Flask API)\n- Create test files for generate_primes_to_file.py\n\nCODE_STATE:\n- prime_numbers.py: Contains is_prime(), sieve_of_eratosthenes(), print_primes()\n- vocab_trainer.py: VocabularyTrainer class with methods for managing vocabulary\n- app.py: Flask app with endpoints for prime number operations\n- generate_primes_to_file.py: Script to generate prime numbers to a file\n\nTESTS: No tests implemented yet\n\nCHANGES: Created /workspace/tests directory\n\nDEPS: \n- Flask, flask_cors for app.py\n- json, os, random, sys, re, datetime, difflib for vocab_trainer.py\n\nVERSION_CONTROL_STATUS: No version control information available",
"action": "condensation",
"llm_metrics": {
"accumulated_cost": 0.3389943,
"accumulated_token_usage": {
"model": "anthropic/claude-3-7-sonnet-20250219",
"prompt_tokens": 439309,
"completion_tokens": 2639,
"cache_read_tokens": 438751,
"cache_write_tokens": 44296,
"response_id": ""
},
"costs": [],
"response_latencies": [],
"token_usages": []
},
"args": {
"forgotten_event_ids": [ 27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,44,45,48,49,50,51
"forgotten_events_start_id": null,
"forgotten_events_end_id": null,
"summary": "I'll create a concise <same as above>",
"summary_offset": null
}
},
The forgotten event ids show that events have definitely been forgotten. But we said to keep the first 20, so why do the forgotten ids start at 27?
Prompt Log Before and after condensation
We can see how on the right side the condensation summary is shown, whereas on the left there is another file read.
But we said to keep the first 20, so why do the forgotten ids start at 27?
Perhaps no issue here, because these are event ids, and there are events in the event stream which are not added to the agent's history, such as some AgentChangeObservation, which is irrelevant for the agent.
Perhaps no issue here, because these are event ids, and there are events in the event stream which are not added to the agent's history, such as some AgentChangeObservation, which is irrelevant for the agent.
Makes sense, that's probably it.
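A hypothetical sketch of that explanation (illustrative names, not the real OpenHands classes): keep_first counts items in the agent's filtered history, while forgotten_event_ids are raw event-stream ids, so the first forgotten id can easily be above keep_first.

```python
from dataclasses import dataclass

@dataclass
class Event:
    id: int
    kind: str  # "message", "action", "observation", "state_change", ...

# A toy event stream where every fifth event is a state change the agent never sees.
stream = [Event(i, "state_change" if i % 5 == 0 else "message") for i in range(60)]

history = [e for e in stream if e.kind != "state_change"]  # what the agent actually sees
keep_first = 20

kept = history[:keep_first]
forgotten = history[keep_first:]
print(forgotten[0].id)  # 26 here -- above keep_first, because filtered events still consume ids
```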
I would really like to know how this version performs on a benchmark. Unfortunately I can not run those very well myself.
Do you see any more todos to get this merged?
The one thing that bothers me is that for webapp operation the config.toml is not used, so to use this condenser I had to hardcode it as the default: https://github.com/All-Hands-AI/OpenHands/pull/7893/commits/6a1f5fd812ac6b2d2848895c548a9f640785642b Ideally, there would be some way in the UI or somewhere else for the user to choose the exact condenser to be used.
Thank you, I think your logs show that the PR is working! Personally, I think we did all we could, now we can run eval.
Re: the web server doesn't read config.toml, it hardcodes values, yes. We should fix this, but for now IMHO it's fine to also hardcode for a test run.
I would really like to know how this version performs on a benchmark. Unfortunately I can not run those very well myself.
I can run evals pretty well, partly locally, partly on the remote runtime, but I don't know Calvin's notebooks and what he's looking at, nor which 50 subset exactly. I mostly run them as initial evals, sanity checks, or debugging. I'm not sure that would be useful here?
@csmith49 We need your help here. 🙏 What do you think?
I would really like to know how this version performs on a benchmark. Unfortunately I can not run those very well myself.
I can run evals pretty well, partly locally, partly on the remote runtime, but I don't know Calvin's notebooks and what he's looking at, nor which 50 subset exactly. I mostly run them as initial evals, sanity checks, or debugging. I'm not sure that would be useful here?
Thanks @enyst and @happyherp, I'll get a few evaluation runs going and report back with some graphs. The two main factors I'm looking at will be 1) impact to resolution rate and 2) average cost per-step.
Okay, I've got the data in.
Setup
I'm specifically comparing this condenser to the runs reported in the blog post here. That means a few options are standardized, and may not be optimal for actual usage.
I'm comparing this Cache strategy to the Baseline and Condenser strategies, using the same configuration in this blog post.
Specifically, I'm setting keep_first=4 and max_size=80 for the condensers, and running on a randomly-selected (but consistent across the strategies) set of 50 SWE-Bench Verified instances with 150 max iterations.
This subset correlates well with performance on the full data set, and generates trajectories with a good spread of lengths. Reported results are averaged over three runs on this subset.
The Results
Some summary results first:
| strategy | resolved | avg. cost | avg. iteration | avg. cost-per-iter. |
|---|---|---|---|---|
| baseline | 52.7% | $1.22 | 49.5 | $0.025 |
| cache | 47.3% | $4.04 | 56.6 | $0.071 |
| condenser | 54.0% | $1.21 | 55.8 | $0.022 |
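As a quick sanity check, the cost-per-iteration column follows directly from the other two (a small script over the table's own averages, not the evaluation notebook):

```python
# avg. cost-per-iteration = avg. cost / avg. iteration, using the table values
runs = {"baseline": (1.22, 49.5), "cache": (4.04, 56.6), "condenser": (1.21, 55.8)}
for name, (cost, iters) in runs.items():
    print(name, round(cost / iters, 3))
# baseline 0.025, cache 0.071, condenser 0.022 -- matches the last column
```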
The caching strategy hits the resolution rate pretty hard, and seems to dramatically increase the average cost-per-iteration while doing so. If we break out the average cost vs. iteration we start to see what's happening:
There are big spikes in cost that correspond to the condensation phases. If we break out the different token usages:
Then we see the cost comes from the fact that after condensation there aren't any cache reads and we have way too many cache writes. I'm guessing the evaluation entry-point I used isn't setting up the agent correctly. I used this script and modified the metadata to include the condenser config.
Conclusions
Whoops.
Still, these runs tell us a few things. They tell us the condensation strategy is impacting the resolution rate -- just looking at the token consumption per iteration shows a pretty big difference in how often and how aggressively the condensation is happening between the Cache and Condenser strategies.
For future tests I'll want to minimize the number of differences between the two strategies to just highlight how changing the summary generation impacts cost and performance. That probably means tweaking the Condenser strategy to forget just as aggressively and/or the Cache strategy to be less aggressive.
The runs also give us a clue as to what the performance cost could be were things configured correctly on my end. I went through and swapped the cache reads and writes post-condensation for the Cache strategy -- this still isn't quite fair to Cache, but it brings the average cost-per-iteration down to $0.027, which is much closer to being competitive.
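A back-of-the-envelope sketch of that adjustment, using assumed Claude 3.7 Sonnet pricing (base input $3/MTok, cache write at 1.25x, cache read at 0.1x, output $15/MTok) and a made-up post-condensation step; this is not the notebook behind the actual numbers, just an illustration of why swapping writes for reads changes the cost so much:

```python
PRICE_IN, PRICE_OUT = 3.0 / 1e6, 15.0 / 1e6            # assumed per-token prices
PRICE_WRITE, PRICE_READ = 1.25 * PRICE_IN, 0.1 * PRICE_IN

def step_cost(uncached_in, cache_write, cache_read, out):
    return (uncached_in * PRICE_IN + cache_write * PRICE_WRITE
            + cache_read * PRICE_READ + out * PRICE_OUT)

# Hypothetical step that re-wrote ~40k context tokens to cache instead of reading them:
observed = step_cost(uncached_in=500, cache_write=40_000, cache_read=0, out=300)
adjusted = step_cost(uncached_in=500, cache_write=0, cache_read=40_000, out=300)
print(round(observed, 3), round(adjusted, 3))  # 0.156 vs 0.018 -- roughly an 8-9x difference
```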
Take-Aways
I'm glad to run more experiments if y'all think it would be fruitful. I'm a little concerned about the performance gap, but maybe with some tweaks to the prompt or to the number of forgotten events the gap can be closed without impacting cost too much?
Also a little concerned with needing extra configuration steps to get working (we've got something like 4 different configuration pipelines). Outside the scope of the current exploration, for sure, but something we'll need to resolve at some point.
Oh, wow. Thank you Calvin!
Then we see the cost comes from the fact that after condensation there aren't any cache reads and we have way too many cache writes. I'm guessing the evaluation entry-point I used isn't setting up the agent correctly. I used this script and modified the metadata to include the condenser config.
This and the graph above really look to me like what is happening is: the cache agent's history is the system prompt and maybe the first message, not much else, then something changes every step, then come the rest of the messages.
I think that would explain the growing number of cache writes, and the same small number of cache reads. I may be wrong, please consider it a guess.
I mean something changes literally: I think we'd see this graph if the content of a message soon after the system message changes, because that invalidates the cache for the rest of the context after it. One of the first five messages maybe.
An alternative explanation could be something like this. I just saw we currently don't use that anymore. When we used it: the reminder phrase contained a variable, which was incremented every step. That cannot be part of the cached prompt, because if it was, it would write the cache, and never hit it. So it needs to be added in a different TextContent after the TextContent with the cache marker. If there's something in the cache-marked TextContent that changes, that could explain it too. 🤔
It's interesting that this thing, whatever it is, seems to only kick into action after the first condensation, not before it, if I interpret this correctly.
@csmith49 thanks for giving it a try. I can't say I like the results 😥. Something is clearly not working as planned. I would like to know what, because when I last ran the code, it did not act like that. But there was a merge. It might have brought something in that broke it in some way the tests did not catch.
I mean something changes literally: I think we'd see this graph if the content of a message soon after the system message changes, because that invalidates the cache for the rest of the context after it. One of the first five messages maybe.
An alternative explanation could be something like this. I just saw we currently don't use that anymore. When we used it: the reminder phrase contained a variable, which was incremented every step. That cannot be part of the cached prompt, because if it was, it would write the cache, and never hit it. So it needs to be added in a different TextContent after the TextContent with the cache marker. If there's something in the cache-marked TextContent that changes, that could explain it too. 🤔
It's interesting that this thing, whatever it is, seems to only kick into action after the first condensation, not before it, if I interpret this correctly.
Yes. Some weird stuff going on after the first condensation that effectively disables caching. But on top of that, the resolution rate also goes down.
The good thing is, the cache write during condensation is gone.
@csmith49 can you provide the trajectory or other output of a single iteration where it does badly, so I can take a look at it? Maybe it's something obvious.
I'm guessing the evaluation entry-point I used isn't setting up the agent correctly
Yeah, that could explain things 😅
It would be best if I could run SWE-bench myself.
Which in my case involves buying a new hard drive.
Actually, it also involves someone else buying the hard drive, downloading SWE-bench onto it, and then mailing it to me.
OK, it's not that bad.
But still. 4 days.
@happyherp I've got the trajectories for all three runs loaded up here: https://github.com/csmith49/oh-trajectories
Yes. Some weird stuff going on after the first condensation that effectively disables caching. But on top of that, the resolution rate also goes down.
If you've seen non-SWE-bench runs with the appropriate caching behavior, I'm fine chalking this up to my configuration mistake for now. Maybe the completion logs can reveal more?
The resolution rate is a bit harder to explain. My gut says there's too much context being summarized and it's hurting their quality, but I haven't looked through the trajectories to see yet.
The good thing is, the cache write during condensation is gone.
That's also gone for the current condenser (thanks to #7781), but the runs I'm showing you were before that was merged.
@happyherp I ran 2 instances just so we can look at them, attached: claude-3-7-sonnet-run1.zip
Command:
...EVAL_DOCKER_IMAGE_PREFIX="us-central1-docker.pkg.dev/evaluation-092424/swe-bench-images" \
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.claude-ah HEAD CodeActAgent 2 100 2 "princeton-nlp/SWE-bench_Verified" test condenser.default_4_20
config.toml:
[condenser.default_4_20]
type="agentcache"
keep_first=4
max_size=20
Branch: this branch + https://github.com/All-Hands-AI/OpenHands/commit/2a041e70a4288cecbc217db83c1043134665c0ff
Please note that this only has inference results; one instance ended up exceeding max iterations.
Ah, thank you Calvin! That's awesome; the llm_completions folder in particular is great for seeing what actually happened.
On a side note, I picked up the way I configured the condenser for eval runs in a PR here:
- https://github.com/All-Hands-AI/OpenHands/pull/8177
Hopefully makes it a little easier to configure and run.
@happyherp I ran 2 instances just so we can look at them, attached: claude-3-7-sonnet-run1.zip
@enyst I would love to take a look. The .zip seems to be broken (invalid content). Can you upload it again?
@happyherp I've got the trajectories for all three runs loaded up here: https://github.com/csmith49/oh-trajectories
@csmith49 thanks for uploading that.
I see where the problem is: The Condensation Summary is always kept at the end of the conversation.
Example: https://github.com/csmith49/oh-trajectories/blob/main/cache-reuse-run-1/llm_completions/django__django-15916/litellm_proxy__claude-3-7-sonnet-20250219-1745862709.098081.json
[
{
"content": [
{
"type": "text",
"text": "The file /workspace/django__dja...."
}
],
"role": "tool",
"tool_call_id": "toolu_01Cb35MQtV5RWmwvXvEcuLFz",
"name": "str_replace_editor"
},
{
"content": [
{
"type": "text",
"text": "USER_CONTEXT: Implement a fix for Django's...",
"cache_control": {
"type": "ephemeral"
}
}
],
"role": "user"
}
]
The event with the condensation summary is always the last one, and it has the "cache_control". This effectively disables caching.
The cache is built on the summary event, but on the next call another message is inserted before it, so it can never be hit. I found the same in every file with a condensation that I looked at: the summary is always at the end. Why that is suddenly happening, I don't know, but I will find out tomorrow.
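A tiny illustration of that failure mode (the message contents are stand-ins): because the cache-marked summary message stays last, every new request inserts something before it, so the prefix up to the marker never matches the cache written one step earlier.

```python
# Step n writes a cache whose prefix ends at the (cache-marked) summary message.
step_n = ["system prompt", "task", "tool result A", "summary (cache-marked)"]

# Step n+1 inserts the new tool result *before* the summary, which still sits last.
step_n_plus_1 = ["system prompt", "task", "tool result A", "tool result B", "summary (cache-marked)"]

print(step_n == step_n_plus_1)  # False: the prefix up to the marker changed -> cache miss every step
```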
I made a fix for the problem of the summary staying at the end of the event-list.
@csmith49 @enyst can you try again?
@happyherp Did another quick run with a very small subset (16) just to check the caching behavior. Looks like your fix worked:
This graph is not to be compared directly with those above -- I've reduced the number of max iterations and increased the condensation frequency. But there are clearly more cache reads than writes now.
If it's helpful, all the notebooks I've used to generate condenser graphs can be found here, though they might take some tweaking to get working.
@happyherp Just to note a quick thought. This is a large PR and not everything is part of the new condenser; it seems to me that we could perhaps split out a couple of things into new PRs, such as the condense command (we are starting to have commands like this), or the bug fix for the initial user message (maybe in the form of validating / enforcing a keep_first of no less than 4). WDYT?
@enyst @csmith49 I am happy for any suggestions to get this merged, because keeping this branch up to date is painful: there are ongoing changes to the agent and condenser interfaces.
The keyword trigger
If you want, I can remove this one from the PR without much trouble. It's just that it is not much more than a configuration setting plus a 30-line method.
Change of Condenser.condense and the agent interface LLMCompletionProvider
These are the changes that keep causing merge conflicts, because they touch so many classes.
https://github.com/All-Hands-AI/OpenHands/pull/7893/commits/797acd021e992aff384dc560d2ecb1fa6973821e https://github.com/All-Hands-AI/OpenHands/pull/7893/commits/6b8cd2025d3f5b7ff3dfb70a2ea0794612c24f7c
But without them, the condenser cannot access the agent, and that is a hard requirement for the condenser to be able to build the LLM request the same way the agent would. So these are the changes I want to get merged as soon as possible. Would it be an option to do that in a separate PR, even if no code actually uses the new interface yet?
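A hypothetical sketch of that interface change (names are illustrative, loosely following the linked commits, not the actual code): the agent exposes how it builds its LLM request, and the condenser reuses it so that the request prefix is identical to what the agent has already cached.

```python
from typing import Any, Protocol


class LLMCompletionProvider(Protocol):
    """What the condenser needs from the agent: the exact message list the agent would send."""

    def get_messages_for_llm(self, events: list[Any]) -> list[dict]:
        ...


class CachePreservingCondenser:
    SUMMARY_PROMPT = "You are maintaining a context-aware state summary ..."

    def __init__(self, agent: LLMCompletionProvider, llm: Any) -> None:
        self.agent = agent
        self.llm = llm

    def build_condensation_request(self, events: list[Any]) -> list[dict]:
        # Reuse the agent's own prompt construction so the prefix hits the cache,
        # then append the summary instruction as the final, uncached message.
        messages = self.agent.get_messages_for_llm(events)
        messages.append({"role": "user", "content": self.SUMMARY_PROMPT})
        return messages
```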
Implementation
The actual work is mostly in two new files: https://github.com/happyherp/OpenHands/blob/condenser_experiment/openhands/memory/condenser/impl/caching_condenser.py https://github.com/happyherp/OpenHands/blob/condenser_experiment/openhands/memory/condenser/impl/llm_agent_cache_condenser.py
Plus the standard condenser-configuration code.
On their own, they don't do anything unless the condenser is explicitly selected.
Maybe they could be merged with some EXPERIMENTAL! notes, so that they can at least be used in eval and other modes where the user knows what they are doing?
Using this condenser as the new default
https://github.com/All-Hands-AI/OpenHands/pull/7893/commits/6a1f5fd812ac6b2d2848895c548a9f640785642b
I already marked that commit as to be dropped. It was just easier for me to test it that way, because I like to test with the webapp. By the way, there should be a better way to do this. In webapp mode, the only way to configure your condenser seems to be to edit session.py.
So I am very happy to remove anything that changes how openhands works in the default-configuration.
CondensationAction.summary_offset = None Behavior
https://github.com/All-Hands-AI/OpenHands/pull/7893/files#diff-ab802ba1898b725ac35215fb989266a86085b88e079b37c1fbbf3d50a6582ed6
This could be removed from this PR. I would then need to set summary_offset to that value in the condenser.
Other PR shares code
https://github.com/All-Hands-AI/OpenHands/pull/8102 branches off from this PR's branch, because it also needs to access the agent from a condenser. In that PR I also use the condenser from this PR as the first condenser to use the trigger with, but it would work with any condenser. So it only really depends on the agent and condenser interface changes from this PR.
Suggestion
How about 3 PRs
- Let condenser access the agent - contains all the interface/agent changes
- Condenser that uses Cache. That would be this one, but rebased on the first PR.
- Trigger Condensation by token-count. https://github.com/All-Hands-AI/OpenHands/pull/8102 , but rebased on the first PR
That way we could merge the interface changes first. That will make it a lot easier to keep the other ones synced. litellm just merged my fixes, so the trigger-by-token PR could even be merged before this one if that works the way it should.
Those are good points! I will obviously come back on this, but I do want to note another quick thought on the resolution rate, just to make sure I'm wrong (please tell me if so!), because it's bugging me and then we can get this out of the way:
Context sent to the LLM looked like this:
- system message
- task / instructions
- (recall action) - invisible to the LLM
- (recall obs) - visible in regular runs, invisible on swe-bench / evals
- events: agent action, obs, etc...
- summary prompt
I saw in the small test I made that the LLM ignored the summary prompt, and just continued its task. Then the context was mostly wiped, and it had no summary. Maybe it was a fluke or something I did.
Or we may have a more fundamental problem: as far as I recall, for long context Anthropic recommends putting the instruction at the beginning of the context because, they say, Claude works better that way.
Funnily enough, OpenAI recommends putting it at the end, and if developers really need it at the beginning, they recommend both ends. 😭
(FWIW with Gemini I got to a pattern like: task first, instructions right after it, events, every x events remind it of the instructions. 😂)
What are your thoughts, is this totally off base and maybe I even misremember how it works, or should it be rare? Or maybe, if it's accurate, we may need to rethink the prompt...?
Context sent to the LLM looked like this:
- system message
- task / instructions
- (recall action) - invisible to the LLM
- (recall obs) - visible in regular runs, invisible on swe-bench / evals
- events: agent action, obs, etc...
- summary prompt
Yes, that is the way it is supposed to work. Because I want to make use of caching, I can only add the prompt for the summary at the end.
I saw in the small test I made that the LLM ignored the summary prompt, and just continued its task. Then the context was mostly wiped, and it had no summary. Maybe it was a fluke or something I did.
Here I would need to look at the logs to know exactly what happened. What do you mean by "it ignored the summary prompt"? The response of the LLM would be processed by the condenser: https://github.com/happyherp/OpenHands/blob/condenser_experiment/openhands/memory/condenser/impl/caching_condenser.py#L88 calls the LLM and passes the response to https://github.com/happyherp/OpenHands/blob/condenser_experiment/openhands/memory/condenser/impl/llm_agent_cache_condenser.py#L113, which uses the content as the summary for the new CondensationAction.
Then, on the next llm call you should see
- system message
- task / instructions
- (recall action) - invisible to the LLM
- (recall obs) - visible in regular runs, invisible on swe-bench / evals
- summary: the response of the llm call by the condenser
- events: new events would be added here, after the summary. That was what was wrong the last time. The summary would always be at the end.
If you used this condenser from the webapp, you might not even notice that a condensation happened, because condensations are not shown there. Or is that just a bug because I set summary, but not summary_offset?
Or we may have a more fundamental problem: as far as I recall, with long context Anthropic recommends to have the instruction at the beginning of context, because, they say, Claude works better that way.
Then the approach here will not work. It relies on the summary prompt being added at the end, so that the PREFIX stays unchanged and we can cache.
If we want to have the summary instructions before the events, then they need to be there from the start. We cannot insert them later without causing a cache miss.
There is another way to do a condensation where the summary prompt is at the beginning and we still get caching: condensation as a tool. https://github.com/All-Hands-AI/OpenHands/pull/8246/files#diff-1c3bca1c2fb93868dc3262439f8d24f0af08417f86f317da25e6e28e005202ffR214 I worked on this a while ago, but that approach differs even more from what we are doing here. It would also make it easy for the agent to control its memory itself.
I saw in the small test I made that the LLM ignored the summary prompt, and just continued its task. Then the context was mostly wiped, and it had no summary. Maybe it was a fluke or something I did.
Here I would need to look at the logs to know exactly what happened. What do you mean by "it ignored the summary prompt"? The response of the LLM would be processed by the condenser
I mean ignored as in, next step was:
- system prompt
- instructions
- ... events, up to a last_event
- summary prompt
- event where the LLM continues reasoning from last event
I uploaded the test here FWIW, although Calvin's full runs are better. Both are before the bug fix. Will come back on this.
Or we may have a more fundamental problem: as far as I recall, for long context Anthropic recommends putting the instruction at the beginning of the context because, they say, Claude works better that way.
Then the approach here will not work. It relies on the summary prompt being added at the end, so that the PREFIX stays unchanged and we can cache.
If we want to have the summary instructions before the events, then they need to be there from the start. We cannot insert them later without causing a cache miss.
There is another way to do a condensation where the summary prompt is at the beginning and we still get caching: condensation as a tool.
An agent controlling its own memory sounds fascinating. ❤️
(though we may want to think more about how to do it to keep the work manageable. 🤔 One thing could be to make it a separate agent inheriting from CodeAct)
We may have another option if so, though: we could also try to add to the prompt (system or instruction), as information about a future event, e.g.
- "Apart from your primary task, you will be tasked sometimes with: ...explanation of summarization, warning to not use it until specifically told to."
- events
- when the time comes, "IMPORTANT" 😂 < continue to remind it the actual instructions >
I think you're right though, a tool is the 'standardized' and better way to do... something similar: the tool description comes first but in tool format.
We may have another option if so, though: we could also try to add to the prompt (system or instruction), as information about a future event, e.g.
"Apart from your primary task, you will be tasked sometimes with: ...explanation of summarization, warning to not use it until specifically told to." events when the time comes, "IMPORTANT" 😂 < continue to remind it the actual instructions > I think you're right though, a tool is the 'standardized' and better way to do... something similar: the tool description comes first but in tool format.
That was pretty much my chain of thought as well. Putting the prompt first, including something like a format description, felt a lot like a tool call, before we had tool calls.
To make it work the same way as before, one would still need a condenser very similar to this one, except the condensation prompt is just: "Do a condensation now".
I wonder what happens if you do not even do that, and instead just add "Do a condensation whenever you feel it makes sense, especially when the conversation approaches 3/4 of the context window." to the OpenHands prompt.
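For reference, a hypothetical sketch of what the tool variant could look like (the schema below is illustrative, not the one in PR #8246): the summary instructions live in the tool description, which sits at the front of the request in tool format and therefore stays part of the stable, cacheable prefix.

```python
# Illustrative function-calling tool definition for agent-driven condensation.
CONDENSE_TOOL = {
    "type": "function",
    "function": {
        "name": "condense",
        "description": (
            "Summarize the conversation so far so that older events can be dropped. "
            "Call this whenever the conversation approaches roughly 3/4 of the context window."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "summary": {
                    "type": "string",
                    "description": "State summary with USER_CONTEXT, COMPLETED, PENDING and CURRENT_STATE sections.",
                },
            },
            "required": ["summary"],
        },
    },
}
```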
@happyherp another PR of yours. Just for a pulse check, is this still in the works?
Hello! I'm going to close this due to lack of activity, but happy to have you take another stab if interested!