Fix/re-implement the memory condenser
This PR proposes a re-implementation of the short-term memory condensation process currently in use for the monologue agent. It should address @li-boxuan's comments on the previous PR.
CC @xingyaoww: I did not check CodeAct; this PR basically continues rewriting / refactoring the monologue stuff a little more generically.
Goals:
- [x] a somewhat better condense step, e.g.:
  - multiple attempts to condense, to avoid ContextWindowLimitError
  - only condense recent events, not the default actions always included in the prompt
  - do more on earlier memories
- [x] refactor the (monologue) prompts to split:
  - default events ('thoughts' that the monologue agent learns in the initial prompt and that are supposed to be in every prompt since)
  - recent events (current session). Only recent events should be summarized; monologue was summarizing them all.
- [x] an interface for agents to use the memory condenser
  - a simple interface: condense(), needs_condense(), get_token_limit() (a rough sketch follows after this list)
  - not happy with some of it though: for example, init needs Callables which assume knowledge of tasks and background commands, and tasks aren't even used at all; it should more likely be some generate_prompt() which gives only what is needed, ~~or just the size of the relevant prompts~~.
- [x] refactor monologue to move out all of the code for condensing, e.g. parsing summaries, checking tokens
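For reference, here is a rough sketch of the interface shape described above. The exact signatures in the PR may differ; the prompt-building callables and the plain event lists below are illustrative placeholders rather than the real code:

```python
# Rough sketch only; the actual PR code may differ. The prompt-building
# callables and the plain `list` of events are illustrative placeholders.
from typing import Callable


class MemoryCondenser:
    def __init__(
        self,
        action_prompt: Callable[..., str],     # builds the agent's regular prompt
        summarize_prompt: Callable[..., str],  # builds the condensation prompt
    ):
        self.action_prompt = action_prompt
        self.summarize_prompt = summarize_prompt

    def get_token_limit(self, llm) -> int:
        """Return the maximum prompt size (in tokens) the model accepts."""
        raise NotImplementedError

    def needs_condense(self, llm, default_events: list, recent_events: list) -> bool:
        """Return True if the prompt built from these events would exceed the limit."""
        raise NotImplementedError

    def condense(self, llm, default_events: list, recent_events: list) -> list:
        """Summarize older recent events until the prompt fits; default events
        are left untouched. Returns the (possibly shortened) recent events."""
        raise NotImplementedError
```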
> Should address @li-boxuan comments on the previous PR.
Could you please remind me what the previous PR was 🤣 I have a very short memory and cannot wait to fix my brain with your condenser.
Oh just this bit. https://github.com/OpenDevin/OpenDevin/pull/1614#discussion_r1591899154
Can you check your environment variables and see if anything contains "workplace"?
> @enyst commented on this pull request, in tests/integration/mock/CodeActAgent/test_edits/prompt_002.log (https://github.com/OpenDevin/OpenDevin/pull/1771#discussion_r1601067204):
>
> @@ -237,10 +237,10 @@ open bad.txt
> OBSERVATION:
> -[File: /workspace/bad.txt (4 lines total)]
> -1:This is a stupid typoo.
> -2:Really?
> -3:No mor typos!
> +[File: /workplace/bad.txt (4 lines total)]
>
> No idea what happened here, I only generated with the regenerate script
Ah! Yes, it does include 'workplace'. I'll regenerate in a bit on a clean environment.
I would suggest reverting all existing test artifact changes before you regenerate. The script is designed so that if a test passes, it doesn't attempt to call the real LLM and regenerate.
Btw, could you share which environment variable was the culprit? I'd like to open a PR to fix the script.
All clean after the restore, and it seems they don't need regenerating. Monologue had changes only to the summarize prompt, which is not used.
The variables were in toml this time:
[core]
workspace_base="/Users/enyst/repos/workplace"
workspace_mount_path="/Users/enyst/repos/workplace"
workspace_mount_path_in_sandbox="/workplace"
It looks to me like just good old PEBKAC. I should have had a clean environment; I believe you documented that somewhere.
> I should have had a clean environment
That would be great, but it shouldn't be mandatory. I'll check whether the script fails to mask any environment variable here. Thanks for the information!
Aha, yeah, we didn't mask the WORKSPACE_MOUNT_PATH_IN_SANDBOX variable in the script. I'll fix it!
UPDATE: already fixed!
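For reference, the missing masking is roughly of this shape. This is an illustrative sketch only, not the actual fix; WORKSPACE_BASE and WORKSPACE_MOUNT_PATH are assumed uppercase counterparts of the toml keys shown above, while WORKSPACE_MOUNT_PATH_IN_SANDBOX is the variable named in this discussion:

```python
# Illustrative sketch, not the actual fix: pin the workspace-related variables
# before regenerating, so a developer's local config (e.g. a "/workplace"
# path) can't leak into the mock prompt artifacts.
import os

os.environ['WORKSPACE_MOUNT_PATH_IN_SANDBOX'] = '/workspace'  # the variable named above
# Assumed uppercase counterparts of the toml keys shown earlier:
os.environ['WORKSPACE_BASE'] = os.path.join(os.getcwd(), '_test_workspace')
os.environ['WORKSPACE_MOUNT_PATH'] = os.environ['WORKSPACE_BASE']
```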
@li-boxuan Re: your question about the 3 attempts: good question! My thinking is this: we have a set of events, they're too large, and the last few matter most. Constraint: the summarize prompt might (and in practice does!) run into ContextWindowExceeded itself if it's sent with all recent events.
The PR is attempting two naive approaches to make it work better:
- condense half, and repeat if still necessary (a rough sketch follows below). I tested with 5 attempts and reduced it. If the need to condense arose when there were 30 events (which can happen, though not with GPT-4-1106 or the like), then 3 is already almost too many rounds, because in the worst case (each half is reduced to 1-2 summary events, yet more condensing is still needed) we're left with very few fully detailed events.
- just tell the LLM to prioritize later events. I'm not sure the PR prompt does this well; I think it has at least stopped being worse than the current prompt. IMHO it has the potential to work better than some naive in-code approach, and I've considered dropping the first approach and just keeping and iterating on this one.
Still, I think we need something to deal with the possibility that summarizing all events at the same time errors out, even if it's only one round with a subset. This PR doesn't attempt to produce a good summarization, just to make it work more often than it does (and to take some more stuff out of monologue).
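To make the first approach concrete, here is a minimal sketch of the halving loop; summarize() and fits_in_context() are hypothetical helpers, and the real PR code differs in details:

```python
# Minimal sketch of "condense the older half, repeat if still too large".
# `summarize` and `fits_in_context` are hypothetical helpers, not PR code.
MAX_ATTEMPTS = 3  # bounded, so we don't end up with almost no detailed events


def condense_recent_events(events: list, summarize, fits_in_context) -> list:
    for _ in range(MAX_ATTEMPTS):
        if fits_in_context(events):
            return events
        # Summarize the older half and keep the newest events intact,
        # since the last few matter most to the agent.
        half = max(1, len(events) // 2)
        summary = summarize(events[:half])  # may itself hit the context window
        events = [summary] + events[half:]
    return events
```

Each round roughly halves the number of detailed events, which is why more than about 3 rounds quickly leaves very little detail.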
I don't know; perhaps @rbren or @xingyaoww can tell whether this PR makes it better or worse, or whether we should just cut monologue out until the condensing process is ready for CodeAct and then see, or some other alternative.
Side note: since it doesn't touch the last events, what this PR also doesn't do is deal with a funny case: the last observation being significantly larger than the rest. Like when GPT-4 escaped onto the internet and came back with more than 80k tokens, good times! It doesn't deal with it because that case should be handled now: we have code elsewhere that trims the output to a low number of characters (hard-coded 5000 for monologue, I believe), and we also have code that removes "heavy stuff" like screenshots that weren't supposed to be sent anyway. That should work, give or take that I've seen some weird things I can't replicate, like output from docker spilling over... somehow. Perhaps it was before all these other measures were put in place.
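For reference, the trimming mentioned here is roughly of this shape. This is a hedged sketch, since the real code lives elsewhere in the repo and only the 5000-character figure comes from this comment:

```python
# Hedged sketch of observation trimming, not the repo's actual implementation.
MAX_OBSERVATION_CHARS = 5000  # the hard-coded figure mentioned above


def truncate_observation(content: str, limit: int = MAX_OBSERVATION_CHARS) -> str:
    """Trim an overly long observation so one event can't blow up the prompt."""
    if len(content) <= limit:
        return content
    # Keep the head and note the cut, so the agent knows the output was truncated.
    return content[:limit] + f'\n[... output truncated to {limit} characters ...]'
```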
Thank you! Yeah, that's a very interesting and challenging topic, and I wonder if we should look at some recent research work on it.
This LGTM overall!
We'll probably need to iterate on this a bunch, but the "condense the first half" strategy seems like a good idea.
Btw, ^^ if this is completed, I think this PR has the potential to solve https://github.com/OpenDevin/OpenDevin/issues/1748.
Hey @enyst, could you ping me when the conflicts are resolved, and I'll take a look?
@neubig Thank you. I have a WIP to implement the behavior suggested by @xingyaoww and apply it to CodeAct. I think it's best to close this one and open a new PR from a branch restarted almost clean, with a patch applied to main. (I'll also move it to a branch in the main repo.)
This was initially made with monologue in mind, but applying the older version won't really work as such, because monologue is also cut off from the condenser at the moment: we've updated it to not store its own event history in https://github.com/OpenDevin/OpenDevin/pull/1863. We can re-attach it when done, and summarized events/messages are restored in the event stream. (I think?)
This got really interesting when I attempted to apply these insights to how it could work. I feel it deeply changes both the process and even whether the summary is an Action or an Observation. The best part is, CodeAct itself is working on it!
Cool, we can close this, but that's exciting!