Integrating MemGPT-like Functionality
What is the problem that this fixes or functionality that this introduces? Does it fix any open issues?
Fixes https://github.com/OpenDevin/OpenDevin/issues/1748
This PR proposes an implementation that integrates MemGPT-like functionality into OpenDevin's ShortTermMemory, inspired by issue https://github.com/OpenDevin/OpenDevin/issues/2487. It also uses ideas from https://github.com/OpenDevin/OpenDevin/pull/2021.
Give a brief summary of what the PR does, explaining any non-trivial design decisions
The memory is summarized (condensed) when the token count approaches the context window limit but has not yet hit it. All events in the event stream are considered summarizable, except the system and example messages. When the condenser function is called, it tries to summarize events so that at least the first 75% of the total tokens get summarized. After this, the summary is added to the history. Events are summarized in the order of their occurrence, i.e., the earliest events are summarized first. The word limit for the summary is set to <=200 by default.
Messages are passed to the LLM in this order → (system message + example message + summary + all events after the last summarized event)
Currently, this functionality is only implemented for the CodeAct agent.
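For concreteness, here is a rough sketch of that flow. The names (`condense`, `token_count`, `llm.summarize`, `trigger_ratio`) are illustrative rather than the exact ones used in this PR:

```python
# Rough sketch of the condensation flow described above; not the literal
# implementation in this PR. `events`, `token_count`, and `llm.summarize`
# are illustrative names.

SUMMARY_WORD_LIMIT = 200   # default word cap for the summary
TARGET_FRACTION = 0.75     # summarize at least the first 75% of total tokens


def condense(events, llm, max_input_tokens, trigger_ratio=0.9):
    """Summarize the oldest events when token usage approaches the limit."""
    # System and example messages are never summarized.
    protected = [e for e in events if e.is_system or e.is_example]
    candidates = [e for e in events if not (e.is_system or e.is_example)]

    total = sum(e.token_count for e in candidates)
    if total < trigger_ratio * max_input_tokens:
        return events  # still comfortably below the context window limit

    # Walk events in order of occurrence until they cover >= 75% of the tokens.
    covered, cut = 0, len(candidates)
    for i, event in enumerate(candidates):
        covered += event.token_count
        if covered >= TARGET_FRACTION * total:
            cut = i + 1
            break

    summary = llm.summarize(candidates[:cut], word_limit=SUMMARY_WORD_LIMIT)

    # Order passed to the LLM afterwards:
    # system + example messages, then the summary, then the remaining events.
    return protected + [summary] + candidates[cut:]
```

Summarizing the oldest events first keeps the most recent context verbatim for the next prompt.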
@khushvind This is amazing work, thank you! We definitely need something like this, and it's a smart algo.
I'm sorry I didn't get to look into it very well yet, but I did give it a try, and it seems there are a few issues. If you want, for replicating, I can give you the event stream I used; or, my scenario was something like: use a model with a relatively large context window (I used gpt-4o with 128k), make sure sessions are enabled[1], do stuff, then when the history gets to some 11k tokens, stop it, set max_input_tokens to 4k in the config, then restart (or just change max_input_tokens to 4k in the debugger).
This will provoke it not only to apply summarization, but to do it consecutively before getting to a regular prompt, and in part it seems to behave unexpectedly: it loses information. Also, it doesn't seem to save the summary (my run ended with an error, but I'm not sure I see where in the code it would save them, even if the run ends well?).
I can play with it more tomorrow I hope. Thank you for this!
[1] If you use OpenDevin via the UI, saving JWT_TOKEN=something in the env should do; just in case, also enable file_store="local" with some file_store_path in config.toml. If you use it with the CLI, set enable_cli_sessions=true. I find that enabling/restoring sessions is a huge time saver for this kind of stuff, for... reasons. 😅
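For reference, a minimal config sketch along those lines; the exact option placement may differ from the real schema, so treat this as illustrative only:

```toml
# Illustrative config.toml sketch for the reproduction scenario above;
# option placement may differ from the actual OpenDevin schema.
file_store = "local"
file_store_path = "/tmp/opendevin_file_store"   # any writable path
enable_cli_sessions = true                      # only needed for CLI runs

[llm]
model = "gpt-4o"
max_input_tokens = 4000   # shrink this after the history grows past ~11k tokens
```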
Some thoughts, since we're at it: I think there are two major areas which have negatively affected performance in previous work.
- the prompt. It didn't always result in detailed enough information to guide the LLM for the next steps. This PR is using mostly the old prompt from PR 2021, and I'm hoping we can figure out something better. At least, I think we need it to give clear info to the LLM about what files it shouldn't check anymore, and what files it should continue working on. I don't remember if this exact version was doing any of those reliably, but anyway, I wonder if we could adapt something more of MemGPT here? I'd favor completely ripping it out and replacing it, if we can.
- missing Recall action. This PR, like 2021, never retrieves the original events, thus must repeat them if it needs them. Now, this behavior in PR 2021 happened because that PR had gotten too large, I had to stop somewhere, and left recall/longterm memory for follow-up PRs. It might not have been the best idea ever, because this feature is incomplete without it. Also memGPT obviously relies on retrievals all the time (99% of all its memory work is recalling originals). That's totally as it should be, and we do plan to do that, but the question is: is it possible to be good enough without it? Cc: @xingyaoww
FWIW in the swe-bench evals on PR 2021, I saw some tasks where the LLM was just restarting, as if it knew nothing. It wasn't using the summary! I think this was due to both prompt and recall. Since this PR is condensing 75% of the messages, leaving a few at the end, I think we won't see that anymore. Nevertheless, TBH if we still don't include recall, we do need to fix the prompt very well...
I found some decent tasks on swe-bench for this, I'm going to retrieve them and run them on this PR. I think I'll also add some recall and see how it goes then.
@khushvind Thanks for your amazing work -- Let me know if this is ready for review again!
I have a couple of ideas regarding how we can develop this:
- We should develop a benchmark / a way to evaluate a "memory condensation algo" quantitatively:
  - It could be something like a repurposed version of Needle in a Haystack. For example, the input is a very long context prompt (128k tokens) with an instruction of "summarize everything, with a goal of XXX" (where XXX is related to the needle). (See the sketch after this list.)
  - Re-use benchmarks from MemGPT https://arxiv.org/pdf/2310.08560 to make sure we are getting comparable QA performance (using our condensation algo).
  - Run SWE-Bench with condensation enabled (sort of an ultimate metric); I think getting the first two for eval is probably easier.
- I think this PR is probably ready to merge (but not enabled by default) once we can get comparable performance to MemGPT using the benchmarks in their paper.
- Once we achieve comparable or better performance with condensation on SWE-Bench, I think we can enable it by default.
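As a rough illustration of the first idea only, here is what a repurposed needle-in-a-haystack check for the condenser might look like; the needle, `condense_history`, and `ask` are hypothetical stand-ins, not existing OpenDevin APIs:

```python
# Illustrative sketch only: a repurposed needle-in-a-haystack check for a
# memory condensation algo. The helpers `condense_history` and `ask` are
# hypothetical stand-ins, not existing OpenDevin functions.

NEEDLE = "The database password is rotated every 17 days."
QUESTION = "How often is the database password rotated?"


def build_haystack(filler_messages, needle=NEEDLE, position=0.5):
    """Bury the needle somewhere inside a very long message history."""
    events = list(filler_messages)
    events.insert(int(len(events) * position), needle)
    return events


def needle_recall_score(condense_history, ask, haystack):
    """Score 1.0 if the condensed history still answers the needle question."""
    # e.g. condense with an instruction like:
    # "summarize everything, with a goal of tracking security-related facts"
    condensed = condense_history(haystack)
    answer = ask(condensed, QUESTION)
    return 1.0 if "17" in answer else 0.0
```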
@xingyaoww I still need to run some evaluations on SWE-bench to check that it works fine on different examples, and I also need to incorporate some changes that @enyst suggested. I'll let you know once that's done. I'm not able to run evaluations right now because, for some reason, my WSL started crashing yesterday 😅. I guess I'll have to do the setup again, which might take some time.
I'll take a look at these benchmarks by the weekend.
@enyst
> It didn't always result in detailed enough information to guide the LLM for the next steps.

The WORD_LIMIT might be the reason for it. It's currently set to 200, and it seems that isn't enough to convey sufficient information in the summary after a lot of events. I need to check this.

> This PR is using mostly the old prompt from PR 2021, and I'm hoping we can figure out something better. At least, I think we need it to give clear info to the LLM about what files it shouldn't check anymore, and what files it should continue working on. I don't remember if this exact version was doing any of those reliably, but anyway, I wonder if we could adapt something more of MemGPT here? I'd favor completely ripping it out and replacing it, if we can.

MemGPT's prompt also just asks the LLM to summarize the messages, without specifically asking it to keep info about which files shouldn't be checked anymore. Maybe we can refine ours. I already included the parts of MemGPT's prompt that seemed useful in the current implementation.
Just to note, I played a little with it on some tasks on SWE-bench, and so far:
- as we already know, it needs some work on the task and the summary (on gpt-4o, which is at least trying to work)
- more surprisingly, Claude doesn't work with our current prompt: it refuses to summarize and continues the task instead (which is a little rude, if you ask me 🫢 😅)
I'm traveling today and tomorrow, we'll see soon though.
> The WORD_LIMIT might be the reason for it. It's currently set to 200, and it seems that isn't enough to convey sufficient information in the summary after a lot of events. I need to check this.
Maybe, but AFAIK the version in PR 2021 was almost identical but without a word limit, and it didn't seem enough on SWE-bench tasks (in interactive sessions it works, since the user will nudge it). Worth trying though.
> MemGPT's prompt also just asks the LLM to summarize the messages, without specifically asking it to keep info about which files shouldn't be checked anymore. Maybe we can refine ours. I already included the parts of MemGPT's prompt that seemed useful in the current implementation.
That's an interestingly simple prompt! To clarify, the reason I mention files is that on SWE-bench, the LLM appeared to repeat its earlier attempts to work with the wrong files, even after it had figured out the right files. 🤔 Maybe it was a fluke...
This PR is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This PR was closed because it has been stalled for over 30 days with no activity.