OpenHands Control Loop: long-term planning and execution

The biggest, most complicated aspect of Devin is long-term planning and execution. I'd like to start a discussion about how this might work in OpenDevin.

There's some recent prior work from Microsoft with some impressive results. I'll summarize here, with some commentary.

Overall Flow

User specifies objective and associated settings
Conversation Manager kicks in
Sends convo to Agent Scheduler
Agents execute commands
Output is placed back into the conversation
Rinse and repeat
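
To make the flow above concrete, here is a minimal sketch of the loop. The names `control_loop`, `agent`, and `execute` are hypothetical, as is the convention that the agent ends the run by replying `STOP`; the real system presumably has much richer components at each step.

```python
from typing import Callable, List


def control_loop(
    objective: str,
    agent: Callable[[List[str]], str],
    execute: Callable[[str], str],
    max_steps: int = 10,
) -> List[str]:
    """Alternate agent turns and command execution until STOP or max_steps."""
    conversation = [f"OBJECTIVE: {objective}"]
    for _ in range(max_steps):
        # The agent sees the whole conversation and proposes the next command.
        command = agent(conversation)
        conversation.append(f"AGENT: {command}")
        if command.strip() == "STOP":
            break
        # Command output is placed back into the conversation.
        output = execute(command)
        conversation.append(f"OUTPUT: {output}")
    return conversation
```

Both `agent` and `execute` are plain callables here, so an LLM call and a sandboxed shell could be swapped in without touching the loop itself.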

Configuration

A YAML file defines a set of actions/commands the bot can take (e.g. npm test)
- comment: why not just leave it open-ended?
You can have different agents with different capabilities, e.g. a "dev agent" and a "reviewer agent", who work collaboratively
- comment: this sounds like MetaGPT
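
The YAML schema isn't described here, so the following is only a guess at what a per-agent command allowlist could look like once loaded. `CONFIG` stands in for the parsed YAML, and `is_allowed` and the agent names are made up for illustration.

```python
import shlex
from typing import Dict, List

# Stand-in for the parsed YAML: each agent maps to the commands it may run.
CONFIG: Dict[str, List[str]] = {
    "dev": ["npm test", "npm run build"],
    "reviewer": ["npm test"],
}


def is_allowed(agent: str, command: str, config: Dict[str, List[str]] = CONFIG) -> bool:
    """True if the command starts with one of the agent's allowed commands."""
    tokens = shlex.split(command)
    for allowed in config.get(agent, []):
        prefix = shlex.split(allowed)
        # Compare token by token so "npm testing" does not match "npm test".
        if prefix and tokens[: len(prefix)] == prefix:
            return True
    return False
```

Matching on a token prefix lets an agent pass extra flags (e.g. `npm test --watch`) while still rejecting commands outside its list.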

Components

Conversation Manager

maintains message history and command outputs
decides when to interrupt the conversation
- comment: for what? more info from the user?
decides when the conversation is over, i.e. task has been completed
- agent can send a "stop" command, max tokens can be reached, problems w/ execution environment
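
A rough sketch of the three stop conditions listed above. The `Conversation` class, its fields, and the word-count "tokenizer" are all assumptions made for the example, not anything from the prior work.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Conversation:
    messages: List[str] = field(default_factory=list)
    max_tokens: int = 1000
    environment_ok: bool = True

    def token_count(self) -> int:
        # Crude stand-in for a real tokenizer: count whitespace-separated words.
        return sum(len(m.split()) for m in self.messages)

    def stop_reason(self) -> Optional[str]:
        """Return why the conversation should end, or None to keep going."""
        if not self.environment_ok:
            return "environment_error"
        if self.messages and self.messages[-1].strip() == "STOP":
            return "agent_stop"
        if self.token_count() >= self.max_tokens:
            return "max_tokens"
        return None
```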

Parser

interprets agent output and turns it into commands, file edits, etc
in case of parsing failure, a message is sent back to the agent to rewrite its command
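
Here is one way the parse-or-ask-for-a-rewrite behavior could look. The `RUN`/`EDIT`/`STOP` command grammar and the wording of the retry message are invented for this sketch.

```python
import re
from typing import Tuple

# Hypothetical grammar: a verb, then an optional argument.
COMMAND_PATTERN = re.compile(r"^(RUN|EDIT|STOP)\b\s*(.*)$", re.DOTALL)


def parse_agent_output(text: str) -> Tuple[bool, str, str]:
    """Return (ok, verb, argument). On failure, argument is a retry message."""
    match = COMMAND_PATTERN.match(text.strip())
    if match is None:
        return (False, "", "Could not parse your command. "
                "Please rewrite it as RUN <cmd>, EDIT <path>, or STOP.")
    verb, argument = match.group(1), match.group(2).strip()
    if verb != "STOP" and not argument:
        return False, "", f"{verb} needs an argument. Please rewrite your command."
    return True, verb, argument
```

When `ok` is false, the third element is the message that would be sent back to the agent so it can try again.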

Output Organizer

Takes command output and selectively places it into the conversation history
- sometimes summarizes the content first
- comment: why not just drop everything back into the conversation history (maybe truncating really long CLI output)
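
The simpler alternative suggested in the comment (drop everything back in, truncating long CLI output) might look like this. `truncate_output` and its head-and-tail strategy are my own assumptions, not what the prior work does.

```python
def truncate_output(output: str, max_lines: int = 20) -> str:
    """Keep the head and tail of long output and note how much was dropped."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    lines = output.splitlines()
    if len(lines) <= max_lines:
        return output
    head = max_lines // 2
    tail = max_lines - head  # always >= 1, so lines[-tail:] is safe
    omitted = len(lines) - max_lines
    marker = f"... [{omitted} lines omitted] ..."
    return "\n".join(lines[:head] + [marker] + lines[-tail:])
```

Keeping the tail matters for CLI output in particular, since error messages and exit summaries tend to appear at the end.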

Agent Scheduler

orchestrates different agents
uses different algos for deciding who gets to go next
- round-robin: everyone takes turns in order
- token-based: agent gets to keep going until it says it's done
- priority-based: agents go based on (user defined?) priority
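
The three scheduling policies are simple enough to sketch side by side. All names below are hypothetical, and the priority values are assumed to be user-defined integers where higher means earlier.

```python
from itertools import cycle
from typing import Dict, Iterator, List


def round_robin(agents: List[str]) -> Iterator[str]:
    """Everyone takes turns in a fixed order, forever."""
    return cycle(agents)


def priority_order(priorities: Dict[str, int]) -> List[str]:
    """Highest priority first; ties keep insertion order (sorted is stable)."""
    return sorted(priorities, key=lambda name: -priorities[name])


class TokenScheduler:
    """The agent holding the token keeps going until it says it is done."""

    def __init__(self, agents: List[str]) -> None:
        self.agents = agents
        self.holder = 0

    def next_agent(self, previous_done: bool) -> str:
        # Pass the token on only when the current holder reports it is done.
        if previous_done:
            self.holder = (self.holder + 1) % len(self.agents)
        return self.agents[self.holder]
```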

Tools Library

file editing (can edit entire file, or specify start line and end line)
retrieval (file contents, ls, grep). Seems to use vector search as well
build and execution: abstracts away the implementation in favor of simple commands like build foo
testing and validation: includes linters and bug-finding utils
git: can commit, push, merge
communication: can ask a human for input/feedback, can talk to other agents
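
As an illustration of the file editing tool, here is a sketch that supports both modes: replacing the whole file or only a start-to-end line range. It works on strings rather than real files, and the 1-indexed inclusive range is an assumption.

```python
from typing import Optional


def edit_lines(
    content: str,
    replacement: str,
    start: Optional[int] = None,
    end: Optional[int] = None,
) -> str:
    """Replace lines start..end (1-indexed, inclusive), or everything if both are None."""
    if start is None and end is None:
        return replacement
    lines = content.splitlines()
    start = 1 if start is None else start
    end = len(lines) if end is None else end
    if not (1 <= start <= end <= len(lines)):
        raise ValueError(f"invalid range {start}-{end} for {len(lines)} lines")
    # Splice the replacement in place of the selected range.
    new_lines = lines[: start - 1] + replacement.splitlines() + lines[end:]
    return "\n".join(new_lines)
```

Note that this sketch drops any trailing newline from the original content; a real tool would want to preserve it.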

Evaluation Environment

runs in Docker

Mar 16 '24 15:03 rbren

I've also been experimenting heavily with long-term planning. I've gotten good results by allowing the bot to manage its own short-term memory (context window) and long-term memory (vector database).

Here's the basic flow I'm converging on:

The context window is the bot's "internal monologue". This includes:
- Every message the bot has sent back
- Every output from the command line
- Every HTML output from the browser
The internal monologue is periodically summarized (using a separate agent) and condensed to keep it under a certain token limit
- Old thoughts are summarized more aggressively than recent thoughts
- Summarizer is told to preserve information related to the overall goal
The verbatim history of the internal monologue is kept in a vector database for later retrieval
- The bot can issue a "RECALL" command to search the database
  - Output is placed directly into the monologue
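
A toy version of this memory scheme, to show how the pieces fit together. Everything here is a stand-in: clipping old thoughts to three words replaces the summarizer agent, word overlap replaces the vector database, and the `Monologue` class itself is hypothetical.

```python
from typing import List


class Monologue:
    def __init__(self, max_words: int = 50, keep_recent: int = 3) -> None:
        self.thoughts: List[str] = []  # condensed short-term memory
        self.archive: List[str] = []   # verbatim long-term memory
        self.max_words = max_words
        self.keep_recent = keep_recent

    def add(self, thought: str) -> None:
        self.archive.append(thought)
        self.thoughts.append(thought)
        if sum(len(t.split()) for t in self.thoughts) > self.max_words:
            self._condense()

    def _condense(self) -> None:
        # Stand-in for the summarizer agent: older thoughts are clipped to
        # their first three words, recent thoughts are left untouched.
        cut = max(len(self.thoughts) - self.keep_recent, 0)
        old, recent = self.thoughts[:cut], self.thoughts[cut:]
        self.thoughts = [" ".join(t.split()[:3]) for t in old] + recent

    def recall(self, query: str, top_k: int = 2) -> List[str]:
        # Stand-in for vector search: rank archived thoughts by shared words.
        words = set(query.lower().split())
        scored = [(len(words & set(t.lower().split())), t) for t in self.archive]
        ranked = sorted(scored, key=lambda pair: -pair[0])
        hits = [t for score, t in ranked if score > 0][:top_k]
        self.thoughts.extend(hits)  # recalled text goes straight into the monologue
        return hits
```

This does a single condensing pass per overflow, so the monologue can still exceed the limit afterwards; a real summarizer would need to loop or summarize more aggressively.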

I find it's also helpful to instruct the bot to always think more between taking actions. So it edits a file, then says "I think I should run this command next", then runs the command.
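
If you wanted to enforce the "think between actions" rule in code rather than only in the prompt, a check like the one below could reject back-to-back actions. The `(author, text)` history format and the function names are assumptions; the verbs are the three used in the prompt further down.

```python
from typing import List, Tuple

ACTION_VERBS = ("RUN", "BROWSE", "RECALL")


def is_action(entry: str) -> bool:
    """An entry is an action if its first word is one of the action verbs."""
    parts = entry.strip().split(maxsplit=1)
    return bool(parts) and parts[0] in ACTION_VERBS


def allows(history: List[Tuple[str, str]], candidate: str) -> bool:
    """Reject an action if the bot's previous entry was also an action."""
    if not is_action(candidate):
        return True
    # Skip command outputs and look at the last thing the bot itself said.
    for author, text in reversed(history):
        if author == "agent":
            return not is_action(text)
    return True
```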

It's also helpful to seed the internal monologue. I've been working with a prompt like this (where I manually execute the actions it requests):

You're a thoughtful robot. This is your internal monologue.
* 1: I exist!
* 2: Hmm...looks like I can type in a command line prompt
* 3: Looks like I have a web browser too!
* 4: This is what I want: to build a TODO list app in React and express
* 5: How am I going to get there though?
* 6: It seems like I have some kind of short term memory.
* 7: Each of my thoughts seems to be stored in a numbered list.
* 8: It seems whatever I say next will be added to the list.
* 9: But no one has perfect short-term memory. My list of thoughts will be summarized and condensed over time, losing information in the process.
* 10: Fortunately I have long term memory!
* 11: I can just say RECALL, followed by the thing I want to remember. And then related thoughts just spill out!
* 12: Sometimes they're random thoughts that don't really have to do with what I wanted to remember. But usually they're exactly what I need!
* 13: Let's try it out!
* 14: RECALL what it is I want to do
* 15: 4: This is what I want: to build a TODO list app in React and express
* 16: 5: How am I going to get there though?
* 17: Neat! And it looks like it's easy for me to use the command line too! I just have to say RUN followed by the command I want to run. The command output just jumps into my head!
* 18: RUN echo "hello world"
* 19: hello world
* 20: Cool! I bet I can read and edit files too.
* 21: RUN echo "console.log('hello world')" > test.js
* 22: 
* 23: I just created test.js. I'll try and run it now.
* 24: RUN node test.js
* 25: hello world
* 26: it works!
* 27: And if I want to use the browser, I just need to say BROWSE, followed by a website I want to visit, or an action I want to take on the current site
* 28: Let's try that...
* 29: BROWSE visit google.com
* 30: <form><input type="text"></input><button type="submit"></button></form>
* 31: Cool, looks like there's a form with a text input. I bet I can put any search query there, then click submit to see the results.
* 32: BROWSE type "who am I" and click submit
* 33: <div class="result"></div>
* 34: Very cool. Now to accomplish my task.
* 35: I'll need a strategy. And as I make progress, I'll need to keep refining that strategy. I'll need to set goals, and break them into sub-goals.
* 36: In between actions, I must always take some time to think, strategize, and set new goals. I should never take two actions in a row.
* 37: OK so my task is to build a TODO list app in React and express

what is your next thought or action (RUN, BROWSE, RECALL)

Mar 16 '24 15:03 rbren

We've got our foot in the door here with #35! We can probably continue the discussion elsewhere

Mar 21 '24 00:03 rbren