
Control Loop: long term planning and execution

Open rbren opened this issue 1 year ago • 1 comments

The biggest, most complicated aspect of Devin is long-term planning and execution. I'd like to start a discussion about how this might work in OpenDevin.

There's recent prior work from Microsoft with impressive results. I'll summarize it here, with some commentary.

Overall Flow

  • User specifies objective and associated settings
  • Conversation Manager kicks in
  • Sends convo to Agent Scheduler
  • Agents execute commands
  • Output is placed back into the conversation
  • Rinse and repeat
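A minimal sketch of the flow above, assuming a single agent and plain-string messages. The names `control_loop`, `scripted_agent`, and `fake_execute` are made up for illustration; a real agent would be an LLM call and a real executor would run inside the sandbox.

```python
from typing import Callable, List


def control_loop(objective: str,
                 agent: Callable[[List[str]], str],
                 execute: Callable[[str], str],
                 max_steps: int = 10) -> List[str]:
    """Alternate agent commands and command outputs until 'stop' or max_steps."""
    conversation = [f"objective: {objective}"]
    for _ in range(max_steps):
        command = agent(conversation)            # agent decides what to do next
        conversation.append(f"agent: {command}")
        if command.strip().lower() == "stop":    # agent says the task is done
            break
        output = execute(command)                # run the command
        conversation.append(f"output: {output}") # output goes back into the conversation
    return conversation


def scripted_agent(conversation: List[str]) -> str:
    """Stand-in for an LLM: replays a fixed script, one command per turn."""
    script = ["npm test", "stop"]
    turns = sum(1 for message in conversation if message.startswith("agent: "))
    return script[turns]


def fake_execute(command: str) -> str:
    """Stand-in for the execution environment."""
    return f"(pretend output of {command!r})"


history = control_loop("make tests pass", scripted_agent, fake_execute)
```

The `max_steps` cap is an assumption added here so the loop always terminates, even if the agent never says "stop".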

Configuration

  • A YAML file defines a set of actions/commands the bot can take (e.g. npm test)
    • comment: why not just leave it open-ended?
  • You can have different agents with different capabilities, e.g. a "dev agent" and a "reviewer agent", who work collaboratively
    • comment: this sounds like MetaGPT
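To make the configuration idea concrete, here is a sketch of a per-agent command allow-list, written as a Python structure rather than YAML. The agent names come from the example above; the specific commands other than `npm test`, and the `AgentConfig` and `is_allowed` names, are assumptions for illustration.

```python
import shlex
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AgentConfig:
    name: str
    allowed_commands: List[str] = field(default_factory=list)


# Hypothetical configuration: what each agent may run.
CONFIG: Dict[str, AgentConfig] = {
    "dev": AgentConfig("dev", ["npm test", "npm install", "git commit"]),
    "reviewer": AgentConfig("reviewer", ["npm test", "git diff"]),
}


def is_allowed(agent_name: str, command: str) -> bool:
    """True if the command's leading tokens match one of the agent's allowed commands."""
    agent = CONFIG.get(agent_name)
    if agent is None:
        return False
    tokens = shlex.split(command)
    for allowed in agent.allowed_commands:
        prefix = shlex.split(allowed)
        if tokens[:len(prefix)] == prefix:
            return True
    return False
```

Matching on tokens rather than raw string prefixes means `npm test --watch` is accepted while `npm testing` is not.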

Components

Conversation Manager

  • maintains message history and command outputs
  • decides when to interrupt the conversation
    • comment: for what? more info from the user?
  • decides when the conversation is over, i.e. task has been completed
    • possible triggers: the agent sends a "stop" command, the max token limit is reached, or the execution environment has problems
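The end-of-conversation checks could look roughly like this. It's a sketch under assumptions: tokens are approximated by counting whitespace-separated words, and the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConversationManager:
    max_tokens: int
    history: List[str] = field(default_factory=list)
    environment_ok: bool = True  # set to False when the sandbox fails

    def add(self, message: str) -> None:
        self.history.append(message)

    def token_count(self) -> int:
        # Assumption: whitespace word count as a crude proxy for tokens.
        return sum(len(message.split()) for message in self.history)

    def stop_reason(self) -> Optional[str]:
        """Return why the conversation should end, or None to keep going."""
        if not self.environment_ok:
            return "environment_error"
        if self.history and self.history[-1].strip().lower() == "stop":
            return "agent_stop"
        if self.token_count() >= self.max_tokens:
            return "max_tokens"
        return None
```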

Parser

  • interprets agent output and turns it into commands, file edits, etc
  • in case of parsing failure, a message is sent back to the agent to rewrite its command
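A sketch of the parse-or-ask-again behavior. The output format here (`RUN <command>`, `EDIT <path>`, `STOP`) is an assumption for illustration, not the format used in the prior work.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Command:
    kind: str
    argument: str


# Hypothetical agent output format: a keyword followed by its argument.
PATTERN = re.compile(r"^(RUN|EDIT|STOP)\b\s*(.*)$", re.DOTALL)


def parse(agent_output: str) -> Tuple[Optional[Command], Optional[str]]:
    """Return (command, None) on success, or (None, message_for_agent) on failure."""
    match = PATTERN.match(agent_output.strip())
    if match is None:
        return None, ("Could not parse your last message. "
                      "Reply with RUN <command>, EDIT <path>, or STOP.")
    kind, argument = match.group(1), match.group(2).strip()
    if kind != "STOP" and not argument:
        return None, f"{kind} needs an argument. Please rewrite your command."
    return Command(kind.lower(), argument), None
```

On failure, the second element is the message that would be sent back to the agent so it can rewrite its command.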

Output Organizer

  • Takes command output and selectively places it into the conversation history
    • sometimes summarizes the content first
    • comment: why not just drop everything back into the conversation history (maybe truncating really long CLI output)?
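The simpler alternative from the comment, truncating long CLI output, could be sketched like this. The function name and the default limits are assumptions; summarization is left out entirely.

```python
def organize_output(output: str, max_lines: int = 20, head: int = 10) -> str:
    """Keep short output verbatim; for long output keep only the first and last lines."""
    if not 0 < head < max_lines:
        raise ValueError("head must be between 1 and max_lines - 1")
    lines = output.splitlines()
    if len(lines) <= max_lines:
        return output
    tail = max_lines - head
    omitted = len(lines) - max_lines
    kept = lines[:head] + [f"... [{omitted} lines omitted] ..."] + lines[-tail:]
    return "\n".join(kept)
```

Keeping both the head and the tail is a design choice: build and test tools tend to print the command being run at the top and the error summary at the bottom.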

Agent Scheduler

  • orchestrates different agents
  • uses different algos for deciding who gets to go next
    • round-robin: everyone takes turns in order
    • token-based: agent gets to keep going until it says it's done
    • priority-based: agents go based on (user defined?) priority
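The three scheduling policies can be sketched as small functions. The function names are hypothetical, and "larger number means higher priority" is an assumption, since the source doesn't say how priorities are defined.

```python
from itertools import cycle, islice
from typing import Dict, List, Optional


def round_robin_order(agents: List[str], turns: int) -> List[str]:
    """Round-robin: everyone takes turns in order."""
    return list(islice(cycle(agents), turns))


def token_based_next(agents: List[str], current: str, current_is_done: bool) -> str:
    """Token-based: the current agent keeps going until it says it's done."""
    if not current_is_done:
        return current
    index = agents.index(current)
    return agents[(index + 1) % len(agents)]


def priority_next(priorities: Dict[str, int], ready: List[str]) -> Optional[str]:
    """Priority-based: pick the ready agent with the highest priority."""
    if not ready:
        return None
    # Assumption: larger number = higher priority; unknown agents default to 0.
    return max(ready, key=lambda name: priorities.get(name, 0))
```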

Tools Library

  • file editing (can edit entire file, or specify start line and end line)
  • retrieval (file contents, ls, grep). Seems to use vector search as well
  • build and execution: abstracts away the implementation in favor of simple commands like build foo
  • testing and validation: includes linters and bug-finding utils
  • git: can commit, push, merge
  • communication: can ask a human for input/feedback, can talk to other agents
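As one example from the tools list, the file-editing tool (whole file, or a start/end line range) might look like this. It works on strings to stay self-contained; `edit_lines` is a hypothetical name, and 1-indexed inclusive line numbers are an assumption.

```python
from typing import Optional


def edit_lines(text: str, replacement: str,
               start: Optional[int] = None, end: Optional[int] = None) -> str:
    """Replace lines start..end (1-indexed, inclusive); with no range, replace everything."""
    if start is None and end is None:
        return replacement
    lines = text.splitlines()
    start = 1 if start is None else start
    end = len(lines) if end is None else end
    if not 1 <= start <= end <= len(lines):
        raise ValueError(f"invalid line range {start}-{end} for {len(lines)} lines")
    new_lines = lines[:start - 1] + replacement.splitlines() + lines[end:]
    # Note: this sketch does not preserve a trailing newline.
    return "\n".join(new_lines)
```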

Evaluation Environment

  • runs in Docker

rbren · Mar 16 '24 15:03

I've also been experimenting heavily with long-term planning. I've gotten good results by allowing the bot to manage its own short-term memory (context window) and long-term memory (vector database).

Here's the basic flow I'm converging on:

  • The context window is the bot's "internal monologue". This includes:
    • Every message the bot has sent back
    • Every output from the command line
    • Every HTML output from the browser
  • The internal monologue is periodically summarized (using a separate agent) and condensed to keep it under a certain token limit
    • Old thoughts are summarized more aggressively than recent thoughts
    • Summarizer is told to preserve information related to the overall goal
  • The verbatim history of the internal monologue is kept in a vector database for later retrieval
    • The bot can issue a "RECALL" command to search the database
      • Output is placed directly into the monologue
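Here's a rough sketch of that memory flow. Two big stand-ins, both assumptions: the summarizer is a plain callable (a separate agent in practice), and keyword overlap replaces the vector database search. It also simplifies "summarize old thoughts more aggressively" to "collapse everything except the most recent thoughts into one summary line".

```python
import re
from typing import Callable, List, Set


def _words(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class Monologue:
    def __init__(self, summarize: Callable[[List[str]], str],
                 max_thoughts: int = 6, keep_recent: int = 3):
        if not 0 < keep_recent < max_thoughts:
            raise ValueError("keep_recent must be between 1 and max_thoughts - 1")
        self.summarize = summarize
        self.max_thoughts = max_thoughts
        self.keep_recent = keep_recent
        self.thoughts: List[str] = []  # short-term memory (context window)
        self.archive: List[str] = []   # long-term memory (verbatim history)

    def add(self, thought: str) -> None:
        self.thoughts.append(thought)
        self.archive.append(thought)
        if len(self.thoughts) > self.max_thoughts:
            old = self.thoughts[:-self.keep_recent]
            recent = self.thoughts[-self.keep_recent:]
            self.thoughts = [self.summarize(old)] + recent

    def recall(self, query: str, top_k: int = 2) -> List[str]:
        """Stand-in for vector search: rank archived thoughts by shared words."""
        query_words = _words(query)
        scored = [(len(query_words & _words(t)), i, t)
                  for i, t in enumerate(self.archive)]
        hits = sorted((s for s in scored if s[0] > 0),
                      key=lambda s: (-s[0], s[1]))[:top_k]
        results = [t for _, _, t in hits]
        self.thoughts.extend(results)  # output goes directly into the monologue
        return results
```

Recalled thoughts are appended to the monologue but not re-archived, so repeated RECALLs don't fill long-term memory with duplicates. That's a choice made for this sketch, not something the flow above requires.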

I find it's also helpful to instruct the bot to always think more between taking actions. So it edits a file, then says "I think I should run this command next", then runs the command.

It's also helpful to seed the internal monologue. I've been working with a prompt like this (where I manually execute the actions it requests):

You're a thoughtful robot. This is your internal monologue.
* 1: I exist!
* 2: Hmm...looks like I can type in a command line prompt
* 3: Looks like I have a web browser too!
* 4: This is what I want: to build a TODO list app in React and express
* 5: How am I going to get there though?
* 6: It seems like I have some kind of short term memory.
* 7: Each of my thoughts seems to be stored in a numbered list.
* 8: It seems whatever I say next will be added to the list.
* 9: But no one has perfect short-term memory. My list of thoughts will be summarized and condensed over time, losing information in the process.
* 10: Fortunately I have long term memory!
* 11: I can just say RECALL, followed by the thing I want to remember. And then related thoughts just spill out!
* 12: Sometimes they're random thoughts that don't really have to do with what I wanted to remember. But usually they're exactly what I need!
* 13: Let's try it out!
* 14: RECALL what it is I want to do
* 15: 4: This is what I want: to build a TODO list app in React and express
* 16: 5: How am I going to get there though?
* 17: Neat! And it looks like it's easy for me to use the command line too! I just have to say RUN followed by the command I want to run. The command output just jumps into my head!
* 18: RUN echo "hello world"
* 19: hello world
* 20: Cool! I bet I can read and edit files too.
* 21: RUN echo "console.log('hello world')" > test.js
* 22: 
* 23: I just created test.js. I'll try and run it now.
* 24: RUN node test.js
* 25: hello world
* 26: it works!
* 27: And if I want to use the browser, I just need to say BROWSE, followed by a website I want to visit, or an action I want to take on the current site
* 28: Let's try that...
* 29: BROWSE visit google.com
* 30: <form><input type="text"></input><button type="submit"></button></form>
* 31: Cool, looks like there's a form with a text input. I bet I can put any search query there, then click submit to see the results.
* 32: BROWSE type "who am I" and click submit
* 33: <div class="result"></div>
* 34: Very cool. Now to accomplish my task.
* 35: I'll need a strategy. And as I make progress, I'll need to keep refining that strategy. I'll need to set goals, and break them into sub-goals.
* 36: In between actions, I must always take some time to think, strategize, and set new goals. I should never take two actions in a row.
* 37: OK so my task is to build a TODO list app in React and express

what is your next thought or action (RUN, BROWSE, RECALL)
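For automating the manual step, here's a sketch of how the bot's reply could be classified and how thought 36 ("never take two actions in a row") could be enforced. The helper names are hypothetical; the RUN, BROWSE, and RECALL keywords and the `* N:` numbering come from the prompt above.

```python
from typing import List, Optional, Tuple

ACTIONS = ("RUN", "BROWSE", "RECALL")


def classify(reply: str) -> Tuple[str, str]:
    """Return (kind, argument); kind is an action keyword or 'THOUGHT'."""
    text = reply.strip()
    parts = text.split(None, 1)
    if len(parts) == 2 and parts[0] in ACTIONS:
        return parts[0], parts[1].strip()
    return "THOUGHT", text


def accept(reply: str, previous_kind: Optional[str]) -> Tuple[bool, str]:
    """Reject an action if the bot's previous reply was also an action."""
    kind, _ = classify(reply)
    if kind != "THOUGHT" and previous_kind not in (None, "THOUGHT"):
        return False, kind
    return True, kind


def render_monologue(header: str, thoughts: List[str]) -> str:
    """Format thoughts as the numbered list used in the seed prompt."""
    lines = [header] + [f"* {i}: {t}" for i, t in enumerate(thoughts, start=1)]
    return "\n".join(lines)
```

A rejected action could be answered with a reminder to think first, in the same spirit as asking the agent to rewrite an unparseable command.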

rbren · Mar 16 '24 15:03

We've got our foot in the door here with #35! We can probably continue the discussion elsewhere

rbren · Mar 21 '24 00:03