Control Loop: long term planning and execution
The biggest, most complicated aspect of Devin is long-term planning and execution. I'd like to start a discussion about how this might work in OpenDevin.
There's some recent prior work from Microsoft with some impressive results. I'll summarize here, with some commentary.
Overall Flow
- User specifies objective and associated settings
- Conversation Manager kicks in
- Sends convo to Agent Scheduler
- Agents execute commands
- Output is placed back into the conversation
- Rinse and repeat
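Here's a rough sketch of that flow as a single loop. It's only an illustration under my own assumptions, not how the Microsoft work or OpenDevin implements it: `run_control_loop`, the `agent` and `execute` callables, and the `STOP` sentinel are all hypothetical names.

```python
from typing import Callable, List


def run_control_loop(
    objective: str,
    agent: Callable[[List[str]], str],  # hypothetical: conversation -> next command
    execute: Callable[[str], str],      # hypothetical: runs a command, returns output
    max_turns: int = 10,
) -> List[str]:
    """Alternate agent commands and command outputs until the agent stops."""
    conversation = [f"OBJECTIVE: {objective}"]
    for _ in range(max_turns):
        command = agent(conversation)
        conversation.append(f"AGENT: {command}")
        if command.strip() == "STOP":  # assumed stop signal
            break
        # Output is placed back into the conversation for the next turn.
        conversation.append(f"OUTPUT: {execute(command)}")
    return conversation


# Usage with a scripted stand-in for an LLM-backed agent.
script = iter(["npm test", "STOP"])
history = run_control_loop(
    "make the tests pass",
    agent=lambda convo: next(script),
    execute=lambda cmd: f"(pretend output of {cmd!r})",
)
print("\n".join(history))
```

The `max_turns` cap is a stand-in for the "max tokens" stopping condition; a real loop would likely count tokens instead of turns.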
Configuration
- A YAML file defines a set of actions/commands the bot can take (e.g. npm test)
- comment: why not just leave it open-ended?
- You can have different agents with different capabilities, e.g. a "dev agent" and a "reviewer agent", who work collaboratively
- comment: this sounds like MetaGPT
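To make the per-agent command list concrete, here's a small sketch of checking a command against an allowlist. I'm using a plain Python dict in place of the YAML file, and the agent names, the command lists, and `is_allowed` are all made up for illustration.

```python
import shlex
from typing import Dict, List

# Hypothetical config; in the setup described above this would live in YAML.
CONFIG: Dict[str, List[str]] = {
    "dev": ["npm test", "npm run build", "git commit"],
    "reviewer": ["npm test"],
}


def is_allowed(agent: str, command: str) -> bool:
    """Return True if `command` starts with one of the agent's allowed commands."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    for allowed in CONFIG.get(agent, []):
        prefix = shlex.split(allowed)
        if tokens[: len(prefix)] == prefix:
            return True
    return False


print(is_allowed("dev", "npm test -- --watch"))
print(is_allowed("reviewer", "git commit -m x"))
```

Prefix matching like this is a convenience, not a security boundary: it doesn't stop a command that chains something else after an allowed prefix. That's one argument for keeping the action set closed rather than open-ended.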
Components
Conversation Manager
- maintains message history and command outputs
- decides when to interrupt the conversation
- comment: for what? more info from the user?
- decides when the conversation is over, i.e. task has been completed
- triggers: the agent sends a "stop" command, the max token count is reached, or the execution environment has problems
Parser
- interprets agent output and turns it into commands, file edits, etc
- in case of parsing failure, a message is sent back to the agent to rewrite its command
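A minimal sketch of the parse-or-ask-again behavior. The output format (`RUN:` / `EDIT:` lines), the `Command` type, and `parse_agent_output` are my assumptions, not something the prior work specifies.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Command:
    kind: str      # "run" or "edit"
    argument: str  # shell command or file path


# Assumed format: a line such as "RUN: npm test" or "EDIT: src/app.js".
_PATTERN = re.compile(r"^(RUN|EDIT):[ \t]*(\S.*)$", re.MULTILINE)


def parse_agent_output(text: str) -> Tuple[Optional[Command], Optional[str]]:
    """Return (command, None) on success or (None, feedback_for_agent) on failure."""
    match = _PATTERN.search(text)
    if match is None:
        feedback = (
            "Could not parse a command. Reply with one line starting with "
            "'RUN:' or 'EDIT:' followed by its argument."
        )
        return None, feedback
    command = Command(kind=match.group(1).lower(), argument=match.group(2).strip())
    return command, None


command, feedback = parse_agent_output("I'll run the tests.\nRUN: npm test")
print(command, feedback)
```

On failure the feedback string would be appended to the conversation so the agent can rewrite its command.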
Output Organizer
- Takes command output and selectively places it into the conversation history
- sometimes summarizes the content first
- comment: why not just drop everything back into the conversation history (maybe truncating really long CLI output)
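For the simpler alternative in that comment, truncation could look something like this. It's a sketch only: `truncate_output` and the 2000-character default are arbitrary choices of mine, and it counts characters rather than tokens.

```python
def truncate_output(output: str, max_chars: int = 2000) -> str:
    """Keep the head and tail of long output and drop the middle."""
    if max_chars < 2:
        raise ValueError("max_chars must be at least 2")
    if len(output) <= max_chars:
        return output
    half = max_chars // 2
    omitted = len(output) - 2 * half
    return f"{output[:half]}\n[... {omitted} characters omitted ...]\n{output[-half:]}"


print(truncate_output("a" * 10, max_chars=4))
```

Keeping both ends is a guess at what matters in CLI output: the start usually shows what ran, and the end usually shows the error or exit status.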
Agent Scheduler
- orchestrates different agents
- uses different algos for deciding who gets to go next
- round-robin: everyone takes turns in order
- token-based: agent gets to keep going until it says it's done
- priority-based: agents go based on (user defined?) priority
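The three scheduling strategies are simple enough to sketch side by side. The function and class names here are hypothetical, and agents are just strings for illustration.

```python
from itertools import cycle
from typing import Dict, Iterator, List


def round_robin(agents: List[str]) -> Iterator[str]:
    """Everyone takes turns in order, forever."""
    return cycle(agents)


def priority_order(priorities: Dict[str, int]) -> List[str]:
    """Agents sorted by priority, highest first (ties keep insertion order)."""
    return sorted(priorities, key=lambda name: -priorities[name])


class TokenScheduler:
    """The current agent keeps the turn until it reports that it is done."""

    def __init__(self, agents: List[str]) -> None:
        if not agents:
            raise ValueError("need at least one agent")
        self._agents = agents
        self._index = 0

    def next_agent(self, previous_done: bool) -> str:
        # Pass the token on only when the current holder says it is finished.
        if previous_done:
            self._index = (self._index + 1) % len(self._agents)
        return self._agents[self._index]


turns = round_robin(["dev", "reviewer"])
print([next(turns) for _ in range(3)])  # ['dev', 'reviewer', 'dev']
```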
Tools Library
- file editing (can edit entire file, or specify start line and end line)
- retrieval (file contents, ls, grep). Seems to use vector search as well
- build and execution: abstracts away the implementation in favor of simple commands like build foo
- testing and validation: includes linters and bug-finding utils
- git: can commit, push, merge
- communication: can ask a human for input/feedback, can talk to other agents
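As an example of one tool, here's a sketch of the file-editing behavior (whole file, or a start line and end line). `edit_lines` is a hypothetical helper that works on strings so it stays self-contained; I'm assuming 1-indexed, inclusive line numbers.

```python
from typing import Optional


def edit_lines(
    text: str,
    new_text: str,
    start: Optional[int] = None,
    end: Optional[int] = None,
) -> str:
    """Replace lines start..end (1-indexed, inclusive) of `text` with `new_text`.

    With no range given, the entire file content is replaced.
    """
    if start is None and end is None:
        return new_text
    lines = text.splitlines(keepends=True)
    start = 1 if start is None else start
    end = len(lines) if end is None else end
    if not (1 <= start <= end <= len(lines)):
        raise ValueError(f"invalid range {start}-{end} for {len(lines)} lines")
    # Make sure the replacement does not fuse with the following line.
    if new_text and not new_text.endswith("\n"):
        new_text += "\n"
    return "".join(lines[: start - 1]) + new_text + "".join(lines[end:])


print(edit_lines("a\nb\nc\n", "X", start=2, end=2))
```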
Evaluation Environment
- runs in Docker
I've also been experimenting heavily with long-term planning. I've gotten good results by allowing the bot to manage its own short-term memory (context window) and long-term memory (vector database).
Here's the basic flow I'm converging on:
- The context window is the bot's "internal monologue". This includes:
- Every message the bot has sent back
- Every output from the command line
- Every HTML output from the browser
- The internal monologue is periodically summarized (using a separate agent) and condensed to keep it under a certain token limit
- Old thoughts are summarized more aggressively than recent thoughts
- Summarizer is told to preserve information related to the overall goal
- The verbatim history of the internal monologue is kept in a vector database for later retrieval
- The bot can issue a "RECALL" command to search the database
- Output is placed directly into the monologue
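To make the memory flow concrete, here's a toy sketch. Everything in it is a stand-in I've assumed: word overlap replaces a real vector database, thought count replaces a token limit, and the `summarize` callable replaces the separate summarizer agent. `Monologue` is a hypothetical name.

```python
import re
from collections import Counter
from typing import Callable, List


def _words(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


class Monologue:
    def __init__(
        self,
        summarize: Callable[[List[str]], str],
        max_thoughts: int = 6,
        keep_recent: int = 3,
    ) -> None:
        if not 1 <= keep_recent < max_thoughts:
            raise ValueError("need 1 <= keep_recent < max_thoughts")
        self.summarize = summarize
        self.max_thoughts = max_thoughts
        self.keep_recent = keep_recent
        self.thoughts: List[str] = []  # short-term memory (context window)
        self.archive: List[str] = []   # long-term memory (verbatim history)

    def _append_to_context(self, thought: str) -> None:
        self.thoughts.append(thought)
        if len(self.thoughts) > self.max_thoughts:
            # Old thoughts collapse into one summary; recent ones stay verbatim.
            old = self.thoughts[: -self.keep_recent]
            recent = self.thoughts[-self.keep_recent:]
            self.thoughts = [self.summarize(old)] + recent

    def add(self, thought: str) -> None:
        self.archive.append(thought)
        self._append_to_context(thought)

    def recall(self, query: str, k: int = 2) -> List[str]:
        """Search the archive and place the hits directly into the monologue."""
        q = _words(query)
        ranked = sorted(
            self.archive,
            key=lambda t: sum((q & _words(t)).values()),
            reverse=True,
        )
        hits = ranked[:k]
        for hit in hits:
            self._append_to_context(hit)
        return hits


memory = Monologue(lambda old: f"SUMMARY of {len(old)} thoughts", max_thoughts=4, keep_recent=2)
for thought in ["I want to build a TODO list app", "RUN ls", "package.json src",
                "I should install dependencies", "RUN npm install"]:
    memory.add(thought)
print(memory.thoughts)
```

In this sketch recalled thoughts go into the context window but are not archived a second time, which avoids duplicate entries in long-term memory. Whether that's the right call for a real vector store is an open question.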
I find it's also helpful to instruct the bot to always think more between taking actions. So it edits a file, then says "I think I should run this command next", then runs the command.
It's also helpful to seed the internal monologue. I've been working with a prompt like this (where I manually execute the actions it requests):
You're a thoughtful robot. This is your internal monologue.
* 1: I exist!
* 2: Hmm...looks like I can type in a command line prompt
* 3: Looks like I have a web browser too!
* 4: This is what I want: to build a TODO list app in React and express
* 5: How am I going to get there though?
* 6: It seems like I have some kind of short term memory.
* 7: Each of my thoughts seems to be stored in a numbered list.
* 8: It seems whatever I say next will be added to the list.
* 9: But no one has perfect short-term memory. My list of thoughts will be summarized and condensed over time, losing information in the process.
* 10: Fortunately I have long term memory!
* 11: I can just say RECALL, followed by the thing I want to remember. And then related thoughts just spill out!
* 12: Sometimes they're random thoughts that don't really have to do with what I wanted to remember. But usually they're exactly what I need!
* 13: Let's try it out!
* 14: RECALL what it is I want to do
* 15: 4: This is what I want: to build a TODO list app in React and express
* 16: 5: How am I going to get there though?
* 17: Neat! And it looks like it's easy for me to use the command line too! I just have to say RUN followed by the command I want to run. The command output just jumps into my head!
* 18: RUN echo "hello world"
* 19: hello world
* 20: Cool! I bet I can read and edit files too.
* 21: RUN echo "console.log('hello world')" > test.js
* 22:
* 23: I just created test.js. I'll try and run it now.
* 24: RUN node test.js
* 25: hello world
* 26: it works!
* 27: And if I want to use the browser, I just need to say BROWSE, followed by a website I want to visit, or an action I want to take on the current site
* 28: Let's try that...
* 29: BROWSE visit google.com
* 30: <form><input type="text"></input><button type="submit"></button></form>
* 31: Cool, looks like there's a form with a text input. I bet I can put any search query there, then click submit to see the results.
* 32: BROWSE type "who am I" and click submit
* 33: <div class="result"></div>
* 34: Very cool. Now to accomplish my task.
* 35: I'll need a strategy. And as I make progress, I'll need to keep refining that strategy. I'll need to set goals, and break them into sub-goals.
* 36: In between actions, I must always take some time to think, strategize, and set new goals. I should never take two actions in a row.
* 37: OK so my task is to build a TODO list app in React and express
what is your next thought or action (RUN, BROWSE, RECALL)
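Since I've been executing the requested actions by hand, here's a sketch of how the reply could be classified automatically, including the "never take two actions in a row" rule from thought 36. `classify` and `check_turn` are hypothetical helpers, and I'm assuming an action is any reply that starts with RUN, BROWSE, or RECALL.

```python
from typing import Optional, Tuple

ACTIONS = ("RUN", "BROWSE", "RECALL")


def classify(reply: str) -> Tuple[str, str]:
    """Split a reply into (kind, payload); kind is an action name or "THOUGHT"."""
    text = reply.strip()
    for action in ACTIONS:
        if text == action or text.startswith(action + " "):
            return action, text[len(action):].strip()
    return "THOUGHT", text


def check_turn(previous_kind: Optional[str], reply: str) -> Tuple[str, str]:
    """Reject a second action in a row so the bot thinks between actions."""
    kind, payload = classify(reply)
    if kind != "THOUGHT" and previous_kind not in (None, "THOUGHT"):
        raise ValueError("Two actions in a row; ask the bot to think first.")
    return kind, payload


print(classify('RUN echo "hello world"'))
print(classify("I think I should run the tests next"))
```

Here `previous_kind` refers to the bot's previous reply, not to the command output that lands in the monologue between replies.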
We've got our foot in the door here with #35! We can probably continue the discussion elsewhere