Lightweight 'OmniscientAgent' which the head agent can query for a broader view of the codebase
What problem or use case are you trying to solve?
For example: 'Hey AI, what is the least mature module in this codebase?', or 'Where is the documentation lacking? Fill in some gaps.'
SWE-Agent and CodeAct offer interfaces to the AI which deliberately show only a small amount of code at a time when searching or when reading files. However, my experience when using AIs to understand unfamiliar codebases is that when the AI can only see a small amount of code, it often gives misleading or unhelpful answers.
On one hand, the SWE-Agent paper found that showing only 100 lines of code at a time to the AI was optimal. However, it looks like they were using GPT-4 in that experiment. It may be that showing 100 lines at a time was optimal for the solve rate simply because GPT-4 makes very poor use of a large context (poor score on Needle In A Haystack).
In contrast, I have found that when I feed an entire codebase into Haiku, it tends to give really useful answers to questions about the code. (Maybe not surprising, given that unlike GPT-4, Haiku gets a very strong score on Needle In A Haystack.) I would not be surprised if offering this as a sub-agent tool to be called upon by a larger model improves the solve rate for problems which require a broader understanding of the codebase: the head agent (e.g. GPT-4o) could ask the sub-agent (e.g. Haiku) a question which requires a full view of the codebase whenever it suspects that its narrow searches are not going to yield the right information reliably.
Another reason why the size of the context matters is that it affects the cost of each message (in AI credits). It is not economical to feed large contexts into large models such as GPT-4o. However, because Haiku is a fairly small model, asking a question about the codebase tends to cost on the order of US 5c per message (for a large context and ~1 paragraph of output). (Maybe Gemini 1.5 Flash is similar.)
Describe the UX of the solution you'd like
This will require two additional items of user input:
- Enable this sub-agent.
- Provide the API key for the provider of the smaller model. (E.g. the head agent might use GPT-4o and the sub-agent Haiku; that means that two API keys are required instead of one.)

The output to the user will show when this subagent is called by the head agent, and will show the instructions to the subagent, the size of the context that it loads, and the response from the subagent.
Do you have thoughts on the technical implementation?
I gather that there is already some kind of framework in the works for subagents. The functionality of feeding a whole codebase into the AI has been implemented in a non-agentic way in the ClaudeCLI tool under the MIT license, in Python. (https://github.com/edwardbrazier/claudecli) The code could simply be copied from ClaudeCLI and integrated into a new agent. Maybe the most complex change will be to allow the user to supply multiple API keys for multiple models. Implementation would also require feeding the head agent some examples of when and how to call the sub-agent.
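To make the idea concrete, here is a rough sketch of what the sub-agent call could do. This is only an illustration of the approach, not ClaudeCLI's or OpenDevin's actual API; the function name, model name and use of litellm are all assumptions.

```python
# Hypothetical sketch only: gather matching source files, concatenate them into
# one large prompt, and ask a cheap long-context model via litellm.
import os
from litellm import completion

def ask_omniscience(codebase_location: str, file_extensions: list[str], question: str) -> str:
    chunks = []
    for root, _dirs, files in os.walk(codebase_location):
        for name in files:
            if any(name.endswith('.' + ext) for ext in file_extensions):
                path = os.path.join(root, name)
                with open(path, encoding='utf-8', errors='ignore') as f:
                    chunks.append(f'### {path}\n{f.read()}')
    context = '\n\n'.join(chunks)
    print(f'Loaded {len(context) / 1000:.0f} kB of source code.')
    response = completion(
        model='claude-3-haiku-20240307',  # any cheap long-context model would do
        messages=[
            {'role': 'system', 'content': 'Answer questions about the codebase provided by the user.'},
            {'role': 'user', 'content': f'{context}\n\nQuestion: {question}'},
        ],
    )
    return response.choices[0].message.content
```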
Describe alternatives you've considered
The Aider algorithm for selecting files to load into the context is pretty good at this. However, I think there are cases where feeding as much of the codebase as possible into the model still yields more reliable answers than Aider's approach.
Additional context
Here is an example conversation:
USER: Hey AI agent, in this web app for document management, most of the pages require similar functionality because it's all about viewing and editing tables. Can you tell me which page of the web app is missing the most functionality, on the assumption that it needs similar functionality to the other pages?
HEAD AGENT [e.g. CodeAct w/ GPT-4o]: This is a question which I can't readily answer by looking at small chunks of source files at a time. I will need some help. <execute_ipython>
ask_omniscience(
    codebase_location='./src',
    file_extensions=['ts', 'cs'],
    question="Assuming that each page of this webapp requires a similar interface, which one is missing the most functionality relative to the others?")
</execute_ipython>
OMNISCIENCE [Haiku]: [Log message: Loaded 400 kB of source code.] The file view module lacks a lot of functionality which the other modules (for example, the edit users module) do have. In the file view module, there isn't yet a way to edit cells in the table that lists the file, or to reorder the entries by sorting.
HEAD AGENT: That seems to address the question adequately. The module with the most room for improvement is the file view page.
USER: You've listed there some of the functions that the file view page needs. List for me the new files, classes and methods that we'll need in order to add that functionality.
This is great! @ryanhoangt and I were also discussing somehow giving models access to RAG-style tools, which is somewhat similar in motivation, but different in implementation. I would love to get something like this implemented, maybe we can discuss a bit more here!
Allowing different tools/agents to use different models is something I think we should support eventually. One way I can think of doing this is to have "LLM_API_KEY" be the default key, but also have "LLM_API_OPTIONS" which maps an input string (e.g. "efficient-llm") to an LLM-key pair (e.g. "haiku", "my-claude-key").
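For illustration, the proposed mapping might be resolved into per-agent (model, key) pairs roughly as below. The "LLM_API_OPTIONS" variable and its JSON format are just this proposal, not an existing OpenDevin setting.

```python
# Hypothetical parsing of the proposed LLM_API_OPTIONS setting; the variable
# name and JSON format are assumptions, not current OpenDevin configuration.
import json
import os

DEFAULT_KEY = os.environ.get('LLM_API_KEY', '')
DEFAULT_MODEL = os.environ.get('LLM_MODEL', 'gpt-4o')

# e.g. LLM_API_OPTIONS='{"efficient-llm": {"model": "claude-3-haiku-20240307", "api_key": "my-claude-key"}}'
LLM_API_OPTIONS = json.loads(os.environ.get('LLM_API_OPTIONS', '{}'))

def resolve_llm(name: str | None = None) -> tuple[str, str]:
    """Return (model, api_key) for a named option, falling back to the defaults."""
    if name and name in LLM_API_OPTIONS:
        option = LLM_API_OPTIONS[name]
        return option['model'], option.get('api_key', DEFAULT_KEY)
    return DEFAULT_MODEL, DEFAULT_KEY
```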
Yeah, this idea is cool! I'm thinking about how we can facilitate meaningful and effective collaboration between the head agent and the sub-agent. Specifically, for an OmniscientAgent that has complete knowledge of the codebase, I think the challenge is determining how the head agent can recognize when it needs to stop and ask questions to understand the codebase instead of taking further actions, and which questions it should ask.
> The output to the user will show when this subagent is called by the head agent, and will show the instructions to the subagent, the size of the context that it loads, and the response from the subagent.
@edwardbrazier can you explain a bit more about how the head agent interacts with the sub-agent? I'm not quite clear on the sentence above.
@ryanhoangt , I've added an example conversation to the description above.
Really the Omniscience isn't an AI agent as such, but just a chatbot which can be messaged by an agent. I don't think it even needs to keep its own conversation history.
But I thought that in the existing framework, treating it as a sub-agent might be the natural choice.
Thank you for the details, that's a good use case. We can try implementing such an agent and see how it performs!
@edwardbrazier , are you interested in taking a shot at implementing this? If so that's great, if not @ryanhoangt might also be interested in trying it out.
Ok, I will do this one.
Haven't got an estimate yet for how long I will need.
Will come back to you with some questions about design decisions, since this change might affect a few different parts of the codebase.
> Allowing different tools/agents to use different models is something I think we should support eventually. One way I can think of doing this is to have "LLM_API_KEY" be the default key, but also have "LLM_API_OPTIONS" which maps an input string (e.g. "efficient-llm") to an LLM-key pair (e.g. "haiku", "my-claude-key").
It seems to me that we almost have that implemented, in the form of the LLMConfig class and the ability to load any llm-specific config by name, mapping it to LLMConfig, for any name defined in the toml file. LLMConfig is a singleton today, like all config classes, so it updates itself from toml instead of creating a new instance, but if we want to change that rule, it could fit this purpose I think.
Here is my design concept after a very brief look at the existing code. Let me know if you have any concerns.
I will add two agents:
- VantagePointAgent (head agent; based on CodeAct)
- OmniscientChatBot (sub-agent; based on the ClaudeCLI repo)
The reasons why I intend to make a new head agent (instead of modifying CodeActAgent) are:
- Not all providers have a model appropriate for OmniscientChatBot.
- Not all users will have an API key for Anthropic.

Thus it is better to leave the default head agent as-is (presently CodeAct?) and provide VantagePointAgent for those who do have an API key for Anthropic.
One of my assumptions is that it might be tricky to add and remove capabilities dynamically, because they depend on the example chat string that we use to teach the model how to use those capabilities (multi-shot prompting). That's one more reason why I will make a new head agent.
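For illustration, such an in-context example string might look roughly like this. The wording and the ask_omniscience name are assumptions for this sketch, not an existing prompt in the repository.

```python
# Hypothetical multi-shot prompt fragment that could be appended to the head
# agent's in-context examples to teach it when to call the sub-agent.
OMNISCIENCE_EXAMPLE = '''
USER: Which module in this repo is the least consistent with the project's error-handling conventions?
ASSISTANT: I can't answer this reliably by reading a few files at a time, so I will ask the codebase-wide sub-agent.
<execute_ipython>
ask_omniscience(
    codebase_location='./src',
    file_extensions=['py'],
    question="Which module is least consistent with the project's error-handling conventions?")
</execute_ipython>
'''
```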
I intend to do this in two stages:
- Initial implementation will support a single API key (Anthropic) for both agents, so that I don't need to change the UI for inputting API keys. OmniscientChatBot will be stateless, without access to a conversation history. Will submit a working version with these limitations.
- Second stage will support two API keys, so that the VantagePointAgent can use for example GPT-4o and the OmniscientChatBot can be either Gemini 1.5 Flash or Claude Haiku.
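As a rough sketch of the stage-1 shape (class and method names are assumptions, not the OpenDevin agent API), the stateless sub-agent could be as simple as:

```python
# Hypothetical stage-1 OmniscientChatBot: stateless, one question per call,
# no conversation history; routing the model through litellm is an assumption.
from litellm import completion

class OmniscientChatBot:
    def __init__(self, model: str, api_key: str):
        self.model = model
        self.api_key = api_key

    def ask(self, codebase_text: str, question: str) -> str:
        """Answer a single question about the given codebase snapshot."""
        response = completion(
            model=self.model,
            api_key=self.api_key,
            messages=[
                {'role': 'system', 'content': 'Answer questions about the codebase provided by the user.'},
                {'role': 'user', 'content': f'{codebase_text}\n\nQuestion: {question}'},
            ],
        )
        return response.choices[0].message.content
```

In stage 2, VantagePointAgent and OmniscientChatBot would simply be constructed with different (model, api_key) pairs.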
Hello @edwardbrazier, as to configuration handling, you could probably already use OpenDevin's code to load at least the main config file and/or parts of it, e.g.:
```python
from opendevin.core.config import AppConfig, load_from_toml

config = AppConfig()
load_from_toml(config)  # loads the core, llm and agent sections into config
# use config values like config.llm.api_key and config.llm.model
```
For more "examples" also have a look at the main opendevin/core/config.py file as well as the unit test file: tests/unit/test_config.py.
See also:
- #742
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
I am actively working on this now but had been delayed a bit. Please don't close it.
Happy to hear! Label removed. 😃
This issue was closed because it has been stalled for over 30 days with no activity.