feat(workflow): Implement a simplified CoAct workflow
Short description of the problem this fixes or functionality that this introduces. This may be used for the CHANGELOG
- This PR implements a simplified multi-agent workflow inspired by the CoAct paper.
- Currently, in swe-bench eval, there are complex instances that OpenHands fails, especially ones where a single `CodeActAgent` overlooks the buggy location. If we have a grounding test case for the issue, this workflow seems to help.
- An overkill-ish successful trajectory with replanning can be found here.
- A task which `CoActPlannerAgent` finished but `CodeActAgent` failed (I expected both to be able to complete it):
Give a summary of what the PR does, explaining any non-trivial design decisions
- Modify CodeAct to make it accept delegated tasks.
- Implement 2 new agents, a planner and an executor, with the same abilities as CodeAct but different system prompts and additional action parsers (see the sketch after this list).
- Nit: adjust the delegate message shown in the UI.
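For the second bullet above, here is a rough, self-contained sketch of the extra parsing the planner needs; the dataclasses are stand-ins rather than the PR's actual OpenHands classes, and only the `<execute_global_plan>` tag is taken from the trajectories shown later in this thread:

```python
import re
from dataclasses import dataclass, field


@dataclass
class MessageAction:
    """Stand-in for the real OpenHands MessageAction."""
    content: str


@dataclass
class AgentDelegateAction:
    """Stand-in for the real OpenHands AgentDelegateAction."""
    agent: str
    inputs: dict = field(default_factory=dict)
    thought: str = ''


# The planner wraps its plan in an <execute_global_plan> block (this tag shows
# up in the trajectory excerpts later in the thread).
PLAN_BLOCK = re.compile(r'<execute_global_plan>(.*?)</execute_global_plan>', re.DOTALL)


def parse_planner_response(response_text: str):
    """If the LLM emitted a plan block, delegate it to the executor agent;
    otherwise treat the response as a plain message."""
    match = PLAN_BLOCK.search(response_text)
    if match is None:
        return MessageAction(content=response_text)
    return AgentDelegateAction(
        agent='CoActExecutorAgent',  # executor agent named in this PR
        inputs={'task': match.group(1).strip()},
        thought=response_text[: match.start()].strip(),
    )
```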
Some next steps to improve this may be:
- [x] Try eval on some `swe-bench-lite` instances.
- [x] Adjust the system/user prompts and few-shot examples to further specialize the two agents. Also define the structure for the plan (e.g., its components, etc.). The 2 agents can now cooperate to finish a swe-bench issue.
- Use a meta prompt to reinforce the actions of the agents, to make sure they follow the workflow.
- Experiment with the ability for the global planner to refuse a replan request from the executor.
- Implement the ability for the delegated agent (e.g., `BrowsingAgent` or `CoActExecutorAgent`) to persist its history through the multi-turn conversation.
Link of any specific issues this addresses
https://github.com/All-Hands-AI/OpenHands/issues/3077
just fyi, the integration tests seem to fail because of some `"action_suffix": "browse"` in some results.
Thanks, still waiting for reviews on it; if it is good to go I will look into the tests.
Hey, thanks a bunch for this @ryanhoangt !
I browsed through the code, and I think it's implemented quite well. Personally I think the next step could be to test if it gets good benchmark scores.
Thanks Prof., I'll do and update how it goes soon.
It might be in the paper(s), but I don't quite like that the prompts now talk of `agent`, while everywhere else it is `assistant`.
Makes sense, though I think it wouldn't make a lot of difference? But for consistency I'll change all to `assistant`.
@neubig Hi Prof., so far I have tested on a few (13) swe-bench instances that are shared between `swe-bench-lite` and `swe-bench-verified`, using the same max of 30 turns:
- CoAct can resolve 8/13
- CodeAct v1.8 can resolve 9/13

This is IMO not really bad for an initial draft (without robust replanning tested, and with some prompt improvements still needed to make it more reliable). I'm going to run some more instances in the next few days to be more certain about its ability, but it might be a bit slow due to the heavy nature of the eval harness right now, plus some other issues I ran into during evaluation: timeout issues with the runtime, and a few multiprocessing issues when using more than 1 worker. I'm not sure how many instances we should run to get a reliable score for the implementation and accept the first draft (maybe into a separate branch). Can I get your thoughts on this? Also cc @xingyaoww if you have some ideas to share.
I think it is also important to track the cost for both, as that would be an important factor in estimating whether CoAct is overkill at times, and whether it gets stuck in more local executor loops that might be avoided with guidance in the loop.
Hey @ryanhoangt, sounds good!
I'm not sure how many instances we should run to get a reliable score for the implementation and accept the first draft (maybe into a separate branch).
I'm actually not sure what "accept into a separate branch" means? This already is a separate branch, right?
But in terms of a target, maybe we could run the smaller set of 92 that you created. I think the eval harness is probably in better shape now, so maybe it won't be too bad?
Great job! Sorry for all the nits; skip the ones that seem too picky, of course. As an extra thought: the 2 system prompts' examples look pretty much the same. Ideally, if those could be a partial template or applied via code, it could reduce the duplication a little, maybe as a follow-up PR?
After running the eval on a subset of 93 instances, it's quite disappointing to see that the performance for now is pretty bad. CoAct only resolved 25/93 while CodeAct resolved 39/93.
Below are some plots of both agents' results:
Another thing I noticed is that the lists of medium issues each resolved are pretty different:
I'm not sure if we should add this agent into agenthub and improve it in future PRs?
Thanks so much for running the eval on this! From the outside, it is a bit surprising that it is not at least equal in performance, imo. If you could, please share the trajectory with us (if not already done), e.g. in Slack; we should do some investigation into the failed instances. Sometimes it's just one small thing being responsible for multiple fails.
From the outside, it is a bit surprising that it is not at least equal in performance, imo.
It's quite unexpected to me as well. I will upload the trajectory to the visualizer shortly.
@ketan1741 @tobitege I uploaded the trajectory to my viz here, as uploading a subset of eval on OpenHands's official space may confuse people. Maybe we can have a look to see if there are any obvious improvements to make.
One thing to note is that because the eval process currently doesn't capture the trajectory for delegation, we can't see how the executor performed in the visualization. We can address this before running a new eval.
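To illustrate the gap (a purely hypothetical helper, not the eval harness's actual code), the saved trajectory would need to expand delegate sub-histories instead of stopping at the delegation boundary, e.g.:

```python
# Hypothetical sketch only: flatten a nested event history so the executor's
# (delegate's) steps end up in the saved trajectory too. The real eval harness
# uses different data structures; the `delegate_events` attribute is invented
# here to show the idea.
def flatten_history(events: list) -> list[dict]:
    flat = []
    for event in events:
        flat.append({'type': type(event).__name__, 'repr': repr(event)})
        delegate_events = getattr(event, 'delegate_events', None)
        if delegate_events:
            # Recurse into the delegate's own history so it is captured as well.
            flat.extend(flatten_history(delegate_events))
    return flat
```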
PR for that, untested at the moment: https://github.com/All-Hands-AI/OpenHands/issues/3881
@ryanhoangt I read the CoAct paper today and found this Open PR. Thanks for implementing the approach as the idea is quite interesting and in theory should improve the overall performance.
I'm also a bit surprised to see the lower performance, and the CoAct approach failing on instances where a standalone CodeActAgent resolves them.
I'm not sure if we should add this agent into agenthub and improve it in future PRs?
Would it be possible to run eval on all 300 instances? I'd love to chat further on AllHands Slack as I'm exploring something similar and this idea has a lot of potential.
@ryanhoangt @ketan1741 I'm debugging this a bit, and I don't yet have enough of the answers we need, but just so you know:
- The PR is merged. Please see here for delegate trajectories on a simple test: HuggingFace or attached CoActPlannerAgent.zip
- IMHO `django__django-10914` looks like a fluke. It's funny... the agent has to find FILE_UPLOAD_PERMISSIONS; in the task, FILE_UPLOAD_PERMISSION is used interchangeably with FILE_UPLOAD_PERMISSIONS, and the agent read the right file, but 1) on `main` it concluded that it has FILE_UPLOAD_PERMISSIONS and modified it, 2) on the branch it concluded it cannot find FILE_UPLOAD_PERMISSION, so it has to add it and fails the task. In the example test above it succeeded (repeatedly) on the PR branch.
- In part, we are making CodeAct a delegate and it's not yet ready for it. For example, a thing that happens is that it's using `get_user_message()` here, which has to return a MessageAction. But a delegate doesn't have any MessageAction, and the method has no protection for delegates (it doesn't stop at `start_id` or AgentDelegateAction), so it's hijacking a MessageAction from the parent agent. This shouldn't happen (fix coming up). It doesn't influence anything else, but there might be other issues of this kind...? FWIW other delegates use something like this.
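A minimal sketch of the kind of guard described in the last bullet, with stand-in event types; the actual OpenHands fix may look different:

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Stand-in for an OpenHands event."""
    id: int
    source: str = 'agent'


@dataclass
class MessageAction(Event):
    content: str = ''


@dataclass
class AgentDelegateAction(Event):
    agent: str = ''


def get_latest_user_message(events: list[Event], start_id: int):
    """Walk history backwards for the latest user message, but stop at the
    delegate's start boundary so a delegate never picks up a MessageAction
    that belongs to its parent agent."""
    for event in reversed(events):
        if event.id < start_id or isinstance(event, AgentDelegateAction):
            break  # everything earlier belongs to the parent conversation
        if isinstance(event, MessageAction) and event.source == 'user':
            return event
    return None
```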
@enyst thanks for the quick fix and the insightful comments.
Re the `django__django-10914` instance, it's interesting to know there are some unclear specs here; I don't know which behavior I should advocate. But I think there's not much we can do, as it's more to do with the LLM than the agent implementation.
we are making CodeAct a delegate and it's not yet ready for it.
Yeah I can also imagine some other weird cases that might come up, and we will need more work to make the agent (with delegation) robustly usable on the platform.
Tbh this is also not a version that I initially expected (i.e., using the grounding test case to help the agent resolve the issue) or am fully happy with, but it seems to be a good start in this direction. I'll try to look into other instances' trajectories and explore some other variations to see if they help. Should I close this PR for now?
Hi @jatinganhotra, regarding the full evaluation, I think it will likely further lower the score, as the remaining instances are not included in swe-bench-verified. IMO it would be more beneficial to run the full eval after validating that the agent gets good scores on this "verified" subset.
On All Hands's Slack we have a dedicated channel named "workflows" to discuss this topic. We'd love to chat more about improvement ideas there.
Tbh this is also not a version that I initially expected or am fully happy with, but it seems to be a good start in this direction. I'll try to look into other instances' trajectories and explore some other variations to see if they help.
I also hope that understanding and debugging swe-bench issues will get easier after PR 3808 (simpler execution flow in the backend, fix delegates) and PR 3413 (logging prompts).
Should I close this PR for now?
Sorry, that's not what I meant. Up to you, of course, but IMHO it's worth looking into it a bit more and fixing stuff around what happens. It is quite amazing that it solves 6 more than main, but for some reason we still don't know, it breaks on 9. FWIW the 2 tests I worked with were from those 9 that pass on main but failed on the branch, and they passed every time.
Two minor thoughts:
- the PR sends the new actions serialized as JSON, but CodeAct doesn't usually do that; it sends strings as in a dialogue with the LLM. It may be less confusing to the LLM if we are consistent and add the new ones around here in a similar format, extracting the useful info (see the sketch after this list). Small note: Claude doesn't do as well with JSON as GPT-4o in my experience, not because it can't, but it seems to need some prompting or gets confused at times.
- what do you think, what if we give Claude the prompt pieces, and ask it to rephrase for clarity, precision, and all info needed?
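Regarding the first bullet, a hedged sketch of what the "plain dialogue string" formatting could look like for the new events; the field names here are illustrative, not the PR's:

```python
def format_plan_result_for_llm(phase_results: dict[str, str]) -> str:
    """Render the executor's structured result as plain prompt text instead of
    letting it fall through to a default JSON/dict serialization."""
    lines = ['OBSERVATION:', 'The executor finished the delegated plan.']
    lines += [f'- {phase}: {outcome}' for phase, outcome in phase_results.items()]
    return '\n'.join(lines)


# Example: what the planner would see for the django example discussed later
# in the thread, instead of a raw dict dump.
print(format_plan_result_for_llm({
    'Phase 1': 'added __str__ to the Choices base class',
    'Phase 2': 'added a regression test',
    'Phase 3': 'test executed and passed',
}))
```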
One other thing that may help in merging this PR is keeping the changes to the default agent to a minimum, or showing that they don't affect its performance. Just my thought; not sure how others think about it?
Sounds good, I can do the debugging in the next few days and try running a new eval to obtain the trajectory of the executor. In the previous eval, bugs such as the missing `</execute...>` tag in the stop sequences also negatively affected the performance, as they made the planner execute the plan itself. Also happy to hear other ideas on improving this!
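As an illustration of the stop-sequence point (the `</execute_global_plan>` tag is the one visible in the prompt excerpts later in the thread; the other tag names and the call-site shape are assumptions, not the repo's exact code):

```python
def planner_completion_kwargs(messages: list[dict]) -> dict:
    """Build LLM completion kwargs whose stop strings include the closing plan
    tag, so generation is cut off right after the planner delegates instead of
    letting it continue and 'execute' the plan itself."""
    return {
        'messages': messages,
        'stop': [
            '</execute_global_plan>',  # planner's delegation block
            '</execute_bash>',         # assumed CodeAct-style tool tags
            '</execute_ipython>',
        ],
    }
```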
Also just to clarify a bit, the 6/9 diff above is for instances at the medium (15m-1h) level. There's also a diagram for instances at the easy (<15m) level (the 2 instances you ran are in the red subset).
I looked into the trajectories at this easy level; one observation that makes CoAct fail on many instances is that the planner very often came up with plans that have more steps than needed (often the plan contained 3 phases but only 1 of them was actually required to resolve the issue).
Some questions:
the PR sends the new actions serialized as JSON
Can you clarify this? I'm not sure I did any json serialization in the PR...
One other thing that may help in merging this PR is keeping the changes to the default agent to a minimum
I think there are no changes to CodeAct execution in this PR, except the message property of AgentDelegateAction (which CodeAct should not use in swe-bench). Can you point out the changes you're concerned about?
what do you think, what if we give Claude the prompt pieces, and ask it to rephrase for clarity, precision, and all info needed?
Can you give an example of this?
Can you clarify this? I'm not sure I did any json serialization in the PR...
I guess it could be that it's the default serialization, so if they're not added at that point in CodeAct, maybe it takes it by default? Not fully sure, but anyway I think it's visible when you look at the trajectories linked above, I'm looking now at the first of those 2, and step 9 is like:
Action
```
{
"id":20
"timestamp":"2024-09-16T19:46:29.172006"
"source":"agent"
"message":"The new test method has been added successfully. Now all three phases of the task are complete. Let's summarize what we've done:
1. Modified the FILE_UPLOAD_PERMISSIONS setting in global_settings.py to 0o644.
....
```
I attach a couple of files that are very strange to me. They're a few of the prompts sent to the LLM during django__django-11964. prompts.zip
prompt_039.py is the first prompt to Planner after the Executor returns with success. It has an observation using JSON. The last lines:
...
</expected_content>
</execute_global_plan>
----------
OBSERVATION:
{'content': "Thank you for the update. The changes for all three phases have been successfully implemented. Let's review the modifications:\n\n1. We added the `__str__` method to the `TextChoices` class.\n2. We added the `__str__` method to the `IntegerChoices` class.\n3. We added the `__str__` method to the base `Choices` class.\n\nThese changes should address the issue of differing types for TextChoices and IntegerChoices fields. The `__str__` method in each class now ensures that the string representation of the enum value is returned.\n\nTo summarize the changes made:\n\n1. In the `Choices` base class:\n ```python\n def __str__(self):\n return str(self.value)\n ```\n\n2. In the `IntegerChoices` class:\n ```python\n def __str__(self):\n return str(self.value)\n ```\n\n3. In the `TextChoices` class:\n ```python\n def __str__(self):\n return str(self.value)\n ```\n\nThese modifications should resolve the issue of differing types for TextChoices and IntegerChoices fields. The value returned by the getter of the field will now be of the same type as the one created by enum.Enum (enum value).\n\nThe task is now complete."}
ENVIRONMENT REMINDER: You have 21 turns left to complete the task. When finished reply with <finish></finish>.
There's something else that looks suspicious to me just after this. The next prompt sent to the LLM is from the Executor, and its prompt includes some text from the Planner-specific prompt:
prompt_040.py
```
You are an autonomous intelligent programming agent playing the role of a subordinate employee...
...
NOW, LET'S START!
The user message is: "Now, let's come up with 2 global plans sequentially.
- First, examine the codebase and locate the relevant code for the issue. Then we'll come up with the FIRST detailed plan with all the edits to resolve it.
- After the local agent finishes executing the first plan, navigate the codebase again and come up with the SECOND detailed plan to create exactly ONE unit test at the correct location to verify the change has actually resolved the issue. As the LAST phase, explicitly tell the executor to execute it after creating that test case. If the test failed and after debugging it the local executor believes the previous fixes are incorrect, request for a new plan and include the error with explanation for that request."
# Phases
## Phase 1
- description: Create a new test file for enum-related tests.
...
## Phase 3
- description: Execute the newly created test case.
- reason: We need to verify that our changes have resolved the issue and that the test passes.
- expected_state: The test case is executed and passes successfully.
ENVIRONMENT REMINDER: You have 20 turns left to complete the task. When finished reply with <finish></finish>.
```
Please correct me if I misunderstand how this should work. @ryanhoangt @ketan1741
I thought this section about "let's come up with 2 global plans sequentially" is part of the Planner agent prompt, and "playing the role of a subordinate employee" is the Executor. (Then the phases are written by the Planner for the Executor.) Isn't that the case? Does the above look expected?
I think it's visible when you look at the trajectories linked above, I'm looking now at the first of those 2, and step 9 is like:
Re the json in the visualizer, seems like it is because we don't format the finish action yet.
prompt_039.log - It has an observation using JSON.
Good catch, this seems to be another bug. Might be because this action is not handled properly:
```python
return AgentFinishAction(thought=thought, outputs={'content': thought})
```
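A hedged sketch of one way to keep that dict out of the prompt (stand-in code, not the PR's): extract the text from the delegate's outputs before it gets interpolated into the planner's observation, so the planner sees plain text instead of the `{'content': ...}` repr shown in prompt_039:

```python
def delegate_outputs_to_text(outputs: dict) -> str:
    """Render a finished delegate's outputs as plain observation text."""
    content = outputs.get('content', '')
    return f'OBSERVATION:\n{content}'
```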
There's something else that looks suspicious to me just after this. The next prompt sent to the LLM is from the Executor, and its prompt includes some text from the Planner-specific prompt
Yeah, I also noticed this issue. My intention is to make the Planner include the full user message (hence the full problem statement in swe-bench) to give the executor some more context, but sometimes it included the message from the few-shot examples, or the "Now, let's come up with 2 global plans sequentially." one, as you saw, which is problematic.
I thought this section about "let's come up with 2 global plans sequentially" is part of the Planner agent prompt, and "playing the role of a subordinate employee" is the Executor. (Then the phases are written by the Planner for the Executor.) Isn't that the case? Does the above look expected?
"let's come up with 2 global plans sequentially" - this is an extra piece of prompt used only in swe-bench evaluation for CoActPlanner. Similar to CodeActSWEAgent below, it can be used to steer the agent a bit to be better at a specific task, but I'm not sure the current "2 global plans" is the optimal way to go. In CodeActAgent there're many cases where the agent just fixed the issues without creating any tests.
https://github.com/All-Hands-AI/OpenHands/blob/41ddba84bd0d62f8a4bc48d08addde4b4269a687/evaluation/swe_bench/run_infer.py#L248-L290
I wonder if it's better if we include the user message we want in the Executor ourselves, rather than nudge the LLM to include it. We know exactly the snippet we want, after all.
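A small sketch of that suggestion (names are illustrative, not the PR's): attach the original issue text to the delegation inputs in code, instead of relying on the planner LLM to copy it into its plan:

```python
def build_executor_inputs(plan_text: str, original_issue: str) -> dict:
    """Inputs for the delegated executor: the planner-written phases plus the
    original problem statement, injected verbatim rather than via the LLM."""
    return {
        'task': plan_text,
        'original_issue': original_issue,
    }
```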
Yeah, that makes sense, I can try doing that in the next run.
Okay, the score is finally converging to what we want; thanks @enyst for all the improvement suggestions! On the subset of 93 verified instances, CoAct resolved 33/93 while CodeAct resolved 39/93.
Some plots:
Comparing instances resolved in each category, it seems like CoAct doesn't perform very well on easy-level instances:
I'm gonna upload the trajectories to Huggingface shortly.
Cheers! This is great news. ❤️
The reason I suggested we take a look at the default agent changes was just to make sure that it doesn't change its normal behavior. Give or take some details that I'm guessing the integration tests will be unhappy with (we can see and fix them if so), I think it shouldn't be a problem.
The trajectory is uploaded to the visualizer here. I'm going to run evaluation on all 300 instances with the remote runtime to see how it goes, also clean up code a bit and fix tests.
Hello @ryanhoangt. Just checking in to see if this is something that you will continue working on? There are lots of changes that have gone in recently, and we don't want you to run into too many hard-to-resolve conflicts, as it seems like it's an involved PR.
Hey @mamoodi, thanks for checking in. I'm a bit tied up with other tasks at the moment, so I won't be able to get back to this right away. Maybe we can close the PR for now and I will try to circle back when I have more bandwidth.
As per Ryan's comment, I'm going to close this PR for now. Whenever Ryan is ready, it will be reopened. Thank you.