[Feat] Added FileEditAction to enable edits using diff format.
Short description of the problem this fixes or functionality that this introduces. This may be used for the CHANGELOG
We want to switch from our current agentskills style (similar to https://aider.chat/docs/benchmarks.html#diff-func) to Aider's diff block format (https://aider.chat/docs/benchmarks.html#diff) and see whether that improves editing performance.
Give a summary of what the PR does, explaining any non-trivial design decisions
This PR adds a new EditAction, with logic in the CodeActParser to parse output in the diff format (translating the diff into an EditAction). The EditAction is then executed inside the runtime (for now, we simply parse the diff format into search and replace blocks and manually call the existing agentskills to perform the edit).
We use the <execute_edit> tags to enclose the diff format. For example:
```
<execute_edit>
demo.py
<<<<<<< SEARCH
print("hello")
=======
print("goodbye")
>>>>>>> REPLACE
</execute_edit>
```
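As described above, the runtime currently turns the diff into search and replace blocks before calling the existing agentskills. A minimal sketch of that parsing step, with illustrative names (this is not the actual CodeActParser code):

```python
import re

# Matches one SEARCH/REPLACE block inside the <execute_edit> payload.
BLOCK_RE = re.compile(
    r"<<<<<<< SEARCH\n(.*?)\n=======\n(.*?)\n>>>>>>> REPLACE",
    re.DOTALL,
)

def parse_execute_edit(payload: str) -> tuple[str, list[tuple[str, str]]]:
    """Return the target filename and the list of (search, replace) blocks."""
    first_line, _, body = payload.strip().partition("\n")
    filename = first_line.strip()
    blocks = [(m.group(1), m.group(2)) for m in BLOCK_RE.finditer(body)]
    return filename, blocks

# Example usage with the demo.py edit shown above:
payload = '''demo.py
<<<<<<< SEARCH
print("hello")
=======
print("goodbye")
>>>>>>> REPLACE'''
print(parse_execute_edit(payload))
# ('demo.py', [('print("hello")', 'print("goodbye")')])
```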
Link of any specific issues this addresses: #3650
Hey Raj, thanks so much for taking this on! Got a few questions, though:
- What is the reasoning behind moving this into an action/observation instead of keeping it in the agentskills? There seems to be a lot of overhead for a comparatively small amount of added lines that implement the action.
- The action itself then still uses the same agentskill methods to perform the actual operation. What's the difference here to how Aider does its implementation?
- One problem with having it in the client runtime is that it then also needs to be ported into the remote runtime.
These are just first questions/thoughts. Do you have other things still on a to-do list (besides integration test mocks 😬) ?
Valid question @tobitege, I also thought so initially.
The primary motive here is to get the agent to produce the edit in a diff format similar to Aider's (simply because LLMs are trained to do that better). This means we need to modify the prompt to get the agent to output the diff wrapped in an <execute_edit> tag.
From there on, actions/observations are the only way to actuate the edit through the runtime.
@xingyaoww Can probably explain this better?
Oh, no, that's what the "user_prompt.j2" template (in the codeact_agent folder) and the docstrings in the file_ops.py file are for. :)
You might take a peek at the PromptManager (it's a rather recent addition): https://github.com/All-Hands-AI/OpenHands/blob/5100d12cea2cd35c30a22e25fbac376b72ed0981/openhands/utils/prompt.py
Also, don't worry about the integration tests for now. We can regenerate the mock files for you when the time is right.
> The action itself then still uses the same agentskill methods to perform the actual operation. What's the difference here to how Aider does its implementation?
@tobitege Yeah, the main difference between this PR and our current agentskills implementation is that we expose an editing interface to the agent that does not require explicit string escaping (e.g., you would need to escape strings when calling Python functions from agentskills, which causes weird bugs). Requiring the LLM to return stuff as JSON has also been shown to be pretty problematic due to the need for escaping (https://aider.chat/2024/08/14/code-in-json.html).
If this diff edit format works (improves performance), we can implement this action directly on the runtime client's side, like FileRead and FileWrite. If it doesn't work, I'd say we don't bother.
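To illustrate the escaping point with a toy example (the function below is a stub standing in for an agentskills-style call, not the real API):

```python
# Toy stand-in for an agentskills-style edit call; the real signature may differ.
def edit_file_by_replace(file_name: str, to_replace: str, new_content: str) -> None:
    print(f"in {file_name}: replace {to_replace!r} with {new_content!r}")

# As a Python call, code containing quotes has to be escaped as a string
# literal, which LLMs frequently get wrong:
edit_file_by_replace(
    "demo.py",
    to_replace="msg = 'it\\'s \"done\"'",
    new_content="msg = 'it\\'s \"finished\"'",
)

# In the diff format, the same lines appear verbatim between the SEARCH and
# REPLACE markers, so there is no extra escaping layer for the LLM to get right.
```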
Alright, sounds like it should help prevent a lot of issues this way. 👍
@RajWorking if you'd update the branch (either clicking the button or merge main), I could run the regeneration of the integration files for you, just let me know. :)
Performance on the Aider Bench: slight improvement (number of passed tests: 84/133).
@RajWorking Can you plot before vs. after? Are you using test cases? Is there a change in the average number of "turns" / average cost?
Once https://github.com/All-Hands-AI/OpenHands/pull/3829 and https://github.com/All-Hands-AI/OpenHands/pull/3836 are merged, I'll start an SWE-Bench Lite eval on this PR 😄
Probably also a good time to bump the version of CodeAct to 1.10, since there are significant prompt changes - I can push a commit.
https://github.com/All-Hands-AI/OpenHands/pull/3777/files#diff-9e2cd489ef2b4ad5f779049fd4a91abe7adea0e5b038fa2a898d1688451ae0eeL38
@RajWorking We only got 74/300 on this branch, using the exact same protocol I did in this PR, which got us 89/300. I'm going to re-run the eval on the main branch now and investigate whether this degradation comes from recent changes on main or from the editing changes here.
In the meantime, feel free to check this eval output file for weird cases / room for improvement: claude-3-5-sonnet@20240620_maxiter_30_N_v1.10-no-hint.zip
- I did try to get one instance (57) to a success, but failed after 8 attempts, despite trying a couple of tweaks.
- ENABLE_AUTO_LINT=true does not seem to be implemented; it could help provide a more meaningful error message to the LLM in case of syntax errors.
- One line I added to the system prompt, for example: "SEARCH REQUIRES PROPER INDENTATION! If the assistant plans to add the line ' print(x)', it must fully write the line out, with ALL leading spaces before the code!"
- My first attempt was the worst, as basically all observations had this "truncated" in the middle of them, even though I doubt they had more than 10K characters and I had no other limit specified.
- Wrong indentation seems to be a major problem, almost by design and not just for this PR's approach: if the LLM does manage to replace something, it is often not the full method, and it places something in the file that doesn't "fit" with the code following it. -> Maybe we need to change the prompts to only replace full methods or something, to prevent this.
- The regex for the search is, for the above reasons, imo way too strict. LLMs often skip inline comments or have other "coding styles", and requiring them to regurgitate the whole exact same code seems counterproductive compared to just using line numbers - this is by design and not because of this PR.
- Maybe we could try to use the same "fuzzy" search and replace from the existing agentskills for the actual operation - see the sketch below.
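A minimal sketch of what such a fuzzy match could look like, using difflib (the windowing and threshold here are assumptions for illustration, not the existing agentskills implementation):

```python
import difflib

def find_fuzzy_match(file_lines: list[str], search_lines: list[str],
                     threshold: float = 0.8) -> int | None:
    """Return the start index of the file window most similar to the SEARCH
    block, or None if nothing clears the similarity threshold."""
    n = len(search_lines)
    best_ratio, best_idx = 0.0, None
    for i in range(len(file_lines) - n + 1):
        window = "\n".join(file_lines[i:i + n])
        ratio = difflib.SequenceMatcher(
            None, window, "\n".join(search_lines)
        ).ratio()
        if ratio > best_ratio:
            best_ratio, best_idx = ratio, i
    return best_idx if best_ratio >= threshold else None

# Example: tolerates the LLM dropping an inline comment from the SEARCH block.
file_lines = ["def f(x):", "    y = x + 1  # bump", "    return y"]
search_lines = ["def f(x):", "    y = x + 1", "    return y"]
print(find_fuzzy_match(file_lines, search_lines))  # 0
```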
Just checking in because I think there were major changes around file editing. Is this PR still valid and will be worked on?
Closing this PR. #3985 has fixed this issue.