
Implement new agent using AutoCodeRover's approach

foragerr opened this issue 10 months ago

AutoCodeRover from NUS claims 22% on SWE-bench Lite. Their approach constructs an AST of the repository codebase to identify where in the code a patch needs to be applied.

Implement an agent based on ACR's approach.

https://arxiv.org/abs/2404.05427 https://github.com/nus-apr/auto-code-rover
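
For context, a minimal sketch (not ACR's actual code) of the core idea: parse each Python file with the stdlib `ast` module and map every class/function to its definition site, so patch locations can be reasoned about at the level of program structure rather than raw file offsets. The names below (`index_repo`, the index layout) are our own illustration.

```python
import ast
from pathlib import Path

def index_repo(repo_root: str) -> dict[str, list[tuple[str, int]]]:
    """Map each class/function name to the (file, line) pairs where it is defined."""
    index: dict[str, list[tuple[str, int]]] = {}
    for path in Path(repo_root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that don't parse cleanly
        for node in ast.walk(tree):
            if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
                index.setdefault(node.name, []).append((str(path), node.lineno))
    return index

# Hypothetical usage: index_repo("path/to/repo").get("HttpResponse")
# -> [("http/response.py", 42)]
```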

foragerr avatar Apr 09 '24 13:04 foragerr

It now supports running on GitHub issues as well as local issues!

ghost avatar Apr 29 '24 06:04 ghost

I don't think implementing autocoderover is high-priority given that we have better performance! https://xwang.dev/blog/2024/opendevin-codeact-1.0-swebench/

neubig avatar May 09 '24 12:05 neubig

AutoCodeRover's authors actually claim to resolve ~22% of SWE-bench Lite issues. Why does the blog post https://xwang.dev/blog/2024/opendevin-codeact-1.0-swebench/ say AutoCodeRover achieves just 16%?

dsemba avatar May 11 '24 18:05 dsemba

@dsemba see here: https://github.com/OpenDevin/OpenDevin/issues/1693#issuecomment-2105046751

neubig avatar May 11 '24 18:05 neubig

Sorry for commenting on this closed issue, and thank you for your interest in AutoCodeRover!

I would like to give an update on the pass@1 and pass@3 scores in the original AutoCodeRover paper. It turns out that the SWE-bench evaluation environment used in our original experiments underestimated the scores due to missing system-level dependencies: some correct patches were deemed wrong when the SWE-bench acceptance tests were run in that environment.

Thanks to the SWE-bench-docker project, our original patches were re-evaluated; the actual pass@1 score is 19% (instead of 16%) and the pass@3 score is 26% (instead of 22%). More details can be found here.

The 19% pass@1 score is also reflected on the SWE-bench leaderboard.
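
For readers puzzled by the 16%-vs-22% confusion earlier in the thread: pass@1 scores a single run, while pass@3 counts an issue as resolved if any of three independent runs resolves it. A rough sketch of the arithmetic (the data layout is made up, and whether pass@1 uses a fixed run or an average over runs is a detail of the evaluation setup):

```python
# Hypothetical per-issue outcomes across three independent runs.
results = {
    "issue-1": [True, False, True],
    "issue-2": [False, False, False],
    "issue-3": [True, True, True],
}

def pass_at_1(results: dict[str, list[bool]]) -> float:
    """Average single-run resolution rate, taken over the runs."""
    per_run = list(zip(*results.values()))  # regroup outcomes by run
    return sum(sum(run) / len(results) for run in per_run) / len(per_run)

def pass_at_3(results: dict[str, list[bool]]) -> float:
    """Fraction of issues resolved by at least one of the three runs."""
    return sum(any(runs) for runs in results.values()) / len(results)

print(pass_at_1(results))  # ~0.56
print(pass_at_3(results))  # ~0.67
```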

yuntongzhang avatar Jun 24 '24 15:06 yuntongzhang

I don't think implementing autocoderover is high-priority given that we have better performance! xwang.dev/blog/2024/opendevin-codeact-1.0-swebench

@neubig I wouldn't necessarily claim that having a higher overall score means that OpenDevin couldn't benefit even more from techniques used in AutoCodeRover (or some other tool).

IMO, to 'properly' make that assessment you would need to be able to isolate and test how well their methods (e.g. AST construction/search) compare against OpenDevin's equivalent methods. It may be that OpenDevin currently does better because of other parts of its pipeline, but could still benefit from the technique used here.

Though perhaps you have already looked deeper than the above comment suggests, and so have a more 'evidenced' view as to why you don't think there would be improvements to be gained.

  • https://arxiv.org/abs/2404.05427
  • [..snip..] In our approach called AutoCodeRover, LLMs are combined with sophisticated code search capabilities, ultimately leading to a program modification or patch. In contrast to recent LLM agent approaches from AI researchers and practitioners, our outlook is more software engineering oriented. We work on a program representation (abstract syntax tree) as opposed to viewing a software project as a mere collection of files. Our code search exploits the program structure in the form of classes/methods to enhance LLM's understanding of the issue's root cause, and effectively retrieve a context via iterative search.[..snip..]
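
As a rough illustration of the class/method-level search described there, here's a sketch under our own simplified names (ACR's real search interface lives in their repo and differs in detail):

```python
from dataclasses import dataclass

@dataclass
class CodeHit:
    file: str
    lineno: int
    snippet: str

# The index is assumed to be prebuilt from the repo's ASTs; keys are (kind, *names).
Index = dict[tuple, list[CodeHit]]

def search_class(index: Index, name: str) -> list[CodeHit]:
    """Retrieve a class definition by exact name."""
    return index.get(("class", name), [])

def search_method_in_class(index: Index, method: str, cls: str) -> list[CodeHit]:
    """Retrieve a method, scoped to its enclosing class."""
    return index.get(("method", cls, method), [])

def gather_context(index: Index, requests: list[tuple]) -> list[CodeHit]:
    """One retrieval iteration: run whatever lookups the LLM asked for."""
    hits: list[CodeHit] = []
    for kind, *args in requests:
        if kind == "class":
            hits += search_class(index, *args)
        elif kind == "method_in_class":
            hits += search_method_in_class(index, *args)
    return hits
```

Per the abstract, the agent repeats this retrieval loop until it has enough context to localize the issue's root cause, then drafts a patch against the identified class/method rather than against whole files.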


In the space of AST parsing / better code 'repomaps', see also:

  • #742

0xdevalias avatar Jun 25 '24 05:06 0xdevalias

Also, looking at the repo, it looks like AutoCodeRover is now scoring much higher than OpenDevin on SWE-bench Lite (at least based on the 22% reported in the linked blog post):

  • https://github.com/nus-apr/auto-code-rover
    • [June 20, 2024] AutoCodeRover now achieves 30.67% efficacy (pass@1) on SWE-bench-lite!

  • https://www.swebench.com/
    • It's currently sitting at number 3 on the leaderboard.
    • https://github.com/swe-bench/experiments/pull/11
      • The re-evaluated pass@1 score is 19% on SWE-bench-Lite. This PR contains the re-evaluated results from one of the original runs with AutoCodeRover-v20240408.

    • https://github.com/swe-bench/experiments/pull/31
      • In the past month we have been developing AutoCodeRover, and now it's achieving 92/300 (30.67%) on lite.

      • Second, I noticed that your README mentioned that this version of ACR is not open source. Is there a plan to release the code? It'd be a great service to the community + I'm sure a lot of people would be really interested in understanding the strong results 😄

      • Sure, we will certainly update our citation of SWE-agent! The code may be released at a later date with a report.


Edit: Maybe these results are more relevant/up to date than that OpenDevin blog post though?

  • https://huggingface.co/spaces/OpenDevin/evaluation

That suggests OpenDevin CodeActAgent (v1.3) + gpt-4o-2024-05-13 gets 26.67% on SWE-bench Lite; so it's still being beaten by the latest AutoCodeRover.

0xdevalias avatar Jun 25 '24 05:06 0xdevalias