blarify Evaluate performance against SWE-Bench

It would be interesting to see if/how blar performs against the SWE-Bench benchmarks:

https://www.swebench.com/
https://github.com/princeton-nlp/SWE-bench
- [ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?
https://arxiv.org/abs/2310.06770
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
  Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan
  Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We consider real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. We therefore introduce SWE-bench, an evaluation framework including 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. Claude 2 and GPT-4 solve a mere 4.8% and 1.7% of instances respectively, even when provided with an oracle retriever. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.

Apr 05 '24 01:04 0xdevalias

We're on it and can keep you updated. We've created a Discord server where we will post our progress.

Apr 05 '24 13:04 berrazuriz1

@berrazuriz1 Sounds good, though I generally find Discord a super noisy/inefficient way to follow updates (particularly given how every project seems to have one these days).

Hopefully you can post any 'major progress' milestones to this issue/similar as well?

Apr 06 '24 04:04 0xdevalias

@v4rgas I see you closed this as completed. Are you able to link to the results, or update this issue with them, to maintain continuity? I searched the leaderboards for blar and didn't see any results.

Feb 21 '25 03:02 0xdevalias