SWE-bench
May I ask where I can download the generated results from Claude and the GPT models?
I have read your paper and really like this work. Could I ask where I can download the generated results from Claude and the GPT models? These results would be beneficial to our work.
Thank you so much for your kind response! Have a nice day! : )
Hi! We've just uploaded them. You can download them here.
Let us know if you need anything else!
Thanks for your kind response!
With the help of your scripts, I have reproduced the evaluation on the generations file gpt-4-32k-0613__SWE-bench_oracle__gpt4-subset.jsonl, but my result differs from the paper.
My result is shown below
gpt-4-32k-0613 Evaluation Report:
None: 0
Generated: 472
With Logs: 472
Applied: 76
Resolved: 3
| % Resolved | % Apply |
|---|---|
| 0.64 | 16.10 |
while the results shown in Table 5 of your paper are as below:

| % Resolved | % Apply |
|---|---|
| 1.74 | 13.20 |
Moreover, I find that there are 472 lines in the file gpt-4-32k-0613__SWE-bench_oracle__gpt4-subset.jsonl, which means 472 generated results, but this number (472) does not equal 25% of 2294 (≈574).
Is this file missing some results? How can I reproduce the same results as the paper?
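For reference, a minimal sketch of the check described above, assuming the standard SWE-bench predictions format where each JSONL line carries an instance_id field (the filename is taken from this thread):

```python
import json

# Count the generations in the oracle-retrieval predictions file and compare
# against the expected size of a 25% sample of the 2,294 test instances.
path = "gpt-4-32k-0613__SWE-bench_oracle__gpt4-subset.jsonl"
with open(path) as f:
    preds = [json.loads(line) for line in f]

ids = {p["instance_id"] for p in preds}
print(f"generations: {len(preds)}")        # 472 in this file
print(f"unique instance ids: {len(ids)}")  # sanity check for duplicates
print(f"25% of 2294: {0.25 * 2294}")       # 573.5, i.e. ~574 expected
```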
Ah, only 472 were generated due to length constraints. We sample 25% uniformly from the 2294 instances, but some of those prompts are longer than gpt-4-32k-0613's context window. (See the gpt4-32k-0613__SWE-Bench_bm25_27K file for all the instance ids.) I'll look into our evaluation of those generations.
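A minimal sketch of how the skipped instances could be identified, assuming both prediction files use the same JSONL format with one instance_id per line (the bm25 filename below is inferred from the reply above and may need adjusting):

```python
import json

def load_ids(path):
    """Collect the instance_id of every prediction in a SWE-bench JSONL file."""
    with open(path) as f:
        return {json.loads(line)["instance_id"] for line in f}

oracle_ids = load_ids("gpt-4-32k-0613__SWE-bench_oracle__gpt4-subset.jsonl")
bm25_ids = load_ids("gpt4-32k-0613__SWE-Bench_bm25_27K.jsonl")  # filename assumed

# Instances present in the 25% sample (bm25 file) but skipped in the oracle
# setting because the oracle prompt exceeded the model's context window.
skipped = bm25_ids - oracle_ids
print(len(skipped))
print(sorted(skipped)[:10])
```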
Thanks very much for your quick and kind response!
That means the apply ratio is very similar to the one in the paper (76/574 = 13.24% ≈ 13.20%).
The only difference is the number of resolved samples. As shown in the paper, 10 samples (10/574 = 1.74%) are resolved. If you could provide the instance_id of these resolved samples, I will look into the logs and try to find the reason.
Are these the ones shared on www.swebench.com or different?
We recently updated the paper to reflect the corrections discussed in this issue.
We'll also release the execution logs + predictions for all models run so far on SWE-bench via the website.
Closing the issue for now, but I will post an update here when the website is updated!
"This repository contains the predictions, execution logs, trajectories, and results for model inference + evaluation runs on the SWE-bench task." https://github.com/swe-bench/experiments