SWE-bench
May I ask where I can download the generated results from Claude and the GPT models?
I have read your paper and really like this work. Could I ask where I can download the generated results from Claude and the GPT models? These results would be beneficial to our work.
Thank you so much for your kind response! Have a nice day! : )
Hi! We've just uploaded them. You can download them here.
Let us know if you need anything else!
Thanks for your kind response!
With the help of your scripts, I have reproduced the evaluation on the generations file gpt-4-32k-0613__SWE-bench_oracle__gpt4-subset.jsonl, but my result differs from the paper.
My result is shown below
gpt-4-32k-0613 Evaluation Report:
None: 0
Generated: 472
With Logs: 472
Applied: 76
Resolved: 3
| % Resolved | % Apply |
|---|---|
| 0.64 | 16.10 |
while the results shown in Table 5 of your paper are as below:

| % Resolved | % Apply |
|---|---|
| 1.74 | 13.20 |
Moreover, I find that there are 472 lines in the file gpt-4-32k-0613__SWE-bench_oracle__gpt4-subset.jsonl, which means 472 generated results, but this number (472) does not equal 25% of 2294 (≈574).
Is this file missing some results? How can I reproduce the same results as the paper?
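For reference, a minimal sketch of the check described above, assuming the standard SWE-bench predictions format where each JSONL line carries an instance_id field (the filename is taken from this thread):

```python
import json

# Count the generations in the oracle-retrieval predictions file and compare
# against the expected size of a 25% sample of the 2,294 test instances.
path = "gpt-4-32k-0613__SWE-bench_oracle__gpt4-subset.jsonl"
with open(path) as f:
    preds = [json.loads(line) for line in f]

ids = {p["instance_id"] for p in preds}
print(f"generations: {len(preds)}")        # 472 in this file
print(f"unique instance ids: {len(ids)}")  # sanity check for duplicates
print(f"25% of 2294: {0.25 * 2294}")       # 573.5, i.e. ~574 expected
```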
Ah, only 472 were generated due to length constraints. We sample 25% uniformly from the 2294 instances, but some of those prompts are longer than gpt-4-32k-0613's context window. (See the gpt4-32k-0613__SWE-Bench_bm25_27K file for all the instance ids.) I'll look into our evaluation of those generations.
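A minimal sketch of how the skipped instances could be identified, assuming both prediction files use the same JSONL format with one instance_id per line (the bm25 filename below is inferred from the reply above and may need adjusting):

```python
import json

def load_ids(path):
    """Collect the instance_id of every prediction in a SWE-bench JSONL file."""
    with open(path) as f:
        return {json.loads(line)["instance_id"] for line in f}

oracle_ids = load_ids("gpt-4-32k-0613__SWE-bench_oracle__gpt4-subset.jsonl")
bm25_ids = load_ids("gpt4-32k-0613__SWE-Bench_bm25_27K.jsonl")  # filename assumed

# Instances present in the 25% sample (bm25 file) but skipped in the oracle
# setting because the oracle prompt exceeded the model's context window.
skipped = bm25_ids - oracle_ids
print(len(skipped))
print(sorted(skipped)[:10])
```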
Thanks very much for your quick and kind response!
That means the apply ratio is very similar to the one in the paper (76/574 = 13.24% ≈ 13.20%).
The only difference is the number of resolved samples. As shown in the paper, 10 samples (10/574 = 1.74%) are resolved. If you could provide the instance_id of these resolved samples, I will look into the logs and try to find the reason.
Are these the ones shared on www.swebench.com or different?
We recently updated the paper to reflect the corrections discussed in this issue.
We'll also release the execution logs + predictions for all models run so far on SWE-bench via the website.
Closing the issue for now, but I will post an update here when the website is updated!
"This repository contains the predictions, execution logs, trajectories, and results for model inference + evaluation runs on the SWE-bench task." https://github.com/swe-bench/experiments