carlos comments

Results: 17 comments of carlos

Feature for supplying installation instructions for arbitrary repos

Another possible solution would be to have an `install_script` attribute in the config that takes a filename or a raw string, which is executed instead of `install_env` during environment setup....
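To make the proposal concrete, here is a minimal sketch of how such an attribute might be resolved. The config is assumed to be a plain dict, the helper name `resolve_install_command` is hypothetical, and the `["install_env"]` return value is only a stand-in for the existing setup path, not the project's actual behaviour.

```python
import os


def resolve_install_command(config: dict) -> list[str]:
    """Pick the command used for environment setup (hypothetical helper).

    Assumption: `install_script` is an optional config key holding either
    a path to a script file or a raw shell string.
    """
    script = config.get("install_script")
    if script is None:
        # No override given: placeholder for the existing `install_env` path.
        return ["install_env"]
    if os.path.isfile(script):
        # The value names an existing file, so run that file with bash.
        return ["bash", script]
    # Otherwise treat the value as a raw shell string.
    return ["bash", "-c", script]
```

With this shape, `resolve_install_command({"install_script": "pip install -e ."})` would produce a `bash -c` invocation, while a config without the key keeps the default setup.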

May I ask where I can download the generated results from Claude and GPTs?

Hi! We've just uploaded them. You can download them [here](https://drive.google.com/drive/folders/1EnrKzGAnsb_NmZKyECGmA2DrAc8ZuJ80?usp=sharing). Let us know if you need anything else!

May I ask where I can download the generated results from Claude and GPTs?

Ah, only 472 were generated due to length constraints. We sample 25% uniformly from the 2294, but some of the sampled instances are longer than gpt-4-32k-0613's context window. (See the gpt4-32k-0613__SWE-Bench_bm25_27K for all...
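The reason the count falls below a quarter of the pool can be sketched as a two-step process: sample uniformly, then drop anything too long for the model. This is an illustration only. The function name, the precomputed token counts, the seed, and the `max_tokens` value are all assumptions, not the pipeline that produced the released files.

```python
import random


def sample_within_context(
    lengths: dict[str, int],
    fraction: float,
    max_tokens: int,
    seed: int = 0,
) -> list[str]:
    """Uniformly sample a fraction of instances, then keep those that fit.

    `lengths` maps a (hypothetical) instance id to its prompt length in
    tokens; how those lengths are measured is outside this sketch.
    """
    rng = random.Random(seed)
    ids = sorted(lengths)  # sort first so the sample is reproducible
    k = round(len(ids) * fraction)
    sampled = rng.sample(ids, k)  # uniform sampling without replacement
    # Instances longer than the context window cannot be generated at all.
    return [i for i in sampled if lengths[i] <= max_tokens]
```

Because the filter runs after sampling, the number of surviving instances depends on how many long prompts happen to land in the sample.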

What is expected to be submitted for the leaderboard integration?

We're still reviewing the process for evaluating submissions. For now, we'd prefer results with a public or soon-to-be-public paper or technical report and the generated patches we can use to...

Predictions for the following instance_ids were not found in the tasks file and will not be considered: SWE-agent__test-repo-i1

Hi @Hk669 Just to be clear, I noticed that you used a "test-repo" in your examples. I'm not sure if that's just a placeholder, but generally the evaluation process will...
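The warning in the title can be reproduced with a small check that compares prediction IDs against the IDs in the tasks file. The sketch below assumes predictions are dicts with an `instance_id` key; the helper name and the valid ID `example__repo-1` are made up for illustration, and this is not the evaluation harness's own code.

```python
def split_predictions(
    predictions: list[dict], task_ids: set[str]
) -> tuple[list[dict], list[str]]:
    """Separate predictions with known instance ids from unknown ones.

    Returns the predictions that can be evaluated and the sorted list of
    ids that do not appear in the tasks file.
    """
    kept = [p for p in predictions if p["instance_id"] in task_ids]
    skipped = sorted({p["instance_id"] for p in predictions} - task_ids)
    return kept, skipped


# Usage: one hypothetical valid id plus the id from the warning message.
preds = [
    {"instance_id": "example__repo-1", "model_patch": "..."},
    {"instance_id": "SWE-agent__test-repo-i1", "model_patch": "..."},
]
kept, skipped = split_predictions(preds, {"example__repo-1"})
```

Here `skipped` holds `SWE-agent__test-repo-i1`, which is the kind of ID the message says will not be considered.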

Upper bound score by skilled human?

Hi @rawwerks, we're aware of this issue. We'll be updating this repository and future evaluations shortly with a solution that I think will be satisfying for everyone. In the mean...

Upper bound score by skilled human?

As mentioned by @Domiii, evaluation on SWE-bench Verified should resolve these concerns, since the potential human upper bound there should be near 100%. Closing this issue for now.

Understanding `get_test_directives` and `make_eval_script_list_py`

We don't currently run _all_ test cases for every repo. For many repositories, this is a necessary convenience since running all tests would be exorbitant and overkill. This assumption trades...
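One plausible way to narrow a test run, shown here only as an illustration, is to run just the test files that a patch touches. The helper below is hypothetical and is not claimed to match what `get_test_directives` or `make_eval_script_list_py` actually do; it assumes the patch is a unified diff in git format.

```python
import re


def changed_files_in_patch(patch: str) -> list[str]:
    """List the files named in a git-style diff, in order, without repeats."""
    # Each file section of a git diff starts with "diff --git a/<old> b/<new>".
    paths = re.findall(
        r"^diff --git a/\S+ b/(\S+)$", patch, flags=re.MULTILINE
    )
    unique: list[str] = []
    for path in paths:
        if path not in unique:
            unique.append(path)
    return unique
```

The resulting paths could then be passed to a test runner instead of the whole suite, which trades some coverage for a much shorter run.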

Failed to apply patch to container

We haven't been experiencing these issues with users' submissions recently. I'm going to close this issue for now, but please open a new issue if you continue experiencing problems.

Error occurred when executing `create_text_dataset`.

Okay, there's an issue with generation for the `train` split at the moment. Are you trying to generate instances for `train` or the `test` split? I'm not sure when we'll...