
[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?

Results 69 SWE-bench issues

Enhancements to run_evaluation.py: This pull request adds an optional command-line argument `--path_conda` to the run_evaluation.py script. The main functionality enhancement is the ability to specify a custom path to a...
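An optional path argument of this kind can be sketched with `argparse`. This is a minimal illustration, not the pull request's actual code: only the flag name `--path_conda` comes from the description above, while the helper `resolve_conda_path`, the default of `None`, and the help text are assumptions.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing only the optional flag discussed above."""
    parser = argparse.ArgumentParser(description="Sketch of an evaluation CLI")
    parser.add_argument(
        "--path_conda",
        type=str,
        default=None,  # assumption: omitting the flag keeps the default behavior
        help="(Optional) path to an existing conda installation",
    )
    return parser


def resolve_conda_path(path_conda):
    """Hypothetical helper: normalize the path, or return None if the flag is unset."""
    if path_conda is None:
        return None
    return os.path.abspath(os.path.expanduser(path_conda))


if __name__ == "__main__":
    # Parse an explicit argument list so the sketch runs without a real command line.
    args = build_parser().parse_args(["--path_conda", "~/miniconda3"])
    print(resolve_conda_path(args.path_conda))
```

Keeping the default at `None` lets callers distinguish "no custom path given" from an explicit path without changing behavior for existing invocations.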

I would be extremely grateful if you could share the log files from the baseline model testing process discussed in your study. Access to these logs, specifically those with names...

I checked https://www.swebench.com and found: ![image](https://github.com/princeton-nlp/SWE-bench/assets/8592144/6c2e7ed9-3bc4-4046-a89f-a3f6d64a2533)

It would be nice to have an "Experiments" directory for reproducible research via MLflow or a similar tool.

I have read your paper and really like this work. Could I ask where I can download the generated results from Claude and GPTs? These results are beneficial to our...

[HuggingFace Endpoints](https://huggingface.co/inference-endpoints/dedicated) seems to be an easier way to run SWE-Llama models on the cloud. Tips for coding this feature?

Thank you for providing a great data set. I have tested it in my environment and confirmed that the fail to pass test does not fail in the following instances,...

evaluation

Firstly, thanks for this useful dataset. Some of the pytest logs have a format such as this: ``` PASSED sklearn/feature_extraction/tests/test_text.py::test_callable_analyzer_error[file-AttributeError-'str' object has no attribute 'read'-CountVectorizer] PASSED sklearn/feature_extraction/tests/test_text.py::test_callable_analyzer_error[file-AttributeError-'str' object has no...

evaluation
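Test IDs like the one quoted above contain spaces, so splitting a log line on every whitespace character truncates the ID. One way to handle this is to split only once, after the status word. The sketch below is an assumption about how such lines could be parsed, not the harness's own parser; `STATUSES` and `parse_status_lines` are hypothetical names, and `FAILED` lines that carry a trailing ` - message` are not handled specially.

```python
# Status words assumed to start a pytest short-summary line.
STATUSES = {"PASSED", "FAILED", "ERROR", "SKIPPED", "XFAIL", "XPASS"}


def parse_status_lines(log_text):
    """Map each test ID to its status, keeping spaces inside the test ID."""
    results = {}
    for line in log_text.splitlines():
        # Split once: the first token is the status, the remainder is the full ID.
        parts = line.strip().split(maxsplit=1)
        if len(parts) == 2 and parts[0] in STATUSES:
            status, test_id = parts
            results[test_id] = status
    return results
```

Lines that do not start with a known status word are skipped, so surrounding log output does not pollute the result.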

`harness/run_evaluation.py` takes a `--log_dir` argument, but if there are multiple predictions for a single model and test in the `--predictions_path` file, they all write to a single file in the...
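A possible workaround for this kind of collision is to give repeated predictions distinct log file names. The sketch below only illustrates the idea under stated assumptions: the function `unique_log_path`, the `seen` counter, and the `.eval.log` naming pattern are all hypothetical and are not taken from the harness.

```python
import os


def unique_log_path(log_dir, instance_id, model, seen):
    """Return a log path that differs for repeated (instance, model) pairs.

    `seen` is a dict the caller keeps across calls; it counts how many
    times each (instance, model) pair has already been assigned a path.
    """
    base = f"{instance_id}.{model}"
    count = seen.get(base, 0)
    seen[base] = count + 1
    # The first prediction keeps the plain name; later ones get a numeric suffix.
    suffix = "" if count == 0 else f".{count}"
    return os.path.join(log_dir, f"{base}{suffix}.eval.log")
```

With one shared `seen` dict, a second call for the same instance and model yields a name with a `.1` suffix, so the earlier log is not overwritten.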

This is a very challenging benchmark, and I have learned a lot from it. Thank you for the effort you have put into this. I tested using the swe-llama13b you provided...

evaluation