Add inference for SWT-Bench (with CI)
- [ ] This change is worth documenting at https://docs.all-hands.dev/
- [x] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below
End-user friendly description of the problem this fixes or functionality that this introduces.

This adds the inference step for the SWT-Bench benchmark to OpenHands, with and without CI setup, as evaluated on https://swtbench.com.
Give a summary of what the PR does, explaining any non-trivial design decisions.

The PR adds an instruction to the SWE-Bench setup that allows running inference for the SWT-Bench setting. Moreover, it adds setup scripts and test commands specific to individual SWE-Bench instances to aid the agent in correctly running the test suite of the respective instances. These changes can be controlled via the new `mode` argument to the inference script.
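For illustration only, here is a rough sketch of the kind of switch the `mode` argument introduces on the inference side; the mode names and instance fields below are made-up placeholders, not the actual implementation:

```python
# Hypothetical sketch, not the actual run_infer code: how a `mode` argument
# could select extra per-instance setup before the agent starts. The mode
# names and instance fields below are illustrative assumptions only.
def extra_setup_commands(instance: dict, mode: str = "swe") -> list[str]:
    if mode == "swt":
        # SWT-Bench without CI: only tell the agent how the tests are invoked.
        return [f"echo 'Test command: {instance['test_cmd']}'"]
    if mode == "swt-ci":
        # SWT-Bench with CI: additionally run instance-specific environment setup
        # so the agent can execute the test suite directly.
        return instance.get("setup_cmds", []) + [f"echo 'Test command: {instance['test_cmd']}'"]
    # Plain SWE-Bench: no extra setup.
    return []
```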
Link of any specific issues this addresses.

https://x.com/allhands_ai/status/1899546055143723142 ;)
Note that this does not touch the evaluation script for SWE-Bench. The generated diffs were simply extracted using this script (https://github.com/nielstron/OpenHands/blob/main/extract_predictions.py) and executed using the SWT-Bench evaluation harness like this:
```bash
python -m src.main \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --predictions_path inference_output/openhands_lite.jsonl \
  --max_workers 12 \
  --run_id openhands_lite --patch_types vanilla --build_mode api
```
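For context, the extraction essentially maps each OpenHands output record to the predictions format expected by the harness. A rough sketch of that conversion (the input field names are assumptions; see the linked extract_predictions.py for the real script):

```python
# Rough sketch only: convert OpenHands output.jsonl records into the
# predictions format expected by the SWT-Bench harness. The input field
# names ("test_result", "git_patch") are assumptions; see the linked
# extract_predictions.py for the actual extraction logic.
import json

def extract_predictions(output_path: str, predictions_path: str, model_name: str) -> None:
    with open(output_path) as fin, open(predictions_path, "w") as fout:
        for line in fin:
            record = json.loads(line)
            patch = (record.get("test_result") or {}).get("git_patch") or record.get("git_patch", "")
            fout.write(json.dumps({
                "instance_id": record["instance_id"],
                "model_name_or_path": model_name,
                "model_patch": patch,
            }) + "\n")

extract_predictions("output.jsonl", "inference_output/openhands_lite.jsonl", "OpenHands")
```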
Moreover, I have not yet added any documentation to the main README in the evaluation directory. Any comments on how to proceed are appreciated; I will likely not have time to adapt the evaluation script for SWT-Bench.
@xingyaoww Any thoughts on moving the SWE-gym control flow to the `mode` mechanism used here?
@csmith49 good idea! we should probably do that after this PR gets merged
@nielstron will you be looking at addressing the comments?
I planned to address the comments but have been rather busy over the last few weeks. I can take a look tomorrow!
I added the suggested documentation to the README and the `MODE` parameter to `run_infer`. I will check how this change affects `run_eval` shortly.
I noticed that `output.jsonl` contains diffs to which `.strip()` was applied (in this line). This stripping is not introduced by my PR and breaks some generated diffs (due to mismatching hunk lengths); it might be worth looking into.
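For illustration (not from the PR), a minimal reproduction of the failure mode: a unified-diff hunk header declares how many lines follow, and a trailing blank context line is just a lone space, so `.strip()` deletes it and the hunk becomes shorter than the header claims:

```python
# Minimal demonstration (not from the PR) that .strip() can corrupt a diff:
# the hunk header "@@ -1,3 +1,3 @@" promises three lines on each side, but
# stripping removes the trailing blank context line (a lone " "), so the
# hunk ends up one line short and no longer applies.
import os
import subprocess
import tempfile

diff = (
    "--- a/f.txt\n"
    "+++ b/f.txt\n"
    "@@ -1,3 +1,3 @@\n"
    " a\n"
    "-b\n"
    "+B\n"
    " \n"  # blank context line: just the leading space marker
)

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "f.txt"), "w") as f:
        f.write("a\nb\n\n")
    for label, patch in [("intact", diff), ("stripped", diff.strip() + "\n")]:
        result = subprocess.run(
            ["git", "apply", "--check"], input=patch, text=True,
            cwd=d, capture_output=True,
        )
        print(label, "->", "ok" if result.returncode == 0 else result.stderr.strip())
```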
@juanmichelini I was able to reproduce the error you reported, also on the main branch. I think the issue is that the script proceeds to report results even when the overall eval run failed.
I was able to produce a correct eval on this branch by specifying the instance id and dataset to evaluate:
```bash
./evaluation/benchmarks/swe_bench/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Verified-test/CodeActAgent/gpt-4o-2024-11-20_maxiter_100_N_v0.31.0-no-hint-run_1/output.jsonl "scikit-learn__scikit-learn-13439" princeton-nlp/SWE-bench_Verified
```
@csmith49 thanks for the comments, I have incorporated them as suggested!