feat(eval): loc acc evaluation
Description
Add localization evaluation.
Changes
- Loc evaluation for SWE-Bench
- Output loc evaluation results through running
- Incorporate task success into post instance-level evaluation
- Optional loc evaluation
Testing
- Runnable when setting loc-eval to either True or False
- Validated on SWE-bench_Verified test set
- Validated through output checking
Many thanks! I will go ahead and revise everything accordingly!
Many thanks! I'm on my way to fixing them!
Hi @xhguo7 are you still working on this one?
Thank you so much for your kind reminder! I'm working on it, and will push an update soon!
Hello! I'm very sorry for the wait!
I have updated the implementation of localization evaluation on SWE-Bench, which mainly contains the following:
- Main code: ./evaluation/benchmarks/swe_bench/loc_eval
- Bash run: ./evaluation/benchmarks/swe_bench/scripts/eval_localization.sh
- Usage README.md: ./evaluation/benchmarks/swe_bench/loc_eval/README.md
This implementation is entirely post-processing, and has been tested and validated on OpenHands (version: 0.45.0, commit: 848f692).
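For readers unfamiliar with the idea, a minimal sketch of what a post-processing, file-level localization check could look like is given below. It is not the code in this PR: the function names `files_in_patch` and `file_loc_metrics`, the choice of metrics, and the assumption that both the predicted and gold patches are unified diffs with `+++ b/<path>` headers are all illustrative.

```python
import re


def files_in_patch(patch: str) -> set[str]:
    """Collect paths from '+++ b/<path>' headers of a unified diff.

    Hypothetical helper: deleted files ('+++ /dev/null') are ignored,
    and no attempt is made to handle unusual diff formats.
    """
    files = set()
    for line in patch.splitlines():
        match = re.match(r"^\+\+\+ b/(.+)$", line)
        if match:
            files.add(match.group(1).strip())
    return files


def file_loc_metrics(pred_patch: str, gold_patch: str) -> dict:
    """Compare files edited by the prediction against files in the gold patch."""
    pred = files_in_patch(pred_patch)
    gold = files_in_patch(gold_patch)
    hit = pred & gold
    return {
        # Fraction of gold files that the prediction also touched.
        "recall": len(hit) / len(gold) if gold else 0.0,
        # Fraction of predicted files that are in the gold patch.
        "precision": len(hit) / len(pred) if pred else 0.0,
        # True only when every gold file was touched by the prediction.
        "all_gold_found": bool(gold) and gold <= pred,
    }
```

Because a check like this only reads patches that already exist, it can run after inference without re-running the agent, which is the sense in which it is post-processing.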
Thank you so much for your time and review! I would be more than happy to do any revisions if needed. Many thanks!
@xhguo7 is this ready for review? If so please click the Ready for review button so @neubig and @xingyaoww can be notified it needs a review.
Thank you so much for your kind reminder! My sincere apologies for the delay. I took a bit more time to test on more inference results and make some updates to better handle edge cases. It's now ready for review. Thanks again for your kind advice!
@xhguo7 can you fix the linter?
Got it! I'm working on it, and will make an update to fix this soon! Many thanks!
I don't know why it's so hard to merge this PR :D I'm trying to help get it merged but something is always failing. Sorry about that. Let me merge main into it one more time and see if anything improves.
@mamoodi it finally merges! Thank you!