feat(eval): loc acc evaluation

Open xhguo7 opened this issue 8 months ago • 2 comments

Description

Add localization evaluation.

Changes

  • Loc evaluation for SWE-Bench
  • Output loc evaluation results when the evaluation is run
  • Incorporate task success into post instance-level evaluation
  • Optional loc evaluation
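
As a rough illustration of what a file-level localization check combined with task success could look like, here is a minimal sketch. It is not the code from this PR: the function names, the metric names, and the choice to measure localization as file overlap between the gold patch and the predicted patch are all assumptions made for the example.

```python
import re

# Matches the "+++ b/<path>" header that git writes for each modified file.
# Deleted files ("+++ /dev/null") are intentionally not matched.
# Caveat: an added source line that itself begins with "++ b/" would also match.
_DIFF_FILE_RE = re.compile(r"^\+\+\+ b/(.+)$", re.MULTILINE)


def files_in_patch(patch: str) -> set[str]:
    """Return the set of file paths touched by a unified git diff."""
    return {path.strip() for path in _DIFF_FILE_RE.findall(patch)}


def file_level_loc(gold_patch: str, pred_patch: str, resolved: bool) -> dict:
    """Hypothetical per-instance record: file-level localization plus task success."""
    gold = files_in_patch(gold_patch)
    pred = files_in_patch(pred_patch)
    hit = gold & pred
    return {
        # Fraction of gold files that the predicted patch also edits.
        "loc_recall": len(hit) / len(gold) if gold else 0.0,
        # Fraction of predicted files that are actually gold files.
        "loc_precision": len(hit) / len(pred) if pred else 0.0,
        # True only when every gold file was edited by the prediction.
        "all_gold_files_found": bool(gold) and gold <= pred,
        # Task success is carried alongside so both can be analyzed together.
        "resolved": resolved,
    }
```

Because the check only needs the two patches and the resolved flag, it can run after inference has finished, which is why making it optional costs nothing in the main evaluation loop.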

Testing

  • Runs with loc-eval set to either True or False
  • Validated on SWE-bench_Verified test set
  • Validated through output checking

xhguo7 avatar May 15 '25 05:05 xhguo7

Many thanks! I will go ahead and revise everything accordingly!

xhguo7 avatar May 15 '25 06:05 xhguo7

Many thanks! I'm on my way to fixing them!

xhguo7 avatar May 20 '25 19:05 xhguo7

Hi @xhguo7 are you still working on this one?

neubig avatar Jun 23 '25 18:06 neubig

Thank you so much for your kind reminder! I'm working on it, and will push an update soon!

xhguo7 avatar Jun 23 '25 20:06 xhguo7

Hello! I'm very sorry for the wait!

I have updated the implementation of localization evaluation on SWE-Bench, which mainly contains the following:

  • Main code: ./evaluation/benchmarks/swe_bench/loc_eval
  • Bash run: ./evaluation/benchmarks/swe_bench/scripts/eval_localization.sh
  • Usage README.md: ./evaluation/benchmarks/swe_bench/loc_eval/README.md

This implementation is entirely post-processing and has been tested and validated on OpenHands (version: 0.45.0, commit: 848f692).

Thank you so much for your time and review! I would be more than happy to make any revisions if needed. Many thanks!

xhguo7 avatar Jun 25 '25 14:06 xhguo7

@xhguo7 is this ready for review? If so, please click the Ready for review button so @neubig and @xingyaoww can be notified that it needs a review.

mamoodi avatar Jun 27 '25 12:06 mamoodi

Thank you so much for your kind reminder! My sincere apologies for the delay. I took a bit more time to test on more inference results and make some updates to better handle edge cases. It's now ready for review. Thanks again for your kind advice!

xhguo7 avatar Jul 01 '25 03:07 xhguo7

@xhguo7 can you fix the linter?

xingyaoww avatar Jul 03 '25 20:07 xingyaoww

Got it! I'm working on it, and will make an update to fix this soon! Many thanks!

xhguo7 avatar Jul 03 '25 23:07 xhguo7

I don't know why it's so hard to merge this PR :D I'm trying to help get it merged but something is always failing. Sorry about that. Let me merge main into it one more time and see if anything improves.

mamoodi avatar Jul 10 '25 18:07 mamoodi

@mamoodi it finally merged! Thank you!

xingyaoww avatar Jul 10 '25 19:07 xingyaoww