
Fix swe bench modal

Open enyst opened this issue 6 months ago • 2 comments

  • [ ] This change is worth documenting at https://docs.all-hands.dev/
  • [ ] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

End-user friendly description of the problem this fixes or functionality this introduces.


Summarize what the PR does, explaining any non-trivial design decisions.


Link of any specific issues this addresses:


To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:06314b2-nikolaik \
  --name openhands-app-06314b2 \
  docker.all-hands.dev/all-hands-ai/openhands:06314b2

enyst avatar Jun 19 '25 17:06 enyst

@xingyaoww Could we update the remote eval command-line parameters to run run_eval.sh with the 5th parameter "modal"? That's the parameter introduced by @ryanhoangt here.

So that we can run an eval-50 labeled workflow on this PR to see how it works.
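For reference, a hedged sketch of what that invocation might look like. The script path is omitted and the parameter order and every value except the 5th "modal" argument are placeholder assumptions, not taken from the repo:

```shell
# Hypothetical sketch of calling run_eval.sh with "modal" as the 5th parameter.
# All values except "modal" are placeholder assumptions.
MODEL_CONFIG="llm.eval"   # placeholder: LLM config name
COMMIT="HEAD"             # placeholder: git ref to evaluate
AGENT="CodeActAgent"      # placeholder: agent name
EVAL_LIMIT="50"           # placeholder: number of instances (eval-50)
RUNTIME="modal"           # the new 5th parameter selecting the Modal backend

# The actual call would then look roughly like:
# ./run_eval.sh "$MODEL_CONFIG" "$COMMIT" "$AGENT" "$EVAL_LIMIT" "$RUNTIME"
echo "run_eval.sh $MODEL_CONFIG $COMMIT $AGENT $EVAL_LIMIT $RUNTIME"
```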

enyst avatar Jun 19 '25 17:06 enyst

@enyst yep! I think @ryanhoangt is on it now!

xingyaoww avatar Jun 20 '25 19:06 xingyaoww

Running evaluation on the PR. Once eval is done, the results will be posted.

github-actions[bot] avatar Jun 21 '25 00:06 github-actions[bot]

Running evaluation on the PR. Once eval is done, the results will be posted.

github-actions[bot] avatar Jun 21 '25 23:06 github-actions[bot]

Evaluation results (Auto Reply): Evaluation failed to complete. Someone from All-Hands-AI needs to investigate.

Download at: https://github.com/All-Hands-AI/pr-eval-results/releases/download/1.0.0/swe-bench-modal-187_25-06-21-23-51.tar.gz

mamoodi avatar Jun 21 '25 23:06 mamoodi

Running evaluation on the PR. Once eval is done, the results will be posted.

github-actions[bot] avatar Jun 23 '25 03:06 github-actions[bot]

Evaluation results (Auto Reply): Evaluation failed to complete. Someone from All-Hands-AI needs to investigate.

Download at: https://github.com/All-Hands-AI/pr-eval-results/releases/download/1.0.0/swe-bench-modal-89_25-06-23-04-04.tar.gz

mamoodi avatar Jun 23 '25 04:06 mamoodi

What doesn't work? 😓 @ryanhoangt

enyst avatar Jun 23 '25 18:06 enyst

@enyst There were a few issues I needed to investigate, and it took a while 😅 This should be ready for testing now.

ryanhoangt avatar Jun 25 '25 15:06 ryanhoangt

Running evaluation on the PR. Once eval is done, the results will be posted.

github-actions[bot] avatar Jun 25 '25 15:06 github-actions[bot]

Oh, no problem 😅 Fingers crossed!

enyst avatar Jun 25 '25 15:06 enyst

Evaluation results (Auto Reply):

## Summary

  • submitted instances: 2
  • empty patch instances: 0
  • resolved instances: 2
  • unresolved instances: 0
  • error instances: 0

Download at: https://github.com/All-Hands-AI/pr-eval-results/releases/download/1.0.0/swe-bench-modal-70_25-06-25-15-56.tar.gz

mamoodi avatar Jun 25 '25 15:06 mamoodi

Running evaluation on the PR. Once eval is done, the results will be posted.

github-actions[bot] avatar Jun 25 '25 16:06 github-actions[bot]

Nice!

xingyaoww avatar Jun 25 '25 16:06 xingyaoww

Evaluation results (Auto Reply):

## Summary

  • submitted instances: 50
  • empty patch instances: 0
  • resolved instances: 29
  • unresolved instances: 21
  • error instances: 0

Download at: https://github.com/All-Hands-AI/pr-eval-results/releases/download/1.0.0/swe-bench-modal-476_25-06-25-17-48.tar.gz

mamoodi avatar Jun 25 '25 17:06 mamoodi

OK...

  • the latest run-50 results these days, with the old eval_remote, were 25/50 and 27/50
  • this run is 29/50
  • we've seen 30-32 too, IIRC, for this subset on local runs

Does this 29/50 look good enough, @xingyaoww, or should we dig further?
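As a quick sanity check, the rates being compared work out as follows (a throwaway sketch; the counts are copied from this thread, and the run labels are made up for illustration):

```python
# Resolve rates quoted in this thread, expressed as percentages.
runs = {
    "old eval_remote (run 1)": (25, 50),
    "old eval_remote (run 2)": (27, 50),
    "this PR": (29, 50),
    "local runs (best seen, approx.)": (32, 50),
}
for name, (resolved, total) in runs.items():
    print(f"{name}: {resolved}/{total} = {resolved / total:.0%}")
```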

enyst avatar Jun 25 '25 17:06 enyst

@ryanhoangt for some reason we are still using Sonnet 3.7 for eval in the eval job, no wonder this is so low 😓

We should fix that

xingyaoww avatar Jun 25 '25 17:06 xingyaoww

Running evaluation on the PR. Once eval is done, the results will be posted.

github-actions[bot] avatar Jun 26 '25 14:06 github-actions[bot]

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run Python Unit Tests
    • Docker

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #9242

Feel free to include any additional details that might help me get this PR into a better state.


openhands-ai[bot] avatar Jun 26 '25 14:06 openhands-ai[bot]

Running evaluation on the PR. Once eval is done, the results will be posted.

github-actions[bot] avatar Jun 26 '25 14:06 github-actions[bot]

Evaluation results (Auto Reply):

## Summary

  • submitted instances: 50
  • empty patch instances: 0
  • resolved instances: 34
  • unresolved instances: 16
  • error instances: 0

Download at: https://github.com/All-Hands-AI/pr-eval-results/releases/download/1.0.0/swe-bench-modal-633_25-06-26-15-31.tar.gz

mamoodi avatar Jun 26 '25 15:06 mamoodi

34/50 = 68%

Nice!

xingyaoww avatar Jun 26 '25 15:06 xingyaoww

Shall we go with it and merge this PR?

enyst avatar Jun 26 '25 16:06 enyst

Let's go! We should find a way to keep track of all evaluation results here 👀, maybe like what we did in our integration tests.

xingyaoww avatar Jun 26 '25 16:06 xingyaoww