Fix SWE-bench Modal
- [ ] This change is worth documenting at https://docs.all-hands.dev/
- [ ] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below
End-user friendly description of the problem this fixes or functionality this introduces.
Summarize what the PR does, explaining any non-trivial design decisions.
Link of any specific issues this addresses:
To run this PR locally, use the following command:
docker run -it --rm -p 3000:3000 -v /var/run/docker.sock:/var/run/docker.sock --add-host host.docker.internal:host-gateway -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:06314b2-nikolaik --name openhands-app-06314b2 docker.all-hands.dev/all-hands-ai/openhands:06314b2
@xingyaoww Could we update the remote eval command-line parameters to run run_eval.sh with the 5th parameter "modal"? The parameter was introduced by @ryanhoangt here.
That way we can run an eval-50 labeled workflow on this PR to see how it works.
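(For illustration, a hedged sketch of the kind of invocation meant here; the first four positional arguments are placeholders, since the workflow's actual parameter values are not shown in this thread.)

```bash
# Hypothetical sketch only: the real positional parameters are defined by the
# remote-eval workflow and are not reproduced here. The requested change is to
# pass "modal" as the 5th argument so run_eval.sh uses the Modal-based
# evaluation backend introduced for this PR.
bash run_eval.sh "$ARG1" "$ARG2" "$ARG3" "$ARG4" modal
```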
@enyst yep! I think @ryanhoangt is on it now!
Running evaluation on the PR. Once eval is done, the results will be posted.
Evaluation results (Auto Reply): Evaluation failed to complete. Someone from All-Hands-AI needs to investigate.

Download at: https://github.com/All-Hands-AI/pr-eval-results/releases/download/1.0.0/swe-bench-modal-187_25-06-21-23-51.tar.gz
Running evaluation on the PR. Once eval is done, the results will be posted.
Evaluation results (Auto Reply): Evaluation failed to complete. Someone from All-Hands-AI needs to investigate.

Download at: https://github.com/All-Hands-AI/pr-eval-results/releases/download/1.0.0/swe-bench-modal-89_25-06-23-04-04.tar.gz
What doesn't work? @ryanhoangt
@enyst There were a few issues I needed to investigate, and it took a bit long. This should be ready for testing now.
Running evaluation on the PR. Once eval is done, the results will be posted.
Oh, no problem. Fingers crossed!
Evaluation results (Auto Reply):

## Summary
- submitted instances: 2
- empty patch instances: 0
- resolved instances: 2
- unresolved instances: 0
- error instances: 0

Download at: https://github.com/All-Hands-AI/pr-eval-results/releases/download/1.0.0/swe-bench-modal-70_25-06-25-15-56.tar.gz
Running evaluation on the PR. Once eval is done, the results will be posted.
Nice!
Evaluation results (Auto Reply):

## Summary
- submitted instances: 50
- empty patch instances: 0
- resolved instances: 29
- unresolved instances: 21
- error instances: 0

Download at: https://github.com/All-Hands-AI/pr-eval-results/releases/download/1.0.0/swe-bench-modal-476_25-06-25-17-48.tar.gz
OK...
- The latest run-50s these days, with the old eval_remote, were 25/50 and 27/50; this one is 29/50.
- We've seen 30-32 too, IIRC, for this subset, on local runs.

Does this 29/50 look good enough, @xingyaoww, or should we dig further?
@ryanhoangt For some reason we are still using Sonnet 3.7 for eval in the eval job; no wonder this is so low.
We should fix that.
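(For illustration, a hedged sketch of where the eval model typically comes from in OpenHands; the `[llm.eval]` section name and the model string below are assumptions, not the actual values used by this workflow.)

```bash
# Hypothetical sketch, not this repo's actual eval configuration.
# OpenHands reads LLM settings from an [llm.<name>] section in config.toml,
# and the eval scripts select that section by name; pointing its `model` field
# at the intended model (string assumed below) is the kind of fix meant here.
cat >> config.toml <<'EOF'
[llm.eval]
model = "anthropic/claude-sonnet-4-20250514"  # assumed model string
temperature = 0.0
EOF
```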
Running evaluation on the PR. Once eval is done, the results will be posted.
Looks like there are a few issues preventing this PR from being merged!
- GitHub Actions are failing:
  - Run Python Unit Tests
  - Docker

If you'd like me to help, just leave a comment, like:
@OpenHands please fix the failing actions on PR #9242
Feel free to include any additional details that might help me get this PR into a better state.
Running evaluation on the PR. Once eval is done, the results will be posted.
Evaluation results (Auto Reply):

## Summary
- submitted instances: 50
- empty patch instances: 0
- resolved instances: 34
- unresolved instances: 16
- error instances: 0

Download at: https://github.com/All-Hands-AI/pr-eval-results/releases/download/1.0.0/swe-bench-modal-633_25-06-26-15-31.tar.gz
34/50=68%
Nice!
Can we go ahead and merge this PR?
Let's go! We should find a way to keep track of every evaluation result here, maybe like what we did in our integration tests.