Fix SWE-bench Modal
- [ ] This change is worth documenting at https://docs.all-hands.dev/
- [ ] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below
End-user friendly description of the problem this fixes or functionality this introduces.
Summarize what the PR does, explaining any non-trivial design decisions.
Link of any specific issues this addresses:
To run this PR locally, use the following command:
docker run -it --rm -p 3000:3000 -v /var/run/docker.sock:/var/run/docker.sock --add-host host.docker.internal:host-gateway -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:06314b2-nikolaik --name openhands-app-06314b2 docker.all-hands.dev/all-hands-ai/openhands:06314b2
@xingyaoww Could we update the remote eval command-line parameters to run run_eval.sh with the 5th parameter "modal"? The parameter was introduced by @ryanhoangt here.
That way we can run an eval-50 labeled workflow on this PR to see how it works.
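(For illustration, a hedged sketch of the kind of invocation meant here; the first four positional arguments are placeholders, since the workflow's actual parameter values are not shown in this thread.)

```bash
# Hypothetical sketch only: the real positional parameters are defined by the
# remote-eval workflow and are not reproduced here. The requested change is to
# pass "modal" as the 5th argument so run_eval.sh uses the Modal-based
# evaluation backend introduced for this PR.
bash run_eval.sh "$ARG1" "$ARG2" "$ARG3" "$ARG4" modal
```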
@enyst yep! I think @ryanhoangt is on it now!
Running evaluation on the PR. Once eval is done, the results will be posted.
Evaluation results (Auto Reply): Evaluation failed to complete. Someone from All-Hands-AI needs to investigate.

Download at: https://github.com/All-Hands-AI/pr-eval-results/releases/download/1.0.0/swe-bench-modal-187_25-06-21-23-51.tar.gz
Running evaluation on the PR. Once eval is done, the results will be posted.
Evaluation results (Auto Reply): Evaluation failed to complete. Someone from All-Hands-AI needs to investigate.

Download at: https://github.com/All-Hands-AI/pr-eval-results/releases/download/1.0.0/swe-bench-modal-89_25-06-23-04-04.tar.gz
What doesn't work? @ryanhoangt
@enyst There were a few issues I needed to investigate, and it took a bit long. This should be ready for testing now.
Running evaluation on the PR. Once eval is done, the results will be posted.
Oh, no problem. Fingers crossed!
Evaluation results (Auto Reply):

## Summary
- submitted instances: 2
- empty patch instances: 0
- resolved instances: 2
- unresolved instances: 0
- error instances: 0

Download at: https://github.com/All-Hands-AI/pr-eval-results/releases/download/1.0.0/swe-bench-modal-70_25-06-25-15-56.tar.gz
Running evaluation on the PR. Once eval is done, the results will be posted.
Nice!
Evaluation results (Auto Reply):

## Summary
- submitted instances: 50
- empty patch instances: 0
- resolved instances: 29
- unresolved instances: 21
- error instances: 0

Download at: https://github.com/All-Hands-AI/pr-eval-results/releases/download/1.0.0/swe-bench-modal-476_25-06-25-17-48.tar.gz
OK...
- The latest run-50s these days, with the old eval_remote, were 25/50 and 27/50; this one is 29/50.
- We've seen 30-32 too, IIRC, for this subset, on local runs.

Does this 29/50 look good enough, @xingyaoww, or should we dig further?
@ryanhoangt For some reason we are still using Sonnet 3.7 for eval in the eval job; no wonder this is so low.
We should fix that.
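(For illustration, a hedged sketch of where the eval model typically comes from in OpenHands; the `[llm.eval]` section name and the model string below are assumptions, not the actual values used by this workflow.)

```bash
# Hypothetical sketch, not this repo's actual eval configuration.
# OpenHands reads LLM settings from an [llm.<name>] section in config.toml,
# and the eval scripts select that section by name; pointing its `model` field
# at the intended model (string assumed below) is the kind of fix meant here.
cat >> config.toml <<'EOF'
[llm.eval]
model = "anthropic/claude-sonnet-4-20250514"  # assumed model string
temperature = 0.0
EOF
```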
Running evaluation on the PR. Once eval is done, the results will be posted.
Looks like there are a few issues preventing this PR from being merged!
- GitHub Actions are failing:
  - Run Python Unit Tests
  - Docker

If you'd like me to help, just leave a comment, like:
@OpenHands please fix the failing actions on PR #9242
Feel free to include any additional details that might help me get this PR into a better state.
Running evaluation on the PR. Once eval is done, the results will be posted.
Evaluation results (Auto Reply):

## Summary
- submitted instances: 50
- empty patch instances: 0
- resolved instances: 34
- unresolved instances: 16
- error instances: 0

Download at: https://github.com/All-Hands-AI/pr-eval-results/releases/download/1.0.0/swe-bench-modal-633_25-06-26-15-31.tar.gz
34/50=68%
Nice!
Can we go ahead and merge this PR?
Let's go! We should find a way to keep track of every evaluation result here, maybe like what we did in our integration tests.