[Bug]: eval-this workflow is not working
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Describe the bug and reproduction steps
Currently the eval-this workflow is not working, so we should fix it.
Idea:
- Switch LM to Claude Haiku
- Run on only a subset of SWE-bench instances to keep it affordable
- Run and make sure that it works
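For the subset step, here is a minimal sketch of picking a small, reproducible sample of instance IDs. The function name `select_subset`, the seed, and the example IDs are all hypothetical; this is not how the eval-this workflow currently selects instances, just one way it could be done under those assumptions.

```python
import random


def select_subset(instance_ids, k, seed=42):
    """Return a reproducible sample of at most k instance IDs.

    Hypothetical helper: the name, default seed, and ID format are
    assumptions for illustration, not part of the existing workflow.
    """
    # Sort first so the result does not depend on the input order.
    ids = sorted(instance_ids)
    if k >= len(ids):
        return ids
    # A dedicated Random instance keeps the sample stable across runs
    # without touching the global random state.
    rng = random.Random(seed)
    return sorted(rng.sample(ids, k))


if __name__ == "__main__":
    # Placeholder IDs; real SWE-bench instance IDs would be loaded from the dataset.
    all_ids = [f"example__repo-{i}" for i in range(300)]
    subset = select_subset(all_ids, k=20)
    print(len(subset))
```

Fixing the seed means every run evaluates the same instances, so results stay comparable from one run to the next.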
@csmith49 will take a look at this
OpenHands Installation
Docker command in README
OpenHands Version
No response
Operating System
None
Logs, Errors, Screenshots, and Additional Context
No response
The eval-this workflow has two parts:
- actual evaluation
- integration tests.
openhands-agent has split them here:
- https://github.com/All-Hands-AI/OpenHands/pull/5077
I can confirm that the new integration tests workflow works with Haiku:
- https://github.com/enyst/playground/pull/8#issuecomment-2495543257
- (the browsing ones fail; I'm looking into that, and it's a good thing in a way!)
I felt we needed the integration tests, in their new form, back in a working state. At some point we removed them from ./tests and refactored them as external scripts, like the evals, also using Deepseek, but they weren't working either.
What do you think about this? Could we have a nightly run for them, also with Haiku (and maybe a label too, in case it's needed)?
IMHO it would be cool if we could also have a nightly on Deepseek or something similar, because:
- Haiku has native function calling
- Deepseek doesn't, so the runs use different prompt/code/conversion/pydantic serialization/etc (it really affects stuff IMHO)
There are only about 6 of these integration tests currently (and I'm working on a seventh), but they do try to cover some real-world usage that we just don't have coverage for elsewhere.
Oh, also: at this time, the Deepseek API key defined on this repo is depleted. I doubt that's the original reason the eval workflow wasn't working, but it looks like the first thing that fails right now. 😅 Cc: @neubig
Thanks so much for digging into this, @enyst! Unfortunately I'm a bit short on time to look at this, but @mamoodi, if you're able to take a look, I'd love your comments.
I think we can remove the whole eval-runner workflow now.
Sorry, does the new one give results for download? (the .jsonl and other files)
You're right. That's missing. I'll have to see what the file size limit is for comments lol
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Alright, I think we can close this now. Also, I think we can remove the eval-runner workflow now, right? :)
Yes, I think so. Thank you!