OpenHands [Bug]: eval-this workflow is not working

Is there an existing issue for the same bug?

[X] I have checked the existing issues.

Describe the bug and reproduction steps

Currently the eval-this workflow is not working, so we should fix it.

Idea:

Switch LM to Claude Haiku
Reduce the run to a subset of SWE-bench instances to make it affordable
Run and make sure that it works
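A minimal sketch of the "subset of instances" step, assuming the instance IDs are available as a plain list of strings. The `select_subset` helper, the seed, and the placeholder IDs are hypothetical and not part of the OpenHands evaluation harness; the point is only that a fixed seed keeps the cheap subset identical from run to run, so results stay comparable.

```python
import random


def select_subset(instance_ids, k, seed=0):
    """Return a reproducible sample of k instance IDs (hypothetical helper)."""
    # Sort first so the result does not depend on the input order.
    ids = sorted(instance_ids)
    if k >= len(ids):
        return ids
    # A dedicated Random instance avoids touching the global random state.
    rng = random.Random(seed)
    return sorted(rng.sample(ids, k))


# Placeholder IDs for illustration only; real SWE-bench IDs would be loaded
# from the dataset instead.
all_ids = [f"example__repo-{n}" for n in range(300)]
subset = select_subset(all_ids, k=10, seed=42)
```

Calling `select_subset` again with the same inputs and seed returns the same 10 IDs, which is what a scheduled or label-triggered workflow would need.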

@csmith49 will take a look at this

OpenHands Installation

Docker command in README

OpenHands Version

No response

Operating System

None

Logs, Errors, Screenshots, and Additional Context

No response

Nov 18 '24 17:11 neubig

The eval-this workflow has two parts:

actual evaluation
integration tests.

openhands-agent has split them here:

https://github.com/All-Hands-AI/OpenHands/pull/5077

I can confirm that the new integration tests workflow works with Haiku:

https://github.com/enyst/playground/pull/8#issuecomment-2495543257
(The browsing ones fail; I'm looking into that. That's a good thing in a way!)

I felt we need the integration tests, in their new form, back in a working state. At some point we removed them from ./tests and refactored them as external scripts, like the evals, also using Deepseek, but they weren't working either.

What do you think about this? Could we have a nightly run for them, also with Haiku, and maybe a label too, in case it's needed?

IMHO it would be cool if we could also have a nightly on Deepseek or something, because:

Haiku has native function calling
Deepseek doesn't, so its runs use different prompts, code, conversion, pydantic serialization, etc. (it really affects behavior IMHO)

There are only about 6 of these integration tests currently (and I'm working on a seventh), but they try to cover some things in realistic use that we just don't have coverage for elsewhere.

Nov 23 '24 17:11 enyst

Oh, also: at this time, the Deepseek API key defined on this repo is depleted. I doubt it's the original reason the eval workflow wasn't working, but it looks like the first thing blocking it currently. 😅 Cc: @neubig

Nov 23 '24 18:11 enyst

This is my proposal on these: (source)

Nov 25 '24 21:11 enyst

Thanks so much for digging into this @enyst! I'm unfortunately a bit short on time to look at this, but @mamoodi, if you'd be able to take a look, I'd love your comments.

Nov 27 '24 14:11 neubig

I think we can remove that whole workflow eval-runner now..

Dec 05 '24 17:12 mamoodi

> I think we can remove that whole workflow eval-runner now..

Sorry, does the new one give results for download? (the .jsonl and other files)

Dec 05 '24 17:12 enyst

You're right. That's missing. Have to see what the file size limit is for comments lol

Dec 05 '24 17:12 mamoodi

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Jan 05 '25 02:01 github-actions[bot]

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Feb 05 '25 01:02 github-actions[bot]

Alright, I think we can close this now. Also, I think we can remove the eval-runner workflow now, right? :)

Feb 26 '25 16:02 mamoodi

Yes, I think so. Thank you!

Feb 26 '25 16:02 enyst