OpenHands model performance data

Open bkutasi opened this issue 8 months ago • 11 comments

Hi! I saw that a lot of effort is being put into making various agents work, but there is not much info on which agents work best with which models. Since many LLMs have come out, and many open models are also available, I wonder how they do with OpenHands. I could find this on SWE-bench: "OpenHands + 4x Scaled (2024-02-03)" is in 5th place on SWE-bench Verified, while on Lite, "OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022)" is the best. But there is no info about the latest Sonnet model, or about others like Gemini, DeepSeek V3, etc.

bkutasi avatar Mar 24 '25 20:03 bkutasi

Please see our recommended models: https://docs.all-hands.dev/modules/usage/llms

On the team we mostly use Sonnet 3.7 and 3.5. The others haven't come close to them yet. There is a spreadsheet linked there that shows some models like DeepSeek V3.

mamoodi avatar Mar 25 '25 14:03 mamoodi

I read the docs before opening the ticket, but I felt that adding more information would help a lot. E.g., aider has the polyglot benchmark with the associated cost for each run: https://aider.chat/docs/leaderboards/ It's also worth watching out for upcoming models: the latest DeepSeek V3 (0324) scores better (55.1%) than claude-3-5-sonnet-20241022 (51.6%) on that polyglot benchmark. It is still some way behind the current Sonnet in thinking mode (64.9%), but it's 10-20x cheaper, or free. In the OpenHands benchmark in the spreadsheet you mentioned, it still scores way below claude-3-5-sonnet-20241022.

Maybe for easier tasks, it's worth switching to cheaper models. But the documentation doesn't explain what each agent is best for. The spreadsheet: https://docs.google.com/spreadsheets/d/1wOUdFCMyY6Nt0AIqF705KN4JKOWgeI4wUGUP60krXXs/edit?gid=0#gid=0

bkutasi avatar Mar 25 '25 17:03 bkutasi

Hello! Regarding "OpenHands + 4x Scaled", what does "4x Scaled" mean?

All I could find is https://github.com/SWE-bench/experiments/pull/174

livingstonlarus avatar Mar 26 '25 15:03 livingstonlarus

+1, what does 4x mean?

ryx2 avatar Mar 27 '25 20:03 ryx2

Looks like we have a new local contender: https://huggingface.co/all-hands/openhands-lm-32b-v0.1 It scores at the top among local LLMs (if you can't run DeepSeek)! https://docs.google.com/spreadsheets/d/1wOUdFCMyY6Nt0AIqF705KN4JKOWgeI4wUGUP60krXXs/edit?gid=0#gid=0

Also nicely added to the docs for newcomers: https://docs.all-hands.dev/modules/usage/llms/local-llms

bkutasi avatar Mar 31 '25 20:03 bkutasi

Gemini 2.5 Pro Experimental claims a 63.8% success rate on SWE-bench Verified with a custom agent setup (See blog post). Is OpenHands able to replicate that with the same model?

drice avatar Apr 01 '25 14:04 drice

Not yet, @drice, as far as I know we are still running an initial evaluation, due to the terrible rate limits that Google set on it. (it's an "experimental" model)

On a side note, I haven't been able to find information about the "custom agent setup". Does anyone have an idea what it implies? I can only guess that it's the unreleased agent that Google is reportedly working on.

enyst avatar Apr 01 '25 21:04 enyst

@enyst Thanks! I'm using it through Vertex AI; I haven't encountered any rate limits, and it is still free.

drice avatar Apr 01 '25 21:04 drice

> Not yet, @drice, as far as I know we are still running an initial evaluation, due to the terrible rate limits that Google set on it. (it's an "experimental" model)
>
> On a side note, I haven't been able to find information about the "custom agent setup", anyone has an idea what it implies? I can guess only that it's the unreleased agent that reportedly Google is working on.

If I had to guess, they are probably working with the internal Jules team to make the agent better. I can't find much information about it either; Jules is in closed testing. Regarding the rate limits, I've recently been using it with RooCode basically without limits, after I added a billing account to my Google Cloud project. The billing card can even be an empty card, so there is no risk if Google switches the model from free to billed. The limits are 20 RPM and 1 million tokens/min.
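For anyone scripting against those quotas, here is a minimal sketch of a client-side limiter that stays under 20 requests and 1 million tokens per minute. The class and method names (`MinuteBudget`, `wait_time`, `record`) are hypothetical and not part of OpenHands, RooCode, or any Google SDK, and the sliding 60-second window is an assumption about how the quota is counted.

```python
import time
from collections import deque


class MinuteBudget:
    """Sliding-window budget over request count and token count.

    Hypothetical helper: the 20 RPM / 1M tokens-per-minute defaults come
    from the limits quoted in this thread, not from any official API.
    """

    def __init__(self, max_requests=20, max_tokens=1_000_000,
                 window=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window = window
        self.clock = clock
        self.events = deque()  # (timestamp, tokens) for each recorded request

    def _prune(self, now):
        # Drop requests that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()

    def wait_time(self, tokens):
        """Return seconds to wait before a request of `tokens` fits."""
        if tokens > self.max_tokens:
            raise ValueError("request exceeds the per-window token budget")
        now = self.clock()
        self._prune(now)
        used_requests = len(self.events)
        used_tokens = sum(t for _, t in self.events)
        if used_requests < self.max_requests and used_tokens + tokens <= self.max_tokens:
            return 0.0
        # Expire the oldest requests one by one until the new one fits.
        for ts, t in self.events:
            used_requests -= 1
            used_tokens -= t
            if used_requests < self.max_requests and used_tokens + tokens <= self.max_tokens:
                return max(0.0, ts + self.window - now)
        raise AssertionError("unreachable: an empty window always fits")

    def record(self, tokens):
        """Register a request that was just sent."""
        self.events.append((self.clock(), tokens))
```

A caller would do `time.sleep(budget.wait_time(n))` before each request and `budget.record(n)` after sending it. The injectable `clock` is only there so the logic can be checked without real waiting.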

bkutasi avatar Apr 02 '25 07:04 bkutasi

@drice

> @enyst Thanks! I'm using it through Vertex AI and there are no rate limits I've encountered while still being free.

Following your comment, I wanted to check the same for myself (using OpenHands with Gemini 2.5), but I am getting an "API key missing" error even though I have followed the instructions. It would be a great help if you could answer these three questions:

1. Which version of OpenHands are you running?
2. Which env variables are you passing to the docker command?
3. What setup are you using on the settings page?

amankhandelia avatar Apr 03 '25 12:04 amankhandelia

> @enyst Thanks! I'm using it through Vertex AI and there are no rate limits I've encountered while still being free.

Okay, so from what I gather, you're using Gemini 2.5 on the Vertex AI platform by itself. From my understanding, you're not actually using their custom agent setup, where you have to create a service account with the IAM privileges and then connect it to your OpenHands Docker container via the service account JSON. Once that is connected, you configure the OpenHands model to pick from the Vertex models: go to Advanced, choose Vertex, then enter Google Gemini 2.5 Pro with the exact date.

I did all of that and eventually figured out the specific format they expect for the model name. After that I started getting an error: not a rate limit error, but a max token limit error saying something like I was trying to use 6,355 and needed to use one, or something like that. Is there a way to edit or change the max token limit? Is this a bug? Has anybody actually tested this approach?
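To make the moving parts of that setup concrete, here is a rough sketch of the values involved. It assumes the LiteLLM-style conventions (the `vertex_ai/` model prefix and the `GOOGLE_APPLICATION_CREDENTIALS`, `VERTEXAI_PROJECT`, and `VERTEXAI_LOCATION` variables); whether a given OpenHands version reads exactly these is an assumption to verify against the docs. The helper names are hypothetical, and the model name is left as a placeholder because the thread never states it.

```python
import json

# Fields every Google service account key file is expected to contain.
REQUIRED_KEYS = ("project_id", "client_email", "private_key")


def vertex_env(service_account_json: str, location: str) -> dict:
    """Build the environment variables for a Vertex AI backed model.

    Hypothetical helper. The variable names follow LiteLLM conventions and
    are an assumption about what the container expects.
    """
    info = json.loads(service_account_json)
    missing = [k for k in REQUIRED_KEYS if k not in info]
    if missing:
        raise ValueError(f"service account JSON is missing: {missing}")
    return {
        "GOOGLE_APPLICATION_CREDENTIALS": service_account_json,
        "VERTEXAI_PROJECT": info["project_id"],
        "VERTEXAI_LOCATION": location,
    }


def vertex_model(name: str) -> str:
    """Prefix a model name with the `vertex_ai/` provider tag if needed."""
    return name if name.startswith("vertex_ai/") else f"vertex_ai/{name}"


if __name__ == "__main__":
    # Placeholder key contents; a real key file comes from the IAM console.
    fake_key = json.dumps({
        "project_id": "my-project",
        "client_email": "agent@my-project.iam.gserviceaccount.com",
        "private_key": "-----BEGIN PRIVATE KEY-----...",
    })
    for key, value in vertex_env(fake_key, "us-central1").items():
        print(f"-e {key}=<{len(value)} chars>")
    print(vertex_model("<gemini-2.5-pro-model-name-with-date>"))
```

Each entry of the returned dict would be passed to `docker run` as a `-e` flag, and the prefixed model string is what would go into the advanced model field.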

jmanhype avatar Apr 03 '25 17:04 jmanhype

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar May 04 '25 02:05 github-actions[bot]

Right now those docs are the canonical information that we have. We'll try to continue benchmarking, but I think I'll close this issue. If there's something more concrete that you'd like to see, please open a more specific issue that has a clear "definition of done" and we can see if we can address it.

neubig avatar May 12 '25 17:05 neubig