Create a competitive agent with open LLMs

neubig opened this issue 10 months ago • 14 comments

What problem or use case are you trying to solve?

Currently OpenDevin works somewhat well with the strongest closed LLMs, such as GPT-4 or Claude Opus, but we have not confirmed good results with open LLMs that can be run locally. We would like to create a recipe for achieving competitive results with local LLMs.

Do you have thoughts on the technical implementation?

This will require a strong (perhaps fine-tuned) coding agent LLM. It will probably have to be fine-tuned from strong code LMs such as CodeLlama, StarCoder, DeepSeek-Coder, or some other yet-to-be-released LLM.

neubig avatar Apr 14 '24 02:04 neubig

The user should be able to choose a single LLM or multiple LLMs to power the agents. For example, Mixtral could power the generalist agents, DeepSeek-Coder the code-generating agents, and White Rabbit Neo the testing/cybersecurity agents. This way, only one LLM is active at a time, depending on the active agent, and multiple niche-specific open LLMs could collaborate to outperform proprietary LLMs like GPT-4 while running locally on consumer-grade hardware.
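
A rough, illustrative sketch of what such routing could look like, as a plain Python mapping with ollama-style model names; this is not OpenDevin's actual configuration schema, and the role names are placeholders:

```python
# Hypothetical routing table: which locally served model powers which agent
# role. Model names assume an ollama-style local setup and are placeholders.
AGENT_MODEL_MAP = {
    "planner":  "ollama/mixtral",          # generalist reasoning
    "coder":    "ollama/deepseek-coder",   # code generation
    "security": "ollama/white-rabbit-neo", # testing / cybersecurity
}

def model_for(agent_role: str) -> str:
    # Only the active agent's model needs to be loaded at any one time.
    return AGENT_MODEL_MAP.get(agent_role, "ollama/mixtral")
```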

rezzie-rich avatar Apr 14 '24 16:04 rezzie-rich

I think the models need to be "self-prompting"

From my experience with OpenDevin, it often gets close to doing what I want, but then falls short of the goal and either starts repeating the same command or does something random.

It would be interesting to use two distinct prompting strategies so that the model effectively has a conversation with itself. The first prompt would ask the model to look at its previous actions and the goal and come up with a plan for the next action it could take. The second prompt would then have the agent perform an action based on the thoughts produced by the first response.

I think this would give the agent more flexibility and let it guide itself toward a better in-context solution than any static prompt template can. The downside is that you need two model queries per action instead of one.
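
A minimal sketch of that two-pass loop, assuming a local model served behind an OpenAI-compatible endpoint (e.g. via ollama or vLLM); the prompt templates and model name are illustrative, not OpenDevin's actual ones:

```python
# Minimal plan-then-act loop: pass 1 asks the model to reflect and plan,
# pass 2 asks it to emit a concrete action based on that plan.
from openai import OpenAI

# Assumes a local OpenAI-compatible server (e.g. `ollama serve`) at this URL.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="none")
MODEL = "mixtral"  # placeholder model name

def complete(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def next_action(goal: str, history: list[str]) -> str:
    # Pass 1: look at the goal and previous actions, propose a plan.
    plan = complete(
        f"Goal: {goal}\nPrevious actions:\n" + "\n".join(history) +
        "\nThink step by step and propose the single next step."
    )
    # Pass 2: turn that plan into one concrete action to execute.
    action = complete(
        f"Goal: {goal}\nPlan: {plan}\n"
        "Output exactly one shell command or action to take next, and nothing else."
    )
    history.append(action)
    return action
```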

Also, Microsoft just released WizardLM 2, and it is far better than anything I have tried locally so far.

JayQuimby avatar Apr 17 '24 13:04 JayQuimby

gpt-pilot is quite good at this. Try it out to get an idea. I think there are planner and reviewer agents for each step.

I kind of wish OpenDevin incorporated gpt-pilot as its engine.

chrisbraddock avatar Apr 17 '24 15:04 chrisbraddock

A nice way to improve open-source LLMs is to fine-tune them on trajectories from stronger models like GPT-4. Bonus points if we can filter out the bad ones.

One way to achieve this at scale, similar to WildChat, is to provide officially hosted OpenDevin interfaces that come with a free GPT-4-based backend. In exchange for free use of these agents, users would sign up to allow open distribution of the data and rate the quality of the agents' performance for us.

I imagine this could be used to:

  1. Obtain diverse, high-quality trajectories to fine-tune open agents.
  2. Serve as an easy-to-start demo to attract more users.
  3. Potentially use the human preference data to create a Chatbot Arena equivalent for coding agents.
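
For point 1, a rough sketch of what trajectory filtering and conversion might look like; the JSONL field names (`resolved`, `user_rating`, `messages`) are assumptions about how such logs could be stored, not an existing schema:

```python
# Keep only trajectories that succeeded and that users rated highly,
# and flatten them into an SFT-style JSONL file.
import json

def filter_trajectories(in_path: str, out_path: str, min_rating: int = 4) -> int:
    kept = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            traj = json.loads(line)
            # Drop trajectories that failed or were rated poorly by the user.
            if not traj.get("resolved") or traj.get("user_rating", 0) < min_rating:
                continue
            fout.write(json.dumps({"messages": traj["messages"]}) + "\n")
            kept += 1
    return kept
```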

Jiayi-Pan avatar May 06 '24 20:05 Jiayi-Pan

Thanks @Jiayi-Pan!! All of the bullet points mentioned are actually on our roadmap :))

xingyaoww avatar May 06 '24 20:05 xingyaoww

> Thanks @Jiayi-Pan!! All of the bullet points mentioned are actually on our roadmap :))

Amazing and thanks for the pointer! I will have a look and see what I can contribute

Jiayi-Pan avatar May 06 '24 20:05 Jiayi-Pan

@Jiayi-Pan We are currently thinking about re-purposing existing agent-tuning datasets (e.g., code and agent-tuning data) for (1), so we can have a preliminary v0.1 OSS model :)

xingyaoww avatar May 06 '24 20:05 xingyaoww

Also, does this feel like a technical foundation for building fine-tuning toolkits by generating quasi-synthetic data?

BradKML avatar Jun 03 '24 08:06 BradKML

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Sep 02 '24 01:09 github-actions[bot]

We're still working on this!

neubig avatar Sep 02 '24 20:09 neubig

Hey @neubig, sorry for the late reply. I've been a bit busy these days. I was working on a small version, but I hit some resource limitations, so I haven't made much progress.

dorbanianas avatar Sep 02 '24 22:09 dorbanianas