[Feature]: Weekly evaluations for performance tracking
What problem or use case are you trying to solve?
We would like to run evaluations of a few agents in OpenDevin approximately weekly in order to:
- Make sure there have not been regressions in our end-to-end performance
- Track progress in improving agents
Describe the UX of the solution you'd like
It would be good to have:
- [ ] a cron-job style eval runner that runs evaluations at the same time every week.
- [ ] the results pushed to some public place; a Google Sheet would be a good start (a sketch of the push step follows this list)
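For the Google Sheet idea, here is a minimal sketch of what the weekly push step could look like, assuming a gspread service-account credential and a sheet named "OpenDevin Weekly Evals" (both are placeholders, not anything that exists yet):

```python
# Minimal sketch: append one row of eval results to a shared Google Sheet.
# Assumes a service-account credential and a sheet named "OpenDevin Weekly Evals";
# both names are placeholders for illustration.
import datetime

import gspread


def push_results(agent: str, resolved: int, total: int, cost_usd: float) -> None:
    gc = gspread.service_account()  # reads ~/.config/gspread/service_account.json
    sheet = gc.open("OpenDevin Weekly Evals").sheet1
    sheet.append_row([
        datetime.date.today().isoformat(),
        agent,
        resolved,
        total,
        round(resolved / total, 3),
        round(cost_usd, 2),
    ])


if __name__ == "__main__":
    # Example: 12 of 100 SWE-bench Lite instances resolved for $35.
    push_results("CodeActAgent", 12, 100, 35.0)
```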
Do you have thoughts on the technical implementation?
- These evals should probably run in the cloud.
- We can use the recently implemented e2b sandbox to run the agents.
- We'll probably have to provision a paid API key of some kind to run the LLMs (a rough sketch of the weekly driver follows this list).
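And a rough sketch of how a weekly driver could tie these together when triggered by a scheduled (cron) job. The eval entry point, flags, agent names, and the LLM_API_KEY variable are all assumptions for illustration, not the actual OpenDevin interface:

```python
# Rough sketch of a weekly eval driver, meant to be invoked by a scheduled job
# (e.g. a cron trigger in CI). The script name, flags, and agent list below are
# illustrative, not the actual OpenDevin eval interface.
import os
import subprocess

AGENTS = ["CodeActAgent", "MonologueAgent"]  # placeholder agent names


def run_weekly_evals() -> None:
    api_key = os.environ["LLM_API_KEY"]  # provisioned, paid key for the LLM provider
    for agent in AGENTS:
        subprocess.run(
            [
                "python", "evaluation/run_swe_bench.py",  # hypothetical entry point
                "--agent", agent,
                "--max-instances", "100",
            ],
            env={**os.environ, "LLM_API_KEY": api_key},
            check=True,
        )
        # After each run, parse the output report and push it to the sheet
        # (see push_results in the sketch above).


if __name__ == "__main__":
    run_weekly_evals()
```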
Describe alternatives you've considered
Someone would have to keep a cron job running on their own machine, or run the evals manually. That's why e2b providing an agent sandbox that lets us run this eval would be excellent.
Additional context
- Discussion on the #swe-bench-eval channel on Slack.
- Dependent on https://github.com/OpenDevin/OpenDevin/issues/795
- Dependent on https://github.com/OpenDevin/OpenDevin/issues/90
@neubig Thank you for the write-up.
> We can use the recently implemented e2b sandbox to run the agents.
Would you like to have a separate sandbox for each item in the dataset? I suppose the item would be a commit for a given repo, right?
I'm brainstorming a bit here, but we could create a cron job (e.g., a GH Action) that just uses the correct sandbox (we'd be using custom sandbox templates) for the given commit from the bench and runs OpenDevin with it.
Hi @mlejva, honestly I'm not familiar enough with the e2b architecture to know for sure what the best design would be, but one sandbox for each repo state in SWE-Bench would probably be enough. There are several repos, and each has several different states. @libowen2121 is probably more familiar with that.
Hey @mlejva, @xingyaoww and I are currently working on integrating the SWE-Bench environment into the agent pipeline. Concurrently, we're examining various design options for this integration. Basically, we've prepared a set of testbeds (each a GitHub repo restored to a specific commit id) along with matching conda environments. Then for each instance, we plan to copy the testbed to the workspace and clone the conda environment into the sandbox (rough sketch below). Let me know if you have any suggestions.
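For concreteness, here's roughly what the per-instance setup could look like; the paths, naming scheme, and helper below are placeholders, not the actual pipeline code:

```python
# Rough sketch of per-instance setup: copy the prepared testbed (a repo checked
# out at a specific commit) into the agent workspace and clone its matching
# conda environment. Paths and naming conventions are placeholders.
import shutil
import subprocess
from pathlib import Path

TESTBED_ROOT = Path("/data/swe-bench/testbeds")   # prepared repo snapshots
WORKSPACE_ROOT = Path("/workspace")               # mounted into the sandbox


def setup_instance(instance_id: str, base_env: str) -> Path:
    # 1. Copy the testbed for this instance into the workspace.
    src = TESTBED_ROOT / instance_id
    dst = WORKSPACE_ROOT / instance_id
    shutil.copytree(src, dst, dirs_exist_ok=True)

    # 2. Clone the prepared conda environment so the agent gets a fresh copy.
    subprocess.run(
        ["conda", "create", "--name", f"env-{instance_id}", "--clone", base_env, "--yes"],
        check=True,
    )
    return dst
```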
Hey and thanks to both of you!
@libowen2121 it sounds like we could have a separate E2B sandbox for each repo at the specific commit. You spawn the sandbox using our SDK, everything is loaded, installed, and running (we can make it so all processes are already running, so it's faster), and OpenDevin can start working. Because we would have a prepared sandbox for each specific commit, we could potentially get rid of copying things into the sandbox.
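For illustration, spawning a prepared sandbox per commit could look roughly like this from Python; the template naming scheme and the exact e2b constructor/close calls are assumptions on my side, so please check our custom-sandbox docs rather than this sketch:

```python
# Illustrative sketch only: one prebuilt e2b sandbox template per repo@commit,
# so the repo and its dependencies are already in place when the sandbox starts.
# The template naming and the Sandbox(...) arguments are assumptions, not the
# verified SDK signature; see the e2b docs for the real API.
from e2b import Sandbox


def run_instance(repo: str, commit: str) -> None:
    template_id = f"swe-bench-{repo}-{commit[:7]}"  # hypothetical naming scheme
    sandbox = Sandbox(template=template_id)
    try:
        # Hand the running sandbox over to OpenDevin for this instance
        # (hypothetical integration hook):
        # opendevin.run(sandbox=sandbox, instance_id=f"{repo}@{commit}")
        pass
    finally:
        sandbox.close()
```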
@libowen2121 is there code anywhere for what you already have that I could check?
Hi @mlejva, that sounds great! I'm not familiar with E2B. Could you please share some documentation or code where I can learn more about it? We are still in active development and don't have a working demo yet, but I'll definitely keep you updated as soon as one is available!
@libowen2121 a good start, and probably most relevant for you, would be our docs on custom sandboxes - https://e2b.dev/docs/sandbox/templates/overview
Just ping me when you're ready or if you have any questions. The way I understand it, I should wait until then (let me know if not :)).
@mlejva Sure, no problem!
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This probably shouldn't be marked stale.
I have set up some scripts that make evals easy to run for me locally, and I think that's probably enough for now. Honestly, I'm not sure that we want to be running weekly at this point given the cost of running a SWE-bench evaluation -- it would probably be higher-value to run when we have significant agent improvements. So I guess we can probably close this issue for now.