[Feature]: Retry on failure functionality
What problem or use case are you trying to solve?
Sometimes models fail to do their job correctly, and we would benefit from starting all over from the beginning. There are a few examples of this in the agent literature:
- Aider recently introduced a harness for SWE-bench that allows for retries when tests and linting don't pass.
- @Jiayi-Pan has work on Evaluation and Refinement for web agents that uses a reward model to judge when a web task has failed, a reset mechanism to return to the beginning, and a method for improving the prompts based on reflexion.
- @niansong1996 has a method LEVER that uses a learned verifier to rerank code generation execution results.
Describe the UX of the solution you'd like
Ideally, this would be something that could be implemented in a general way, so that we could implement different strategies with a shared interface. For instance:
from abc import ABC, abstractmethod

class ResetStrategy(ABC):
    @abstractmethod
    def initialize_state(self):
        """Take note of the initial state that should be reset to."""
        ...

    @abstractmethod
    def verify(self, ...):
        """This verifies whether the agent has reached a failure state."""
        ...

    @abstractmethod
    def reset(self, ...):
        """This performs some sort of reset."""
        ...

    @abstractmethod
    def message_on_reset(self, ...):
        """This creates a message to the agent upon reset (e.g. a task with a prompt based on reflexion)."""
        ...
Then, when using OpenDevin, we could choose an option that says "retry N times when you get stuck", and select the strategy that is used to do so.
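A minimal sketch of what the "retry N times" loop could look like on top of this interface. Everything here is an assumption for illustration, not existing OpenDevin code: `run_with_retries`, the `run_agent` callable, and the concrete method signatures are hypothetical. In particular, this sketch assumes `verify()` returns `True` when the attempt succeeded.

```python
from abc import ABC, abstractmethod
from typing import Callable, Optional


class ResetStrategy(ABC):
    """Hypothetical concrete signatures for the interface proposed above."""

    @abstractmethod
    def initialize_state(self) -> None:
        """Record the state that later resets should return to."""

    @abstractmethod
    def verify(self) -> bool:
        """Assumption: return True if the attempt succeeded, False if it failed."""

    @abstractmethod
    def reset(self) -> None:
        """Return the environment to the recorded initial state."""

    @abstractmethod
    def message_on_reset(self) -> Optional[str]:
        """Optional message for the agent's next attempt (None for a no-op)."""


def run_with_retries(
    run_agent: Callable[[Optional[str]], None],  # hypothetical agent entry point
    strategy: ResetStrategy,
    max_retries: int,
) -> bool:
    """Run the agent once, then retry up to max_retries times on failure."""
    strategy.initialize_state()
    message: Optional[str] = None
    for attempt in range(max_retries + 1):
        run_agent(message)
        if strategy.verify():
            return True
        if attempt < max_retries:
            # Only reset if another attempt will actually follow.
            strategy.reset()
            message = strategy.message_on_reset()
    return False
```

With this shape, swapping strategies only means passing a different `ResetStrategy` instance; the loop itself stays the same.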
Do you have thoughts on the technical implementation?
The actual reset strategies would vary based on the task. For instance:
- AiderResetStrategy (code reference):
  - initialize: save the current git commit of the repository as commit_id
  - verify: tests + linting pass
  - reset: git checkout commit_id
  - message_on_failure: no-op
- EvalRefineResetStrategy (code reference):
  - initialize: save the current web page as initial_page
  - verify: the reward model is positive
  - reset: goto(initial_page)
  - message_on_failure: reflexion prompt
This could be integrated into OpenDevin itself, allowing for retries in the main app as well.
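As a rough illustration of the git-based variant, here is a sketch that follows the four steps listed for AiderResetStrategy. It is not Aider's actual code: the class name `GitResetStrategy`, the `repo_dir` and `test_cmd` parameters, and the choice of `git checkout --force` (to discard uncommitted edits) are assumptions, and `verify()` is assumed to return `True` when tests and linting pass.

```python
import subprocess
from typing import List, Optional


class GitResetStrategy:
    """Illustrative git-based reset strategy (hypothetical, not Aider's code)."""

    def __init__(self, repo_dir: str, test_cmd: List[str]):
        self.repo_dir = repo_dir
        self.test_cmd = test_cmd  # hypothetical, e.g. ["pytest", "-q"]
        self.commit_id: Optional[str] = None

    def _git(self, *args: str) -> str:
        # Run a git command inside the repository and return its stdout.
        result = subprocess.run(
            ["git", *args],
            cwd=self.repo_dir,
            check=True,
            capture_output=True,
            text=True,
        )
        return result.stdout.strip()

    def initialize_state(self) -> None:
        # Save the commit that later resets should return to.
        self.commit_id = self._git("rev-parse", "HEAD")

    def verify(self) -> bool:
        # Assumption: exit code 0 from the test/lint command means success.
        return subprocess.run(self.test_cmd, cwd=self.repo_dir).returncode == 0

    def reset(self) -> None:
        if self.commit_id is None:
            raise RuntimeError("initialize_state() must be called before reset()")
        # --force throws away local changes to tracked files; untracked
        # files are left in place and HEAD ends up detached.
        self._git("checkout", "--force", self.commit_id)

    def message_on_reset(self) -> Optional[str]:
        # No-op: the agent simply starts over without extra guidance.
        return None
```

A web-based strategy would have the same four methods, with the saved page and a reward model taking the place of the commit and the test command.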
Additional context:
- Aider superissue
- Thanks to @xingyaoww and @frankxu2004 for offline discussion
Thanks for creating the issue! Although I don’t have much spare bandwidth at the moment, I am definitely interested in bringing EvalRefineResetStrategy and the retry functionality into OpenDevin. I will keep an eye on this issue and contribute once I have the time.
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Hi! Is anyone working on this?
Hey @Vaishakh-SM, I think nobody is working on this, but @xingyaoww was thinking about adding multiple runs to evaluation. I think that would be a parallel effort though, because it would involve running multiple times and picking the best result, as opposed to restarting when the first try didn't work.
If you'd be interested in taking a look it'd be welcome!
This seems like an interesting problem!
I'll take a look and get back to this sometime this week.
@neubig this is a really old issue. Just want to make sure, we haven't implemented this yet, right?
Yep, @xingyaoww is working on a critic that could help implement this.
Having used OpenHands for a whole project now, I can say that a retry in general would be nice. I've seen several failures where content could not be replaced or, far more often, where it could not be generated because of the models' rate limits. The bigger your codebase gets, the more tokens are used and the more often you hit the rate limits. This led to several cases where files had been deleted but couldn't be generated again. Happy to share more insights.
We're still working on this
Is this still being actively worked on?
Yes, it's still part of the critic that @xingyaoww is working on