OpenHands [Feature]: Retry on failure functionality

What problem or use case are you trying to solve?

Sometimes models fail to do their job correctly, and we would benefit from starting all over from the beginning. There are a few examples of this in the agent literature:

- Aider recently introduced a harness for testing on SWE-bench that allows for retries when tests and linting don't pass.
- @Jiayi-Pan has work on Evaluation and Refinement for web agents that uses a reward model to judge when a web task has failed, a reset mechanism to return to the beginning, and a method for improving the prompts based on reflexion.
- @niansong1996 has a method, LEVER, that uses a learned verifier to rerank code generation execution results.

Describe the UX of the solution you'd like

Ideally, this would be something that could be implemented in a general way, so that we could implement different strategies with a shared interface. For instance:

from abc import ABC, abstractmethod


class ResetStrategy(ABC):
    # Argument lists are left open (*args, **kwargs); the proposal does not fix them.

    @abstractmethod
    def initialize_state(self):
        """Take note of the initial state that should be reset to."""
        ...

    @abstractmethod
    def verify(self, *args, **kwargs):
        """This verifies whether the agent has reached a failure state."""
        ...

    @abstractmethod
    def reset(self, *args, **kwargs):
        """This performs some sort of reset."""
        ...

    @abstractmethod
    def message_on_reset(self, *args, **kwargs):
        """This creates a message to the agent upon reset (e.g. a task with a prompt based on reflexion)."""
        ...

Then, when using OpenDevin, we could choose an option that says "retry N times when you get stuck", and select the strategy that is used to do so.
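
As a rough illustration of the "retry N times" option, here is a minimal sketch of a driver loop built on the interface above. Everything beyond the four method names is an assumption made for the example: the `run_with_retries` function, the `run_agent` callable, the string-typed results, and the convention that `verify` returns True when the attempt succeeded. It is not an existing OpenDevin API.

```python
from abc import ABC, abstractmethod
from typing import Callable, Optional, Tuple


class ResetStrategy(ABC):
    """Interface from the proposal; the argument lists are assumptions."""

    @abstractmethod
    def initialize_state(self) -> None: ...

    @abstractmethod
    def verify(self, result: str) -> bool:
        """Assumed convention: return True when the attempt succeeded."""

    @abstractmethod
    def reset(self) -> None: ...

    @abstractmethod
    def message_on_reset(self, result: str) -> Optional[str]: ...


def run_with_retries(
    run_agent: Callable[[str], str],  # hypothetical: takes a prompt, returns a result
    task: str,
    strategy: ResetStrategy,
    max_retries: int,
) -> Tuple[str, int]:
    """Run the agent once, then reset and retry up to max_retries times.

    Returns the last result and the total number of attempts made.
    """
    strategy.initialize_state()
    result = run_agent(task)
    attempts = 1
    while not strategy.verify(result) and attempts <= max_retries:
        # Build the follow-up message before resetting, in case the
        # strategy needs to inspect the failed state.
        hint = strategy.message_on_reset(result)
        strategy.reset()
        prompt = task if hint is None else f"{task}\n\n{hint}"
        result = run_agent(prompt)
        attempts += 1
    return result, attempts
```

With this shape, the choice of strategy and the value of N are the only things the user has to configure; the loop itself stays the same across strategies.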

Do you have thoughts on the technical implementation?

The actual reset strategies would vary based on the task. For instance:

AiderResetStrategy (code reference):
- initialize_state: save the current git commit of the repository as commit_id
- verify: tests and linting pass
- reset: git checkout commit_id
- message_on_reset: no-op

EvalRefineResetStrategy (code reference):
- initialize_state: save the current web page as initial_page
- verify: the reward model is positive
- reset: goto(initial_page)
- message_on_reset: reflexion prompt
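
To make the git-based outline more concrete, here is a minimal sketch of what such a strategy might look like. It follows the bullet points above, not Aider's actual code. The class name `GitCheckoutResetStrategy`, the `check_commands` parameter (the test and lint commands to run), and the injectable `runner` are all assumptions introduced for the example. Note that a plain `git checkout` does not discard uncommitted edits, so a real implementation would likely need a stronger reset.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def _run(cmd: Sequence[str]) -> subprocess.CompletedProcess:
    """Default runner: execute a command and capture its output."""
    return subprocess.run(list(cmd), capture_output=True, text=True)


class GitCheckoutResetStrategy:
    """Hypothetical strategy modeled on the AiderResetStrategy outline."""

    def __init__(
        self,
        check_commands: List[List[str]],  # assumed: e.g. test and lint commands
        runner: Callable[[Sequence[str]], subprocess.CompletedProcess] = _run,
    ) -> None:
        self.check_commands = check_commands
        self.runner = runner
        self.commit_id: Optional[str] = None

    def initialize_state(self) -> None:
        # Save the current commit so that reset() can return to it.
        self.commit_id = self.runner(["git", "rev-parse", "HEAD"]).stdout.strip()

    def verify(self) -> bool:
        # Success means every check command exits with status 0.
        return all(self.runner(cmd).returncode == 0 for cmd in self.check_commands)

    def reset(self) -> None:
        if self.commit_id is None:
            raise RuntimeError("initialize_state() must be called before reset()")
        self.runner(["git", "checkout", self.commit_id])

    def message_on_reset(self) -> Optional[str]:
        # No-op, as in the outline: the agent gets no extra message.
        return None
```

Passing the runner in as a parameter keeps the sketch testable without a real repository; an EvalRefine-style strategy could follow the same shape with a page navigation call in `reset` and a reflexion prompt in `message_on_reset`.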

This could be integrated into OpenDevin, allowing for retries in the main app as well.

Additional context:

Aider superissue
Thanks to @xingyaoww and @frankxu2004 for offline discussion

Jun 03 '24 13:06 neubig

Thanks for creating the issue! Although I don’t have much spare bandwidth recently, I am definitely interested in bringing EvalRefineResetStrategy and the retry functionality into OpenDevin. I will keep an eye on this PR and contribute once I have the time.

Jun 06 '24 05:06 Jiayi-Pan

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Aug 10 '24 01:08 github-actions[bot]

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Sep 13 '24 01:09 github-actions[bot]

Hi! Is anyone working on this?

Oct 05 '24 16:10 Vaishakh-SM

Hey @Vaishakh-SM, I think nobody is working on this, but @xingyaoww was thinking about adding multiple runs to evaluation. I think that would be a parallel effort though, because it would involve running multiple times and picking the best one, as opposed to restarting when the first try didn't work.

If you'd be interested in taking a look it'd be welcome!

Oct 05 '24 22:10 neubig
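
The distinction drawn in the comment above, between running several times and picking the best result versus restarting only when an attempt fails, can be sketched as two small functions. This is an illustration only: `best_of_n`, `retry_until_verified`, and the `run_agent`, `score`, and `verify` callables are hypothetical names, not part of OpenHands.

```python
from typing import Callable, Optional


def best_of_n(
    run_agent: Callable[[], str],
    score: Callable[[str], float],
    n: int,
) -> str:
    """Always run n attempts, then keep the highest-scoring result."""
    results = [run_agent() for _ in range(n)]
    return max(results, key=score)


def retry_until_verified(
    run_agent: Callable[[], str],
    verify: Callable[[str], bool],
    max_attempts: int,
) -> Optional[str]:
    """Stop at the first attempt that passes; later attempts never run."""
    for _ in range(max_attempts):
        result = run_agent()
        if verify(result):
            return result
    return None  # every attempt failed verification
```

The first always pays for n runs and needs a scorer that can rank results; the second pays only for as many runs as it takes and needs only a pass/fail check.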

This seems like an interesting problem!

I'll take a look and get back to this sometime this week.

Oct 08 '24 05:10 Vaishakh-SM

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Nov 11 '24 01:11 github-actions[bot]

@neubig this is a really old issue. Just want to make sure, we haven't implemented this yet, right?

Dec 05 '24 16:12 mamoodi

Yep, @xingyaoww is working on a critic that could help implement this.

Dec 05 '24 16:12 neubig

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Jan 27 '25 01:01 github-actions[bot]

Having used OpenHands for a whole project, I can say that a general retry would be nice. I've seen several failures where things could not be replaced or, far more often, could not be generated because of the models' rate limits. The bigger your codebase gets, the more tokens are used and the more often you hit the rate limits. This led to several cases where files had been deleted but couldn't be generated again. Happy to share more insights.

Feb 25 '25 18:02 manzke
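
For the rate-limit failures described in the comment above, one common mitigation is to retry the failed model call with exponential backoff. The following is a minimal sketch of that general technique, not of anything OpenHands currently does. `RateLimitError` is a hypothetical stand-in for whatever exception the model client raises, and `call_with_backoff` and its parameters are assumptions for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the model client's rate-limit error."""


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    jitter: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on RateLimitError, doubling the wait after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            # A little random jitter avoids many clients retrying in lockstep.
            sleep(delay + random.uniform(0, delay * jitter))
    raise AssertionError("unreachable")
```

A wrapper like this only covers transient errors on a single call. It would not by itself restore files deleted before the failure, which is where a reset strategy would come in.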

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Mar 28 '25 02:03 github-actions[bot]

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Jun 07 '25 02:06 github-actions[bot]

We're still working on this

Jun 07 '25 14:06 neubig

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Jul 08 '25 02:07 github-actions[bot]

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Aug 08 '25 02:08 github-actions[bot]

Is this still being actively worked on?

Aug 25 '25 12:08 mamoodi

Yes, it's still part of the critic that @xingyaoww is working on

Aug 25 '25 16:08 neubig