go-resiliency feature request: retrier: reset number of retries/backoff time

Imagine that I have an application that is an event processor that connects to some event log and for every event that arrives I want to do something with it. Also imagine that I have a Retrier with exponential backoff (100ms initial backoff time) and 5 retries.

We can get errors because of network issues (server not ready to accept connections, server disconnected for some reason like server LoadBalancer downscaling instances) or errors processing events.

If the app fails 5 times to process an event the retry loop should end and return error. This is good. If my app fails to connect 5 times in a row the same should happen.

But if my app connects to the server, happily processes messages for 2hours then gets a server disconnection. It will retry the connection and (if successful) continue processing events. In the next few hours we receive 4 more server disconnections, but after each one we're able to reconnect and continue processing messages for some time. It does not make sense to say "we've failed 5 times so let's just finish here and return an error".

So basically the Retrier should expose a Reset() method which would cause the retrier to "forget" that it encountered errors before so that the next time that it encounters an error, instead of sleeping for a long time due to exponential backoff it will just sleep for the initial 100ms (as per the example) and it could tolerate 5 more errors later.

Here's a very simple skeleton of code that shows how this feature could be used:

r := retrier.New(retrier.ExponentialBackoff(5, 100 * time.Milliseconds), nil)
err := r.Run(func() error {
  server, err := connect(address)
  if err != nil {
    return err
  }

  for {
    event, err := server.getNextEvent()
    if err != nil {
      return err
    }
    err = processEvent(event)
    if err != nil {
      return err
    }
    // Because we've been able to process an event it seems that all is working correctly
    // so it doesn't matter that in the past we had (for example 2 errors), the next time we
    // see an error should be like it was the first error we've seen, and the backoff time to
    // to wait until next try should be the initial one passed by the user.
    r.Reset()
  }
}

Jun 20 '22 22:06 kampde

Perhaps I am missing something about the use case, but wouldn’t it be simpler to move the retrier to wrap the logic inside the for {}? That accomplishes the same thing without the need for the manual Reset call.

Jun 21 '22 01:06 eapache

You mean:

server, err := connect(address)
if err != nil {
  return err
}

for {
  r := retrier.New(retrier.ExponentialBackoff(5, 100 * time.Milliseconds), nil)
  err := r.Run(func() error {
    event, err := server.getNextEvent()
    if err != nil {
      return err
    }
    err = processEvent(event)
    if err != nil {
      return err
    }
  }
}

This won't work because if there's a disconnection we'll need to connect again, but that part is out of the retrier. And still if there were many disconnections the retrier in the end would give up.

Jun 21 '22 05:06 kampde

If you are cool with this feature I can send a PR.

Jun 21 '22 09:06 kampde

And still if there were many disconnections the retrier in the end would give up.

Each new call to Run resets the retry count, so the sample you posted would not give up as long as there was at least once success before five consecutive failures. But the point about retrying the connect call is important.

If you are cool with this feature I can send a PR.

OK

Jun 21 '22 11:06 eapache

This won't work because if there's a disconnection we'll need to connect again, but that part is out of the retrier. And still if there were many disconnections the retrier in the end would give up.

Can you try something like this?

func execute() error {
	server, err := getConn()
	if err != nil {
		return err
	}
	for err == nil {
		if err := ConsumeEvents(server); err == ErrConn {
			server, err = getConn()
		}
	}
	return err
}

func ConsumeEvents(server server) error {
	r := retrier.New(retrier.ExponentialBackoff(5, 100*time.Millisecond), retrier.BlacklistClassifier{ErrConn})
	for {
		err := r.Run(
			func() error {
				event, err := server.getNextEvent()
				if err != nil {
					return err
				}
				return processEvent(event)
			},
		)
		if err == ErrConn {
			return err
		}
	}
}

You can notice I am using BlacklistClassifier, so while getting/processing events, if it encounters ConnectionErr, the retrier will classify the error as fail as we blacklisted that error. so we can check if error returned by the retrier is a connection error. If it is we can return the error as it is and the consume event func will obtain a new connection and the process goes on.

Jun 29 '22 17:06 AyushSenapati

Closing due to lack of activity. I'm still willing to consider cases and proposals like this, but in hindsight I think the Retrier is not the right pattern in the first place when there's no real "Success" state. Maybe moving the Retrier instance to just wrap the connect call (and have everything else outside of the block) would have been best.

Jan 13 '24 15:01 eapache

go-resiliency go-resiliency copied to clipboard

feature request: retrier: reset number of retries/backoff time

go-resiliency
go-resiliency copied to clipboard