Limit retries of side effects

Open slinkydeveloper opened this issue 1 year ago • 3 comments

We need a mechanism to override the invoker retry policy on a per-invocation basis from within the SDK. This is required to prevent infinite loops of side effect retries.

slinkydeveloper avatar Apr 23 '24 12:04 slinkydeveloper

Another solution could be to pass the retry count to the side effect closure.

slinkydeveloper avatar Jun 10 '24 13:06 slinkydeveloper

I've put some thought into this, and I have a more or less concrete proposal:

  • On StartMessage we send a retry_attempt. The invoker tracks this retry_attempt count in memory (meaning it's eventually consistent) and resets it on each new entry, so it will be >= 1 only if the invoker invokes more than once with the same journal.
  • On ErrorMessage we add a new optional field to specify the interval before retrying.
  • With those two fields, the SDK can let users set a retry policy (or even create a custom one) on side effects. When a side effect fails, we either write the ErrorMessage with the interval before the next retry, or we record and throw a terminal exception once the retry attempts are exhausted.

This solution requires implementing those retry policies in every SDK, but that should take only a few lines of code, and it allows users to configure custom policies, or perhaps even hook in existing libraries (e.g. resilience4j). Plus, it doesn't force the definition of retry policies into the protocol.
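To illustrate how small the SDK-side piece could be, here is a minimal sketch in Python. It assumes the SDK receives retry_attempt from StartMessage (0 on the first invocation with a given journal) and must decide between retrying with a delay and failing terminally. The names ExponentialRetryPolicy, next_delay and on_side_effect_failure are hypothetical and do not come from any existing SDK.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class ExponentialRetryPolicy:
    """Hypothetical SDK-side policy, for illustration only."""

    initial_interval: timedelta
    factor: float
    max_retries: int  # retries allowed after the first attempt

    def next_delay(self, retry_attempt: int) -> Optional[timedelta]:
        """Return the delay before the next retry, or None when exhausted.

        retry_attempt is assumed to be the value sent in StartMessage:
        0 on the first invocation with a given journal, then incremented
        by the invoker on each retry.
        """
        if retry_attempt >= self.max_retries:
            return None
        return self.initial_interval * (self.factor ** retry_attempt)


def on_side_effect_failure(
    policy: ExponentialRetryPolicy, retry_attempt: int
) -> Tuple[str, Optional[timedelta]]:
    """Decide what the SDK does after the side effect closure failed.

    "retry" stands for writing an ErrorMessage carrying the retry interval;
    "terminal" stands for recording and throwing a terminal exception.
    """
    delay = policy.next_delay(retry_attempt)
    if delay is None:
        return ("terminal", None)
    return ("retry", delay)
```

A custom policy would only need to provide its own next_delay, which is also where an existing retry library could be plugged in.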

The caveat of this solution is that the effective retry count might be higher than the one the user provides. This can happen in a number of situations, e.g. leader election in a distributed setup or Restate crashes. However, this should be acceptable: side effects are already at-least-once, so many use cases will be fine with it. If users want stricter guarantees, they can build their own solution by recording every run attempt in the journal (although even that won't fully guarantee that the retry count is exactly the one they expect).
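The caveat can be made concrete with a toy simulation. This is only a sketch under simplifying assumptions: the side effect always fails, the counter lives purely in memory, and a crash or leader change simply resets it to 0. InMemoryInvoker and run_until_terminal are hypothetical names, not runtime code.

```python
from typing import Set


class InMemoryInvoker:
    """Toy model of the invoker-side retry_attempt counter."""

    def __init__(self) -> None:
        self.retry_attempt = 0

    def crash(self) -> None:
        # The counter is kept only in memory, so a crash or leader
        # change loses it and the next invocation starts from 0 again.
        self.retry_attempt = 0


def run_until_terminal(max_retries: int, crash_on_executions: Set[int]) -> int:
    """Count how often an always-failing side effect runs before the
    SDK gives up with a terminal exception."""
    invoker = InMemoryInvoker()
    executions = 0
    while True:
        executions += 1  # the side effect closure runs and fails
        if executions in crash_on_executions:
            invoker.crash()
            continue
        if invoker.retry_attempt >= max_retries:
            return executions  # retries exhausted: terminal exception
        invoker.retry_attempt += 1  # invoker schedules another retry


# Without crashes: 1 initial attempt + 3 retries.
print(run_until_terminal(3, set()))  # prints 4
# A crash during the third execution resets the counter.
print(run_until_terminal(3, {3}))    # prints 7
```

With max_retries set to 3, a single reset is enough for the side effect to run 7 times instead of 4, which is the sense in which the effective retry count can exceed the configured one.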

slinkydeveloper avatar Jun 14 '24 12:06 slinkydeveloper

I like the proposal and the simplicity of the building blocks on the server side.

tillrohrmann avatar Jun 14 '24 13:06 tillrohrmann

The runtime side is now implemented.

slinkydeveloper avatar Aug 28 '24 07:08 slinkydeveloper

Closing this now; I've opened follow-ups on the specific SDKs.

slinkydeveloper avatar Sep 03 '24 12:09 slinkydeveloper