specification Add multi retryRef to action

What would you like to be added:

In current version, there is only one retry strategy can be assign to action:

"actions": [
            {
              "functionRef": "StorePatient",
              "retryRef": "ServicesNotAvailableRetryStrategy",
              "retryableErrors": ["ServiceNotAvailable"]
            }]

 "retries": [
    {
      "name": "ServicesNotAvailableRetryStrategy",
      "delay": "PT3S",
      "maxAttempts": 10
    }
  ]

If a action can reference multi retry strategy it will be useful.

Why is this needed:

amazon states language support multi retry strategy https://states-language.net/spec.html#errors
If the action failed, user may want use different retry strategy, for example:
- delay 1s if the error is short-term（ex: FlowLimit）
- delay 5s if the error is long-term（ex: database is not available）

Sep 15 '22 09:09 lsytj0413

This is interesting. I think it can be useful to add the retriable error in the retry definition as Amazon States Language does, or it will be difficult to understand the mapping of two arrays. Technically, it will be the same, but it will improve understanding.

Can be something like this:

{
   "actions":[
      {
         "functionRef":"StorePatient",
         "retryRefs":[
            "ServicesNotAvailableRetryStrategy",
            "NetworkIssue"
         ]
      }
   ],
   "retries":[
      {
         "name":"ServicesNotAvailableRetryStrategy",
         "delay":"PT3S",
         "maxAttempts":10,
         "errors":[
            "ServiceNotAvailable"
         ]
      },
      {
         "name":"NetworkIssue",
         "delay":"PT10S",
         "maxAttempts":3,
         "errors":[
            "ConnectionFailed"
         ]
      }
   ]
}

Sep 15 '22 18:09 ricardozanini

Currently，the retryableErrors is in Action's definition.

Sep 16 '22 02:09 lsytj0413

I think this would be nice feature to add (even tho possibly very hard to implement). Also if I understand it right it targets idempotent actions where the first error returned determines the retry policy for the remainder of retries, and is not a dynamic retry policy recalculated based on each retry and errors that happened so far (more on that below).

Looking at the mentioned AWS samples:

"Retry" : [
    {
      "ErrorEquals": [ "States.Timeout" ],
      "MaxAttempts": 0
    },
    {
      "ErrorEquals": [ "States.ALL" ]
    }
]

I think this can be implemented already by setting States.Timeout as a non-retryable error.

"Retry": [
    {
      "ErrorEquals": [ "ErrorA", "ErrorB" ],
      "IntervalSeconds": 1,
      "BackoffRate": 2,
      "MaxAttempts": 2
    },
    {
      "ErrorEquals": [ "ErrorC" ],
      "IntervalSeconds": 5
    }
  ]

This is what we are talking about I think. I'd like to know more on how they are handling this when different errors have different maxAttempt values. Think this can run into a number of edge cases that would need to be clearly described. Don't think they describe it in the linked doc but this i think would require runtimes to deal with X number of retry policies depending on error type, and in case on non-idempotent actions idk what would do if lets say action throws a different error type on each retry..things can get pretty complicated imo. Maybe what they say is that the retry definition is set on first action error, and then is picked for the duration of all action retries. Then I think this would be much easier to deal with, but if you have to recalculate all retry policies on each retry based on error, this is daunting task imo. This however then dictates idempotent actions which we would need to make a "must" in the spec, need to think about it.

Sep 16 '22 02:09 tsurdilo

Can you give more example on not useful for idempotent actions?

I think this would be nice feature to add (even tho possibly very hard to implement). Also I think would not be very useful for idempotent actions.

imo the specification should define how to handing maxAttempt when different errors, like use the value based on error. For example:

we have an action named CreateDisk, and the action maybe failed with FlowLimit，InsufficientResource
When error is InsufficientResource, we only retry for 3 times
When error is FlowLimit, we retry for 30-times

Sep 16 '22 02:09 lsytj0413

Yes I updated some text in my answer above (please look again).

The retry with this is tied to an error type. In this case we can either assume that if the first invocation raises FlowLimit its retried up to 30 times regardless of errors raised by consequent retries. Or you would re-calculate the retry policy if the second retry throws InsufficientResource.

I think this is doable iff we go with the first assumption (which i think is what aws does too but not sure, let me know if you know). wdyt.

Sep 16 '22 02:09 tsurdilo

I don't known how AWS does it, but in my company's workflow engine（base on ASL），it will use the value based on error and record the retry history. For example: the CreateDisk may retry upon 30+ times when error is mixed with FlowLimit & InsufficientResource.

I think this is doable iff we go with the first assumption (which i think is what aws does too but not sure, let me know if you know). wdyt.

Sep 16 '22 03:09 lsytj0413

Just to add as a general info thing, imo it is an anti-pattern to limit retries via MaxAttempt (outside test scenarios). For most use cases you should limit your overall action retries using state timeout (thats what its designed for at least imo :) )

Sep 16 '22 03:09 tsurdilo

Just to add as a general info thing, imo it is an anti-pattern to limit retries via MaxAttempt (outside test scenarios). For most use cases you should limit your overall action retries using state timeout (thats what its designed for at least imo :) )

Yep，but MaxAttempt is useful when we want fast-to-fail if we support multi retry strategy.

Sep 16 '22 03:09 lsytj0413

but MaxAttempt is useful when we want fast-to-fail if we support multi retry strategy.

I think this really depends on the action timeout (single action execution limit). If you set the action timeout to a large value, and your action is "stuck" for minutes or hours you cannot fail fast even with just 2 retries.

The best way to deal with your use cases is to have your actions "heartbeat" meaning report progress and have a heartbeat timeout set to a very small value within which the action has the report its progress/heartbeat. That way if you detect this heartbeat timeout you can fail the action and do your retry fast.

This however idk if we can impl in our DSL as not sure we can enforce heartbeating to all runtimes.

Sep 16 '22 03:09 tsurdilo

Overall +1 on adding this, just would like to maybe get one or two examples on how this would look like if its done via the mentioned dynamic retry policy calculations on each retry. Thanks!

Sep 16 '22 03:09 tsurdilo

I think this really depends on the action timeout (single action execution limit). If you set the action timeout to a large value, and your action is "stuck" for minutes or hours you cannot fail fast even with just 2 retries.

Base on the exmaple, we want 3 MaxAttempts and Interval 5s when InsufficientResource，30 MaxAttempts and Interval 1s when FlowLimit，we can only set the ActionTimeout to 30s=max(15s, 30s)，but if the only error is InsufficientResource we can stuck only 15s（less than 30s）.

Sep 16 '22 03:09 lsytj0413

@tsurdilo What's the next step should i do for this？is an PR ok？

Sep 23 '22 02:09 lsytj0413

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Nov 08 '22 00:11 github-actions[bot]

/remove-stale

Nov 08 '22 01:11 lsytj0413

/remove stale

Dec 11 '22 12:12 lsytj0413

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Jan 26 '23 00:01 github-actions[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Mar 14 '23 00:03 github-actions[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Jun 12 '23 00:06 github-actions[bot]

Closed as resolved by 1.0.0-alpha1, and therefore as part of #843

May 17 '24 08:05 cdavernas