components-contrib icon indicating copy to clipboard operation
components-contrib copied to clipboard

Resiliency for Components

Open halspang opened this issue 2 years ago • 6 comments

Describe the proposal

With Resiliency being introduced into the Dapr runtime in 1.7 (as a preview feature), we should now consider how Resiliency interacts with components and if there is a way to get them to utilize the runtime's resiliency.

Tenets

  • Components must be able to interact with Dapr's resiliency
  • Runtime resiliency must not weaken/remove any component resiliency without a direct replacement
  • Components must be able to onboard over time
  • Components should be able to advertise/be recognized that they have implemented resiliency

Background

Components across the various building blocks have varying levels of retries. Some components, such as the Redis statestore offer both retries and timeouts. Other components offer just retry toggling like the Kafka pubsub which can enable consumption retries. Further still, some components don't offer traditional retries but rather the service they interact with has their own semantic version of retries. A prime example of this is redelivery count for Azure Service Bus. Exhausting the redelivery count can even cause messages to enter a Dead Letter Queue, which is another avenue of resiliency.

The moral of the story here is that component level retries are vast and varied. What works for one component may not fully cover another, or worse, may break some semantic functionality.

An Important Distinction

Pubsub redelivery (and by extension message binding redelivery) was briefly touched on above. It is worth calling out directly though because it has a distinct difference in functionality as compared to standard retries. Specifically, a redelivery of a message can result in a different host. This is a benefit of the pubsub architecture in that you can be resilient to host failures. Utilizing just the runtime retries here would mean you stuck to a host which, if it was dead, would reduce the delivery rate of your messages. As such, this is something that should be strongly considered and/or maintained while interacting with runtime resiliency.

Impact on Stable Certification

Handling resiliency correctly should become a future bar for becoming a stable component. Any component that is already stable will have it added over time, similar to how we handled certification tests.

Solution

The crux of this design will be a new interface that components will be expected to implement in order to be considered "resilient". The interface would define a few methods for resilient operation:

type Resilient interface {
    // Determine if resiliency should be used in place of component retries and disable/enable as needed.
    modifyComponentBehaviorForResiliency(metadata map[string]string) error
}

By making this into an interface, it forces the implementer to consider the most important point of a resilient component which is how component retries and dapr retries.

It was considered to add a method to the interface that would translate the component errors into a permanent/retryable error which the runtime understands. This was decided against because, upon initial inspection, there are fairly few errors that are actually permanent. Generally, these are bad requests, unauthenticated (handled on init) or perhaps some very specific errors (e.g. etags). With these cases, it's easier to just wrap the error in the code with backoff.Permanent(error).

This also allows for Dapr to determine if a component is resilient or not via type inspection.

Example Component

Using Redis as an example, you would see changes like this:

// Removing the component retry/timeout logic (triggered by property field "daprResiliencyPolicyDefined").
func (r *StateStore) modifyComponentBehaviorForResiliency(metadata map[string]string) error {
	if _, ok := metadata[maxRetries]; ok {
		metadata[maxRetries] = "0"
	}

	if _, ok := metadata[maxRetryBackoff]; ok {
		metadata[maxRetryBackoff] = "-1"
	}

	if _, ok := metadata[readTimeout]; ok {
		metadata[readTimeout] = "-1"
	}

	if _, ok := metadata[writeTimeout]; ok {
		metadata[writeTimeout] = "-1"
	}

	// This is an alias for maxRetries, but it has to be checked as well.
	if _, ok := metadata[redisMaxRetries]; ok {
		metadata[redisMaxRetries] = "0"
	}

	return nil
}

// Example of returning a permanent retry.
func (r *StateStore) setValue(req *state.SetRequest) error {
	err := state.CheckRequestOptions(req.Options)
	if err != nil {
		return err
	}
	ver, err := r.parseETag(req)
	if err != nil {
		return err
	}
	ttl, err := r.parseTTL(req)
	if err != nil {
                 // We will always fail to parse a malformed request.
		return backoff.Permanent(fmt.Errorf("failed to parse ttl from metadata: %s", err))
	}
        // Remainder of function excluded.
}

Example Runtime Init

props := a.convertMetadataItemsToProperties(s.Spec.Metadata)

// Only tell the component to turn off resiliency if a policy is defined for this component.
compType := reflect.TypeOf(store)
if a.resiliency.PolicyDefined(s.Name, resiliency.Component) && compType.Implements(reflect.TypeOf((*Resilient)(nil)).Elem()) {
	props["daprResiliencyPolicyDefined"] = "true"
}

err = store.Init(state.Metadata{
	Properties: props,
})

halspang avatar Jul 07 '22 22:07 halspang

@berndverst - Any thoughts on this?

halspang avatar Aug 02 '22 16:08 halspang

LGTM

Agree that in the future implementation of Resiliency should be a requirement for stable component certification.

So every error not explicitly returned as backoff.Permanent is considered retriable? Sounds good, we will need to evaluate what errors are really permanent and which ones aren't. Could be handy for Azure components for example to create a utility which parses the internal status code and returns backoff.Permanent if the internal code was certain 4XX status codes.

Also as discussed initResiliency(metadata map[string]string) error is a terrible name. It makes more sense to call this modifyComponentBehaviorForResiliency or something like that. The purpose is to disable / tweak component specific retries that are controlled via component metadata so that the resiliency policy can work its magic.

berndverst avatar Aug 11 '22 23:08 berndverst

So every error not explicitly returned as backoff.Permanent is considered retriable? Sounds good, we will need to evaluate what errors are really permanent and which ones aren't. Could be handy for Azure components for example to create a utility which parses the internal status code and returns backoff.Permanent if the internal code was certain 4XX status codes.

A lot of this proposal is about forcing people to think about the individual component and what it means to be retryable. I think for components like Azure (and probably GCP/AWS) where they have a commonality in the SDKs will take to a common utility file well.

Also as discussed initResiliency(metadata map[string]string) error is a terrible name. It makes more sense to call this modifyComponentBehaviorForResiliency or something like that. The purpose is to disable / tweak component specific retries that are controlled via component metadata so that the resiliency policy can work its magic.

Yeah, I'm not sure what my initial thought was on the naming there. Updated it in the proposal.

halspang avatar Aug 12 '22 16:08 halspang

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

dapr-bot avatar Sep 11 '22 17:09 dapr-bot

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as pinned, good first issue, help wanted or triaged/resolved. Thank you for your contributions.

dapr-bot avatar Sep 18 '22 17:09 dapr-bot

This will need design work for runtime, and then refactorings of every single component which is a lot of work - similar to the effort to do stable certification of components.

berndverst avatar Oct 31 '23 16:10 berndverst