consul icon indicating copy to clipboard operation
consul copied to clipboard

Add the ability to retry on reset connection to service-routers

Open aoskotsky-amplify opened this issue 3 years ago • 13 comments

Description

This adds a ~RetryOnReset~ RetryOn option to service-router configs.

My apps are getting an error intermittently when using consul connect: upstream connect error or disconnect/reset before headers. reset reason: connection termination. Envoy has an option to automatically retry in case of this kind of error. Consul does not support this option at the moment so I'm implementing it here.

Testing & Reproduction steps

Create a service-router with the RetryOn option like below:

{
    "Kind": "service-router",
    "Name": "test",
    "Routes": [
        {
            "Match": {
                "HTTP": {
                    "PathPrefix": "/"
                }
            },
            "Destination": {
                "NumRetries": 3,
                "RetryOn": ["reset"],
                "Service": "test"
            }
        }
    ]
}

Links

Issue #10274

PR Checklist

  • [x] updated test coverage
  • [x] external facing docs updated
  • [x] not a security concern
  • [x] checklist folder consulted

aoskotsky-amplify avatar Apr 29 '22 04:04 aoskotsky-amplify

CLA assistant check
All committers have signed the CLA.

hashicorp-cla avatar Apr 29 '22 04:04 hashicorp-cla

Hey @aoskotsky-amplify

Thanks for the first-time contribution ( hope to see many more 👀 )! Generally we expose envoy-config on a per-case basis, but on first glance this use-case seems pretty solid. Since this wasn't made off an associated github issue, i'll talk this over with the team and let you know tommorow if we'd like to expose this config to everyone.

Then hopefully we'll get this reviewed and merged shortly after 👍

Amier3 avatar May 02 '22 14:05 Amier3

@aoskotsky-amplify

Blake just let me know that this change came from a conversation on gitter, so you're all good to go 👍 . We'll get this reviewed shortly and thank you again for the contribution!

Amier3 avatar May 02 '22 16:05 Amier3

@aoskotsky-amplify

Blake just let me know that this change came from a conversation on gitter, so you're all good to go 👍 . We'll get this reviewed shortly and thank you again for the contribution!

Awesome. Thanks!

aoskotsky-amplify avatar May 02 '22 16:05 aoskotsky-amplify

Hi @aoskotsky-amplify,

Thank you for this contribution. I had an opportunity to discuss this PR with the team. Here's the feedback from our review.

Envoy supports a number of retry conditions. At the time of writing, there are ten (10) x-envoy-retry-on conditions and five (5) x-envoy-retry-grpc-on conditions. Instead of exposing condition through their own variables–as is the current approach used in Consul and the one followed by this PR–we think a better solution is to expose retry_on as a top-level array in the service router configuration where users can specify a list of retry conditions.

Would you be willing to modify your PR to accommodate this alternative UX? If so, we can share a few UX options we are considering and get your feedback on them prior to you making any changes to the PR.

blake avatar May 05 '22 07:05 blake

Sure, I can modify the PR. What UX options are you considering?

aoskotsky-amplify avatar May 05 '22 12:05 aoskotsky-amplify

Hi @aoskotsky-amplify, here's the UX we were considering.

It would probably be beneficial to have input validation on the array to limit the retry conditions that can be specified. After looking at Envoy's docs, I believe we'd only want to allow the following conditions to be set:

  • 5xx
  • gateway-error
  • reset
  • connect-failure
  • envoy-ratelimited
  • retriable-4xx
  • refused-stream

gRPC retry conditions

  • cancelled
  • deadline-exceeded
  • internal
  • resource-exhausted
  • unavailable

Special handling for certain conditions

connect-failure

The connect-failure condition would need some special handling since there is currently a dedicated variable for enabling that retry condition, RetryOnConnectFailure.

If RetryOnConnectFailure is set to true, and connect-failure is not already present in retryPolicy.RetryOn, the connect-failure value should be appended to the list of retry conditions.

retriable-status-codes

I don't believe it makes sense to allow retriable-status-codes to be set directly in the the array of retry conditions because it is dependent on retryPolicy.RetriableStatusCodes also being configured. Since Consul does not configure Envoy to allow the x-envoy-retriable-status-codes header to be used to provide this list of codes (this is an alternative method for specifying the retry codes), the only option users have is to specify the list with RetryOnStatusCodes in the service router configuration.

In this case, I think it makes sense for that condition to only be enabled if one or more status codes are specified in RetryOnStatusCodes.

retriable-headers

The retriable-headers is dependent on a list of retriable headers also being configured. Similar to the retriable status codes, the list of headers cannot be specified using the x-envoy-retriable-header-names HTTP header.

Being that Consul does not have a configuration option that allows configuring retryPolicy.RetriableHeaders, I do not think it makes sense to allow this retry condition to be specified in the list.

What are your thoughts on this proposed UX?

blake avatar May 17 '22 00:05 blake

Hi @blake, that UX makes sense to me. I'll start working on it.

I have a couple of questions though:

  1. Does Consul support service-router configs with gRPC services? I see the service-router docs say services need to be HTTP. If if it does, do you think the retry conditions should be a separate config option? Would all the options be available under a RetryOn option or would there separately be a GrpcRetryOn?

  2. Do you think in the future there should be a RetryOnHeaders option in a separate PR?

aoskotsky-amplify avatar May 17 '22 02:05 aoskotsky-amplify

Hi @aoskotsky-amplify,

Service routers do support routing to gRPC services. The beginning of the doc implies under requirements that service routers can only be used with services using the http protocol, but this is not correct. Service routers can also route based on gRPC methods. See https://www.consul.io/docs/connect/config-entries/service-router#grpc-routing for an example.

Do you think in the future there should be a RetryOnHeaders option in a separate PR?

Yes, we would like to support this feature as well. I think it makes sense to add it in a separate PR.

blake avatar May 17 '22 19:05 blake

Oh I see. Do you think there should be separate options like RetryOn and GrpcRetryOn?

aoskotsky-amplify avatar May 17 '22 19:05 aoskotsky-amplify

This pull request has been automatically flagged for inactivity because it has not been acted upon in the last 60 days. It will be closed if no new activity occurs in the next 30 days. Please feel free to re-open to resurrect the change if you feel this has happened by mistake. Thank you for your contributions.

github-actions[bot] avatar Jul 17 '22 01:07 github-actions[bot]

@blake I updated this PR with your suggestions. Can you give this another review? Thanks

aoskotsky-amplify avatar Aug 16 '22 05:08 aoskotsky-amplify

Hey @blake @Amier3 can you please give this PR another review?

aoskotsky-amplify avatar Sep 06 '22 14:09 aoskotsky-amplify

@mkeeler could someone from Hashicorp review this PR? Thanks!

aoskotsky-amplify avatar Oct 03 '22 17:10 aoskotsky-amplify