Add RetryPolicy in ExtensionServiceSpec

Open jerome-quere opened this issue 2 years ago • 6 comments

Hello,

I'm using Contour with an ExtensionService to manage authentication. I have an average load of 500 req/s. I'm experiencing an issue where, on some occasions, the connection to the authentication service is reset/closed by Envoy, resulting in the original request failing with a PERMISSION_DENIED error. This is not ideal for downstream services.

This issue almost always coincides with an Envoy cx destroy event.

Screenshot 2023-10-30 at 14 23 25

Since network connections are never 100% stable, what should we do to handle such errors and avoid returning a PERMISSION_DENIED error?

I'm wondering if it would be possible to add a RetryPolicy in the ExtensionServiceSpec to handle this type of situation.

jerome-quere avatar Oct 30 '23 16:10 jerome-quere

@jerome-quere that sounds reasonable, would you be interested in contributing a change here?

skriss avatar Nov 16 '23 17:11 skriss

Unrelated to the topic at hand, but it's great to see users diagnosing issues using Envoy's stats output. If you're interested in contributing in this area too, please let us know! Here's another issue that would be great to get some user input on: https://github.com/projectcontour/contour/issues/5655

sunjayBhatia avatar Nov 16 '23 17:11 sunjayBhatia

Is this supported by Envoy? iiuc Envoy doesn't support retries for ext-auth services: https://github.com/envoyproxy/envoy/issues/17918

davinci26 avatar Dec 07 '23 02:12 davinci26

I think it should be possible via:

extension-envoy-filters-http-ext-authz.grpc_service.envoy_grpc.retry_policy setting
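
To make the field path concrete, here is a minimal sketch (in Python, only to build and print the structure) of where `retry_policy` would sit inside an ext_authz filter config. The helper name `ext_authz_filter`, the cluster name, and the retry values are placeholders for illustration, not Contour's actual output. Whether Envoy honors this policy for ext_authz calls is exactly the open question in this thread, so treat it as a shape to discuss rather than a working fix.

```python
import json


def ext_authz_filter(cluster_name, num_retries=3,
                     base_interval="0.1s", max_interval="1s"):
    """Build an ext_authz HTTP filter config dict with a gRPC retry_policy.

    Hypothetical helper: the values are illustrative assumptions, and
    Envoy may not apply this policy to authorization requests.
    """
    return {
        "name": "envoy.filters.http.ext_authz",
        "typed_config": {
            "@type": "type.googleapis.com/"
                     "envoy.extensions.filters.http.ext_authz.v3.ExtAuthz",
            "grpc_service": {
                "envoy_grpc": {
                    # Placeholder cluster name for the ExtensionService.
                    "cluster_name": cluster_name,
                    # The field under discussion.
                    "retry_policy": {
                        "num_retries": num_retries,
                        "retry_back_off": {
                            "base_interval": base_interval,
                            "max_interval": max_interval,
                        },
                    },
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(ext_authz_filter("extension/auth/my-authserver"), indent=2))
```

Running the script prints the fragment as JSON, which can be compared against the filter config in Envoy's config dump.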

jerome-quere avatar Dec 14 '23 15:12 jerome-quere

Ty for the pointer. The docs document this field as:

Indicates the retry policy for re-establishing the gRPC stream

I read this as a TCP-level retry that only applies to stream establishment, whereas Contour's RetryPolicy is a more generic retry policy.

I think it might be confusing for users if we expose the entire RetryPolicy object when it is only supported with many asterisks.

Knowing the asterisks, is this the type of retry you want?

davinci26 avatar Dec 18 '23 14:12 davinci26

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the Issue is closed

You can:

  • Mark this Issue as fresh by commenting
  • Close this Issue
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

github-actions[bot] avatar Feb 19 '24 00:02 github-actions[bot]