
Failures in external services configured in Envoy should be visible to the application developer

Open youngnick opened this issue 5 years ago • 4 comments

We have a few requests currently to allow the configuration of external services of various kinds in Envoy: #432, #1691, and #370.

The thing that all of these have in common is that with our current Contour design, there is no way for Contour to do anything other than pass the config to Envoy.

This means that, for example, there is no way for someone configuring external auth for a particular HTTPProxy to know if that external auth is working correctly without having access to the external auth service.

Another example is that there is no way for someone using a service with a rate-limiter to know if their service has tripped the rate limit.

When something does go wrong with one of these services, there needs to be a way for someone using them indirectly to know where the problem is.

I think that this problem requires two things:

Contour should be able to health check clusters in Envoy

All the external services must be configured as a cluster in Envoy. So, if Contour has a way to check the state of a cluster in Envoy (whether it has healthy endpoints, or some other information about it), then we have the information we need to pass to the application developer.
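
One possible mechanism (an assumption on my part, not a settled design) is to poll Envoy's admin interface at `/clusters?format=json` and count the hosts that pass both active health checking and outlier detection. A minimal Go sketch, using a pared-down subset of the admin response schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// clustersDump is a minimal subset of the envoy.admin.v3.Clusters message
// returned by the admin endpoint /clusters?format=json. Only the fields
// needed for a per-cluster health summary are kept.
type clustersDump struct {
	ClusterStatuses []struct {
		Name         string `json:"name"`
		HostStatuses []struct {
			HealthStatus struct {
				FailedActiveHealthCheck bool   `json:"failed_active_health_check"`
				FailedOutlierCheck      bool   `json:"failed_outlier_check"`
				EDSHealthStatus         string `json:"eds_health_status"`
			} `json:"health_status"`
		} `json:"host_statuses"`
	} `json:"cluster_statuses"`
}

// healthyHosts returns, per cluster, how many hosts currently pass both
// active health checking and outlier detection.
func healthyHosts(raw []byte) (map[string]int, error) {
	var dump clustersDump
	if err := json.Unmarshal(raw, &dump); err != nil {
		return nil, err
	}
	out := make(map[string]int)
	for _, c := range dump.ClusterStatuses {
		n := 0
		for _, h := range c.HostStatuses {
			hs := h.HealthStatus
			if !hs.FailedActiveHealthCheck && !hs.FailedOutlierCheck &&
				hs.EDSHealthStatus != "UNHEALTHY" {
				n++
			}
		}
		out[c.Name] = n
	}
	return out, nil
}

func main() {
	// Sample trimmed from a /clusters?format=json response (illustrative;
	// the cluster name here is hypothetical).
	sample := []byte(`{"cluster_statuses":[{"name":"extension/auth",
	 "host_statuses":[
	  {"health_status":{"eds_health_status":"HEALTHY"}},
	  {"health_status":{"failed_active_health_check":true}}
	 ]}]}`)
	counts, err := healthyHosts(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(counts["extension/auth"]) // one of the two hosts is healthy
}
```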

Contour should be able to expose external service health info

Contour should be able to expose external service health info in the relevant place, whether that is a status field on an object like a HTTPProxy, a log line in Contour, a metric, or some combination of the above.
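
For the metric form of this, one sketch is a gauge per external service in the Prometheus text exposition format (the metric name below is hypothetical; Contour would pick its own naming):

```go
package main

import "fmt"

// externalServiceHealthLine renders one sample in the Prometheus text
// exposition format. The metric name contour_externalservice_healthy_endpoints
// is illustrative only, not an existing Contour metric.
func externalServiceHealthLine(namespace, name string, healthy int) string {
	return fmt.Sprintf(
		"contour_externalservice_healthy_endpoints{namespace=%q,name=%q} %d",
		namespace, name, healthy)
}

func main() {
	fmt.Println(externalServiceHealthLine("projectcontour", "auth", 2))
}
```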

Obviously the first is a requirement for the second.

I'm not sure of the best way to check Envoy clusters from Contour, whether it's some gRPC thing, checking the stats by fetching them, or something else.

This issue is to cover:

  • if this is a good idea
  • doing the two steps if it is.

youngnick avatar Mar 05 '20 04:03 youngnick

In principle, I think that exposing envoy's information about clusters is a good idea. It's useful for regular services and also for special infrastructure services.

jpeach avatar Mar 11 '20 04:03 jpeach

Note to self:

General envoy cluster metrics that could be used to support this feature:

Name                      Type     Description
membership_healthy        Gauge    Current cluster healthy total (inclusive of both health checking and outlier detection)
membership_degraded       Gauge    Current cluster degraded total
membership_total          Gauge    Current cluster membership total
upstream_cx_none_healthy  Counter  Total times connection not established due to no healthy hosts

upstream_cx_none_healthy is pretty interesting if we can use it to create a level-based signal. Otherwise, the membership metrics need some more research.
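
One way to turn the monotonic upstream_cx_none_healthy counter into a level-based signal is to compare successive scrapes: any increase between samples means Envoy failed to find a healthy host during that window. An illustrative sketch (not Contour code):

```go
package main

import "fmt"

// noneHealthyDetector turns the monotonic upstream_cx_none_healthy counter
// into a level signal: Observe reports true for a scrape interval during
// which the counter advanced, i.e. Envoy could not find a healthy host.
type noneHealthyDetector struct {
	last uint64
}

func (d *noneHealthyDetector) Observe(counter uint64) bool {
	// Counters reset to zero when Envoy restarts; a decrease is treated
	// as a reset rather than a failure.
	tripped := counter > d.last
	d.last = counter
	return tripped
}

func main() {
	var d noneHealthyDetector
	fmt.Println(d.Observe(0)) // false: no failed connection attempts yet
	fmt.Println(d.Observe(3)) // true: failures occurred since the last scrape
	fmt.Println(d.Observe(3)) // false: no new failures, level back to healthy
}
```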

jpeach avatar Apr 20 '20 05:04 jpeach

With the addition of ExtensionService, including a Conditions block, we have the space available for Contour to set a Ready condition on an ExtensionService, which would indicate that the service is up and able to receive traffic. I'm still not sure of the best way to gather this information, however. It could be that the best approach is to look up the Endpoints associated with the ExtensionService's Service and update the status from that (on the assumption that if there are Kubernetes Endpoints, then Envoy will be able to send traffic there).
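
The Endpoints-lookup idea above could be sketched as follows. The types here are pared-down stand-ins for the corev1 and Contour API objects, and the condition reasons and wording are assumptions, not the actual ExtensionService API:

```go
package main

import "fmt"

// endpoints is a stand-in for corev1.Endpoints: just the ready addresses
// backing the ExtensionService's Service.
type endpoints struct {
	Addresses []string
}

// condition is a stand-in for a status condition on an ExtensionService.
type condition struct {
	Type    string
	Status  string // "True" or "False"
	Reason  string
	Message string
}

// readyCondition derives a Ready condition from the Endpoints of the
// backing Service, on the assumption that if Kubernetes reports ready
// addresses, Envoy will be able to send traffic there.
func readyCondition(ep endpoints) condition {
	if len(ep.Addresses) == 0 {
		return condition{
			Type:    "Ready",
			Status:  "False",
			Reason:  "NoHealthyEndpoints",
			Message: "the backing Service has no ready endpoints",
		}
	}
	return condition{
		Type:    "Ready",
		Status:  "True",
		Reason:  "EndpointsReady",
		Message: fmt.Sprintf("%d ready endpoint(s)", len(ep.Addresses)),
	}
}

func main() {
	fmt.Println(readyCondition(endpoints{Addresses: []string{"10.0.0.5"}}).Status)
	fmt.Println(readyCondition(endpoints{}).Reason)
}
```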

youngnick avatar Sep 18 '20 00:09 youngnick

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the Issue is closed

You can:

  • Mark this Issue as fresh by commenting
  • Close this Issue
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

github-actions[bot] avatar Feb 28 '24 00:02 github-actions[bot]
