Failures in external services configured in Envoy should be visible to the application developer
We have a few requests currently to allow the configuration of external services of various kinds in Envoy: #432, #1691, and #370.
What all of these requests have in common is that, with the current Contour design, Contour can do nothing other than pass the configuration through to Envoy.
This means that, for example, there is no way for someone configuring external auth for a particular HTTPProxy to know if that external auth is working correctly without having access to the external auth service.
Another example is that there is no way for someone using a service with a rate-limiter to know if their service has tripped the rate limit.
When something does go wrong with one of these services, there needs to be a way for someone using them indirectly to know where the problem is.
I think that this problem requires two things:
Contour should be able to health check clusters in Envoy
All of these external services must be configured as clusters in Envoy. So, if Contour has a way to check the state of a cluster in Envoy (whether it has healthy endpoints, or some other information about it), then we have the information we need to pass on to the application developer.
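One possible source for this information is Envoy's admin interface: `GET /clusters?format=json` reports per-cluster host health. The sketch below extracts healthy-host counts from that response; the sample JSON is hand-written for illustration and models only a small subset of the real response shape:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// clustersResponse models a minimal subset of the admin
// /clusters?format=json response.
type clustersResponse struct {
	ClusterStatuses []struct {
		Name         string `json:"name"`
		HostStatuses []struct {
			HealthStatus struct {
				EDSHealthStatus string `json:"eds_health_status"`
			} `json:"health_status"`
		} `json:"host_statuses"`
	} `json:"cluster_statuses"`
}

// healthyCounts returns, per cluster, how many hosts report HEALTHY.
func healthyCounts(raw []byte) (map[string]int, error) {
	var resp clustersResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, c := range resp.ClusterStatuses {
		n := 0
		for _, h := range c.HostStatuses {
			if h.HealthStatus.EDSHealthStatus == "HEALTHY" {
				n++
			}
		}
		counts[c.Name] = n
	}
	return counts, nil
}

func main() {
	// Hand-written sample mimicking the admin response for one
	// hypothetical external-auth cluster with two hosts.
	sample := []byte(`{"cluster_statuses":[{"name":"extauth","host_statuses":[
		{"health_status":{"eds_health_status":"HEALTHY"}},
		{"health_status":{"eds_health_status":"UNHEALTHY"}}]}]}`)
	counts, err := healthyCounts(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(counts["extauth"]) // prints 1
}
```

This would require Contour to reach Envoy's admin listener, which is a deployment question in its own right.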
Contour should be able to expose external service health info
Contour should be able to expose external service health info in the relevant place, whether that is a status field on an object like an HTTPProxy, a log line in Contour, a metric, or some combination of the above.
Obviously the first is a requirement for the second.
I'm not sure of the best way to check Envoy clusters from Contour, whether that's via gRPC, fetching the stats, or something else.
This issue is to cover:
- if this is a good idea
- doing the two steps if it is.
In principle, I think that exposing Envoy's information about clusters is a good idea. It's useful for regular services and also for special infrastructure services.
Note to self:
General envoy cluster metrics that could be used to support this feature:
| Name | Type | Desc |
|---|---|---|
| membership_healthy | Gauge | Current cluster healthy total (inclusive of both health checking and outlier detection) |
| membership_degraded | Gauge | Current cluster degraded total |
| membership_total | Gauge | Current cluster membership total |
| upstream_cx_none_healthy | Counter | Total times connection not established due to no healthy hosts |
upstream_cx_none_healthy is pretty interesting if we can use it to create a level-based signal. Otherwise, we need some more research into the membership metrics.
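For the membership gauges, a level-based signal could be as simple as comparing healthy membership to total membership. A minimal sketch; the health states and threshold logic here are assumptions, not a settled design:

```go
package main

import "fmt"

// ClusterHealth is a level-based signal derived from Envoy's
// membership_healthy and membership_total gauges.
type ClusterHealth string

const (
	Healthy  ClusterHealth = "Healthy"
	Degraded ClusterHealth = "Degraded"
	Down     ClusterHealth = "Down"
	Unknown  ClusterHealth = "Unknown"
)

// classify maps the two membership gauges onto a health level.
func classify(healthy, total int) ClusterHealth {
	switch {
	case total == 0:
		return Unknown // cluster has no known members at all
	case healthy == 0:
		return Down // members exist, but none are healthy
	case healthy < total:
		return Degraded
	default:
		return Healthy
	}
}

func main() {
	fmt.Println(classify(0, 3)) // prints Down
	fmt.Println(classify(2, 3)) // prints Degraded
	fmt.Println(classify(3, 3)) // prints Healthy
}
```

Unlike the upstream_cx_none_healthy counter, which only increments when traffic actually fails, the gauges give a level-based answer even when the service is idle.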
With the addition of ExtensionService, including a Conditions block, we have the space available for Contour to set a Ready condition on an ExtensionService, which would indicate that the service is up and able to receive traffic. I'm still not sure of the best way to gather this information, however. It may be that the best approach is to look up the Endpoints associated with the ExtensionService's Service and update the status from that, on the assumption that if there are Kubernetes Endpoints, then Envoy will be able to send traffic there.
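The Endpoints-lookup approach could look roughly like this. To keep the sketch self-contained it models the Kubernetes Endpoints object with minimal stand-in structs rather than importing client-go, and the condition type and reasons are assumptions:

```go
package main

import "fmt"

// EndpointSubset is a stand-in for the corev1.EndpointSubset shape:
// addresses that are ready to receive traffic, and those that are not.
type EndpointSubset struct {
	Addresses         []string
	NotReadyAddresses []string
}

// Endpoints is a stand-in for the corev1.Endpoints shape.
type Endpoints struct {
	Subsets []EndpointSubset
}

// Condition mirrors the shape of a status condition on an ExtensionService.
type Condition struct {
	Type   string
	Status string // "True" or "False"
	Reason string
}

// readyCondition derives a Ready condition from the Endpoints backing the
// ExtensionService's Service, on the assumption that if ready Kubernetes
// endpoints exist, Envoy will be able to send traffic there.
func readyCondition(eps Endpoints) Condition {
	for _, s := range eps.Subsets {
		if len(s.Addresses) > 0 {
			return Condition{Type: "Ready", Status: "True", Reason: "EndpointsAvailable"}
		}
	}
	return Condition{Type: "Ready", Status: "False", Reason: "NoReadyEndpoints"}
}

func main() {
	eps := Endpoints{Subsets: []EndpointSubset{{Addresses: []string{"10.0.0.5"}}}}
	fmt.Println(readyCondition(eps).Status) // prints True
}
```

The weakness of this approach is exactly the assumption in the comment: Kubernetes readiness says nothing about whether Envoy itself can reach the endpoints, which the admin-interface or metrics approaches would capture.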
The Contour project currently lacks enough contributors to adequately respond to all Issues.
This bot triages Issues according to the following rules:
- After 60d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, the Issue is closed
You can:
- Mark this Issue as fresh by commenting
- Close this Issue
- Offer to help out with triage
Please send feedback to the #contour channel in the Kubernetes Slack