
Contour should not 503 traffic to a down service if there is another service available in the Route

Open hmorikaw opened this issue 6 years ago • 34 comments

What steps did you take and what happened:

On my Kubernetes cluster I am using Contour as the ingress, running on a Gimbal cluster. There are 2 Contour pods and 10 Envoy pods running separately, and each Envoy pod serves 2 services. The Envoy pods health-check the endpoint nodes of each service.

apiVersion: contour.heptio.com/v1beta1
kind: IngressRoute
metadata:
  name: foo
  namespace: foobar
spec:
  virtualhost:
    fqdn: foobar.com
  routes:
    - match: /
      services:
        - name: service0
          port: 80
          healthCheck:
            path: /healthy
            intervalSeconds: 3
            timeoutSeconds: 1
            unhealthyThresholdCount: 3
            healthyThresholdCount: 5
          strategy: WeightedLeastRequest
        - name: service1
          port: 80
          strategy: WeightedLeastRequest
          healthCheck:
            path: /healthy
            intervalSeconds: 3
            timeoutSeconds: 1
            unhealthyThresholdCount: 3
            healthyThresholdCount: 5

After all endpoint nodes on service1 go down, the Envoy pods do not update their forwarding rules for service1 and return only a 'no healthy upstream' response.

What did you expect to happen:

If all endpoints of a service (service1) are down, the Envoy pods should update their traffic rules to forward requests only to service0 until at least one endpoint of service1 is healthy again.

Anything else you would like to add:

Environment:

  • Contour version: v0.11.0
  • Kubernetes version (use kubectl version): 1.14.0
  • Kubernetes installer & version: 1.14.0
  • Cloud provider or hardware configuration: Xeon E5-2683v4 2.10GHz / 2 CPUs / 512 GB RAM / SATA SSD 240 GB x 1 / 10GBase-T x 2
  • OS (e.g. from /etc/os-release): Ubuntu 16.04.6 LTS
  • CNI: Calico v3.4

hmorikaw avatar May 22 '19 23:05 hmorikaw

Hey @hmorikaw, I'd expect that with two upstreams, the other service would take the traffic if there weren't any healthy endpoints. Let me see if I can recreate the issue.

stevesloka avatar May 24 '19 19:05 stevesloka

Yes. I expect that when there aren't any healthy endpoints on a service (service0), the Envoy pods route traffic only to the other service (service1).

hmorikaw avatar May 29 '19 01:05 hmorikaw

@stevesloka Hi, any update on this issue? Were you able to reproduce the problem? We are still seeing it. If you can't reproduce a cluster with unhealthy endpoints, you can stop all the HTTP servers on one of the clusters.

// access the unhealthy endpoints cluster
$ curl -H 'Host:foobar.com' vip.addr.com/test01/
upstream connect error or disconnect/reset before headers
 
// access the healthy endpoints cluster
$ curl -H 'Host:foobar.com' vip.addr.com/test01/
under test01 01
 
$ cat ingress-test.yaml
apiVersion: contour.heptio.com/v1beta1
kind: IngressRoute
metadata:
  name: name-example-foo
  namespace: bar
spec:
  virtualhost:
    fqdn: foobar.com
  routes:
    - match: /
      services:
        - name: health-endpoints
          port: 80
          healthCheck:
            path: /healthy
            intervalSeconds: 3
            timeoutSeconds: 1
            unhealthyThresholdCount: 3
            healthyThresholdCount: 5
          strategy: WeightedLeastRequest
        - name: unhealthy-endpoints
          port: 80
          strategy: WeightedLeastRequest
          healthCheck:
            path: /healthy
            intervalSeconds: 3
            timeoutSeconds: 1
            unhealthyThresholdCount: 3
            healthyThresholdCount: 5

hmorikaw avatar Jun 10 '19 03:06 hmorikaw

Hey @hmorikaw, I just tested and am able to recreate the issue. I think the problem is an upstream cluster left with empty endpoints when there are multiple upstream services.

Let me do some more digging and gather some more data points so we can better identify where the issue is. We've recently done some work here (https://github.com/heptio/contour/pull/1110) but this still seems to be an issue.

stevesloka avatar Jun 10 '19 20:06 stevesloka

So after chatting with some folks in Envoy, when you have multiple upstreams and one goes 100% unhealthy, Envoy doesn't stop routing traffic to it.

Today the solution would be to get the weight to zero which I understand is difficult since a user would have to know the cluster is unhealthy.
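As a sketch of that stopgap, using the IngressRoute schema from the report above (the weight values here are illustrative, not taken from this thread):

```yaml
apiVersion: contour.heptio.com/v1beta1
kind: IngressRoute
metadata:
  name: foo
  namespace: foobar
spec:
  virtualhost:
    fqdn: foobar.com
  routes:
    - match: /
      services:
        - name: service0
          port: 80
          weight: 100   # takes all traffic while service1 is down
        - name: service1
          port: 80
          weight: 0     # manually drained; the operator has to notice the outage first
```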

I'm going to brainstorm some ideas to see how we can address this.

stevesloka avatar Jun 18 '19 14:06 stevesloka

I chatted with some folks upstream in the Envoy community and their suggestion was to combine both upstreams into the same cluster and use locality or host weighting to do the weighting, as these both respect host health.

I'm not sure the pros/cons of this solution as well as the impact to Contour, so I'd look for a design doc to discuss if either of these would work or a new design needs to come up.

stevesloka avatar Jul 18 '19 02:07 stevesloka

Hi, @stevesloka. Can you share more information or docs about the suggestion below?

use locality or host weighting to do the weighting

I'd like to know how to set the weights.

hmorikaw avatar Jul 18 '19 08:07 hmorikaw

Hey @hmorikaw! Here are the Envoy docs on what I was describing. We'd need to discuss bringing this into Contour since it requires some more configuration: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/locality_weight
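For reference, a minimal sketch of what locality weighting looks like in a raw Envoy ClusterLoadAssignment, with one hypothetical locality per backing service (the zone names and addresses are made up for illustration; this is not config Contour emits today):

```yaml
cluster_name: combined-upstreams
endpoints:
  - locality:
      zone: service0            # hypothetical: one locality per Kubernetes service
    load_balancing_weight: 90
    lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: 10.0.0.10, port_value: 80 }
  - locality:
      zone: service1
    load_balancing_weight: 10
    lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: 10.0.1.10, port_value: 80 }
```

The owning cluster would also need `common_lb_config: { locality_weighted_lb_config: {} }` set. Because host health is tracked per endpoint inside the single cluster, a fully unhealthy locality stops receiving traffic, which is the behavior this issue asks for.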

stevesloka avatar Jul 18 '19 13:07 stevesloka

I think I understand the underlying problem here. To solve it we will have to make some large changes to the way we generate clusters from k8s endpoint sets. This is out of scope for beta.1; scheduling it for rc1, but we might have to delay it until after Contour 1.0.

davecheney avatar Sep 09 '19 03:09 davecheney

@stevesloka I'm going to take this one. I'm worried we might need to move route.services.strategy up one level if we want to move from WeightedClusters to a Cluster with multiple ClusterLoadAssignment members.

davecheney avatar Oct 03 '19 00:10 davecheney

I'm moving this issue to the backlog. Changing the way Contour generates clusters is too large a risk this close to the 1.0 release.

davecheney avatar Oct 14 '19 03:10 davecheney

This seems no longer relevant. Please let me know if it needs to be reopened.

xaleeks avatar Aug 16 '21 18:08 xaleeks

To me this is super relevant. Given two upstream services, if one goes unhealthy, envoy will still try and load balance to it. It's come up a few times and is something we should at least document now as a known bug, but determine a longer fix.

stevesloka avatar Aug 16 '21 23:08 stevesloka

I don't disagree that we should sort this out, but I think the title and description need an update (since IngressRoute has been removed from Contour for some time).

youngnick avatar Aug 17 '21 04:08 youngnick

Hi @stevesloka, @youngnick,

is there any progress on that subject?

tannevaled avatar Oct 21 '21 07:10 tannevaled

I've updated the description here, but right now this is in the backlog @tannevaled. I think there's been some additional movement in related areas, but fixing this requires us to handle a surprisingly large number of edge cases, and needs a redesign of how we handle generating Envoy clusters from Endpoints.

So no progress to report at the moment, but we do plan on doing something about this at some point.

youngnick avatar Oct 26 '21 04:10 youngnick

I understand. What do you think, for the moment, of a solution using a Contour health-check container that could dynamically set the unresponsive cluster's weight to 0 in the HTTPProxy, without modifying the way Envoy clusters are generated?

tannevaled avatar Oct 28 '21 07:10 tannevaled

An additional controller to do that would make sense as a stopgap, I suppose. I'd be worried about induced apiserver latency in that case though.

youngnick avatar Oct 31 '21 23:10 youngnick

I don't like the idea of a new controller, this should be table stakes for Contour.

You could implement this today with a k8s readiness probe on the pods (like you suggested @tannevaled), then when they fail they'd go unhealthy and the endpoints would end up being zero. Contour for sure would know if a service had zero endpoints and could set the weight to zero to avoid the issue.
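For anyone trying that today, a minimal readiness-probe sketch (the image name and probe path are placeholders): when the probe fails, the pod is removed from the Service's Endpoints, which is the zero-endpoints signal Contour would see:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: example/app:latest    # placeholder image
      ports:
        - containerPort: 80
      readinessProbe:
        httpGet:
          path: /healthy           # placeholder; match your app's health endpoint
          port: 80
        periodSeconds: 3
        failureThreshold: 3
```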

However, just to note, with Gateway API, this would still result in a 503 even if you knew this case happened.

Ideally you'd be able to leverage Envoy health checks which would mean ditching an Envoy cluster per upstream like we do today and combine all of them into a single cluster utilizing localities which would let you set weights there but fail the health check appropriately.

stevesloka avatar Nov 01 '21 18:11 stevesloka

~~I agree that Contour should do this, but we haven't prioritized this work yet~~.

Edit: After thinking about this more in relation to #4125, I'd like to see a solution to this issue, but I have these concerns and thoughts:

  • Contour should not be editing the HTTPProxy that has been defined by a user. We need to respect the user's request, and do what's been asked for. Canary testing is currently done by setting a small weight (such as 1 percent), and checking to see if there are errors. If Contour silently makes those errors go away by editing the document (or even worse, silently changing the weight), users will have no way to know it. I think that surfacing the information that Contour has decided that your service is broken and is routing around it is vital, and requires more thought.
  • The above is why Gateway API has chosen the behavior that, if you ask a proportion of traffic to go to a service that's down, then that proportion of traffic should get a 503. That's what people who are running services currently expect.

If we're going to do something more clever than that, we're going to need to provide a way to tell people that we're not doing what they expect. I'm not willing to commit to this without a pretty detailed design of how we will educate users of Contour that a pretty common conception of how weighted services work will not hold if you're using Contour.

youngnick avatar Nov 01 '21 20:11 youngnick

The problem is that Envoy active health checks are enabled in the API, so the feature says, if the health checks fail for an upstream cluster, then traffic shouldn't route to that cluster. Contour doesn't respect that feature properly at the moment which is the reason for this issue.

The above is why Gateway API has chosen the behavior that, if you ask a proportion of traffic to go to a service that's down, then that proportion of traffic should get a 503. That's what people who are running services currently expect.

Are you suggesting @youngnick that if a cluster fails all of its health checks, then traffic should still be routed to it? Or if all endpoints are unhealthy (e.g. fail their readiness probe), then traffic should still route? If that's the case then health checks should be removed from Contour and it should be documented that traffic will just fail.

stevesloka avatar Nov 15 '21 18:11 stevesloka

Are you suggesting @youngnick that if a cluster fails all of its health checks, then traffic should still be routed to it? Or if all endpoints are unhealthy (e.g. fail their readiness probe), then traffic should still route? If that's the case then health checks should be removed from Contour and it should be documented that traffic will just fail.

What I'm saying is that, in the case where there are multiple, weighted services defined on a route and one of those services has zero available endpoints, the weighting should be respected, and 503s served at a proportional rate for that service.

So, if we have:

# rest of the proxy left out
spec:
  virtualhost:
    fqdn: weights.bar.com
  routes:
    - conditions:
      - prefix: /
      services:
        - name: s1
          port: 80
          weight: 10
        - name: s2
          port: 80
          weight: 90

And s1 has zero available endpoints, then 10 percent of requests should receive a 503 response.

On its face, this seems weird. Surely we can know that that service is down, and make it so that users don't see any problem, right? Yes, we can, but then we are not respecting what the HTTPProxy owner has asked us to do.

When someone defines weights on a service, they are saying "please send this proportion of traffic to the service". If we silently decide otherwise because that service is down, how do they know that? The canonical way for them to express their request is the spec of the object. They've told us what they want, and it's our job to give them that.

Does giving them what they want in this instance produce a bad outcome for end users? Arguably yes. But can we get around the fact that someone has asked us to do this? No.

I'm in agreement that we should work to prevent inadvertent breakage and footguns as far as possible. But in this case, the actual user of the software - the person who is exposing their app with Contour - has asked us for something, and it is not right to believe that we know better than them and decide that that traffic should be transparently routed somewhere else.

Now, there are some other cases as well:

  • You have two weighted services, and some endpoints of one of them are not available. In this case, it's part of our contract that we send a proportion of overall requests to the endpoints we have available. Contour has no business making a guess at how many endpoints should be available, so we need to send the traffic to the ones we have.

Both of these cases are extensions of simpler ones:

  • You have only one service, and all its endpoints are unavailable. In this case, we will end up serving a 503 for all requests. I don't think this is unreasonable.
  • You have only one service, and some of its endpoints are unavailable. In this case, we should send traffic to the endpoints we have available.

What you are proposing is for Contour to say "we know better than the person who is configuring their object, and you shouldn't be wanting 503s ever, so we're going to make it so you can't".

youngnick avatar Nov 15 '21 22:11 youngnick

There is a use case where the health check does not work as expected. Dynamically setting the weight to 0 is a way to make it work. I agree it needs an explicit keyword in the configuration saying that the user wants a dynamic weight so that their service does not fail.

tannevaled avatar Nov 16 '21 07:11 tannevaled

I'm sorry, but I don't understand. In the event that the healthcheck doesn't work as expected, and we fix up the weights dynamically, do we expect that people will go and notice that the weights have been changed to 0? Or will they assume that their service is working correctly because all the requests are succeeding?

I guess what I'm saying is, I would like to see a design written up that says:

  • what this engineering effort is protecting against (that is, how can the system end up in the zero-endpoints-for-a-service state)?
  • If Contour takes up doing magic to automatically fix this sort of error, how do we tell the user that it's happened? One person's magic traffic fix is another's "This isn't what I asked for, why does this happen?" bug. At the very least, I think we'd need a switch to enable this extra magic, probably on a per-route basis, but I am not sure.
  • Where would this functionality be built? Adding changing weights into Contour means that Contour now needs to write to the spec of HTTPProxy, which up until now has been the exclusive province of the owning user. That's not the end of the world, but it's a change to the expected flow. If we add this functionality into Contour itself without changing the spec though, then it's even harder to meet the first goal of making sure the user is aware that something has changed.

This problem really only seems to arise in the "multiple-routes-to-a-service" use case, so I'd also like to know:

  • in the event that we switch the active health-checking to Kubernetes healthchecks (mentioned earlier), do we end up in a different state? Contour will still end up with a service with zero healthy endpoints, and a sibling service on the same route.
  • Should we perhaps consider a mode in which we treat the services just as a combined set of endpoints, and build an Envoy cluster out of the endpoints of both? That way, if one of the services has zero endpoints, that just means that some of the endpoints in the super-cluster are gone, and this problem doesn't arise.

tl;dr I can see that there's a problem, but I need to be shown a way of fixing the problem that doesn't break expectations around weighted clusters for canary testing and other uses.

youngnick avatar Nov 17 '21 04:11 youngnick

I'm going to ask a dumb question here: what are the use cases for having multiple services for a given route in the steady state (i.e. when not doing a canary rollout)? Why would someone want that configuration/what would they use it for?

skriss avatar Nov 17 '21 22:11 skriss

The idea is to have all applications deployed the Kubernetes way: Contour hosted in the internet exchange room (secured but very expensive hosting) and the services hosted on multiple regional institutional clouds (highly insecure but very cheap, thousands of cores). The head Contour LBs route traffic to the mutually independent Kubernetes backend clusters. The only thing that does not work is the health checks. With this working we could do canary releases AND route all regional traffic from one region to another in case of failure (e.g. a power outage of the DC).

tannevaled avatar Nov 18 '21 07:11 tannevaled

I'm sorry, but I don't understand. In the event that the healthcheck doesn't work as expected, and we fix up the weights dynamically, do we expect that people will go and notice that the weights have been changed to 0? Or will they assume that their service is working correctly because all the requests are succeeding?

I guess what I'm saying is, I would like to see a design written up that says:

* what this engineering effort is protecting against (that is, how can the system end up in the zero-endpoints-for-a-service state)?

* If Contour takes up doing magic to automatically fix this sort of error, how do we tell the user that it's happened? One person's magic traffic fix is another's "This isn't what I asked for, why does this happen?" bug. At the very least, I think we'd need a switch to enable this extra magic, probably on a per-route basis, but I am not sure.

* Where would this functionality be built? Adding changing weights into Contour means that Contour now needs to _write_  to the spec of HTTPProxy, which up until now has been the exclusive province of the owning user.  That's not the end of the world, but it's a change to the expected flow. If we add this functionality into Contour itself without changing the spec though, then it's even harder to meet the first goal of making sure the user is aware that something has changed.

This problem really only seems to arise in the "multiple-routes-to-a-service" use case, so I'd also like to know:

* in the event that we switch the active health-checking to Kubernetes healthchecks (mentioned earlier), do we end up in a different state? Contour will still end up with a service with zero healthy endpoints, and a sibling service on the same route.

* Should we perhaps consider a mode in which we treat the services just as a combined set of endpoints, and build an Envoy cluster out of the endpoints of both? That way, if one of the services has zero endpoints, that just means that some of the endpoints in the super-cluster are gone, and this problem doesn't arise.

tl;dr I can see that there's a problem, but I need to be shown a way of fixing the problem that doesn't break expectations around weighted clusters for canary testing and other uses.

If you only place weights on HTTPProxy targets, you express the desired traffic split. If one target fails, the corresponding share of traffic must fail (canary use case). If you combine this with health checks, I think your intention is to have the desired split when everything is working fine, but to have Contour stop routing to unresponsive targets when they fail. I agree Contour should not alter the weight expressed by the user. Maybe we should have one weight corresponding to the desired amount of traffic and another weight (dynamically altered by Contour) corresponding to the routing reality when health checks are activated.

tannevaled avatar Nov 18 '21 08:11 tannevaled

What I'm saying is that, in the case where there are multiple, weighted services defined on a route and one of those services has zero available endpoints, the weighting should be respected, and 503s served at a proportional rate for that service. And if s1 has zero available endpoints, then 10 percent of requests should receive a 503 response.

Tend to agree with this. From a Contour admin's point of view, I made a conscious choice to route 10% of traffic to service1 by specifying that weight. I likely know the initial availability of those services and endpoints, but would like to be notified if anything changes.

xaleeks avatar Nov 23 '21 17:11 xaleeks

If you combine this with health checks, I think your intention is to have the desired split when everything is working fine, but to have Contour stop routing to unresponsive targets when they fail

To me, I think of health checking as just that, it lets me know that things are unhealthy, but not necessarily alter the desired behavior. I can see your argument very well though.

Going back to your original request

If all endpoints of a service (service1) are down, the Envoy pods should update their traffic rules to forward requests only to service0 until at least one endpoint of service1 is healthy again.

Feels like the crux of deciding whether to implement this behavior is if we see the envoy proxy as any other distributed system capable of things like auto re-balancing, remediating split brains etc. As a LB/proxy, maybe it makes sense but this is typically for cases where the backends are virtually indistinguishable and often hidden from the user for selection. These Services and Endpoints in k8s do not feel that way to me, the Contour admin should be well aware of what they represent and they have targeted uses for them.

xaleeks avatar Nov 23 '21 18:11 xaleeks

@youngnick moved this to the Prioritized Backlog because we need to address this better in the docs, even if a fix doesn't land immediately in a release.

xaleeks avatar Nov 23 '21 18:11 xaleeks