internal/builder: One bad route should not break other routes

Open stevesloka opened this issue 5 years ago • 19 comments

I have an HTTPProxy, which I'll share below. One route in the proxy is invalid because the port in the HTTPProxy route does not match the service it's configured for. Contour shows the proper status on that proxy; however, since one route is bad, the entire proxy is thrown out, causing all other routes to fail.

Since this route is part of my "root" proxy, it took down my entire site. =)

Root Proxy:

apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: root
  namespace: projectcontour-roots
spec:
  includes:
  - conditions:
    - prefix: /blog
    - header:
        name: User-Agent
        notcontains: Chrome
    - header:
        name: User-Agent
        notcontains: Firefox
    name: blogsite
    namespace: projectcontour-marketing
  routes:
  - conditions:
    - prefix: /get
    services:
    - name: external
      port: 80  # <---- Invalid port!
  - conditions:
    - prefix: /secure
    services:
    - name: secureapp-default
      port: 80
  - conditions:
    - prefix: /secure
    - header:
        contains: Chrome
        name: User-Agent
    services:
    - name: secureapp
      port: 80
  virtualhost:
    fqdn: containersteve.com
    tls:
      secretName: containersteve-secret
status:
  currentStatus: invalid
  description: Service [external:80] is invalid or missing
$ kubectl get proxy -A                                                                                                                                                                                                  
NAMESPACE                  NAME       FQDN                 TLS SECRET              STATUS    STATUS DESCRIPTION
projectcontour-marketing   blogsite                                                valid     valid HTTPProxy
projectcontour-marketing   infosite                                                valid     valid HTTPProxy
projectcontour-roots       root       containersteve.com   containersteve-secret   invalid   Service [external:80] is invalid or missing

stevesloka avatar Dec 09 '19 19:12 stevesloka

/cc @jpeach @youngnick

I don't have a position on this; we've gone back and forth, both deliberately and accidentally, on what constitutes invalid a few times in the past.

My ask is:

  • If the decision is to continue on an invalid route, what is the status of the object? Is it invalid? Likely not, as it's (partially) working. If so, how do we communicate this list of warnings to users?
  • Consistency: if an HTTPProxy is not invalid because one of its routes is missing a service, what other sets of things do we consider to be non-fatal errors? This list should be enumerated up front before any code is cut.

davecheney avatar Dec 09 '19 20:12 davecheney

Removing the bug tag; this is working as intended -- although the intention is the point of debate.

davecheney avatar Dec 09 '19 20:12 davecheney

I think the status (UX) of the object and the functionality of the site are separate issues. In my example, I literally took down the entire FQDN for all my applications. Since the misconfiguration was on a single route, the object can be in error, but at the least Contour should retain the old config or ignore that specific route.

I do agree we should document the cases and then update the logic to match, but specifically for this issue, a single route on an object shouldn't break my entire cluster. This is another good spot for an admission controller: if we implemented one, we could catch this issue before it's even processed and leave the currently running config intact, which we can't do now.
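As a rough sketch of what that could look like (none of this exists today; the webhook name, backing Service, and path below are all placeholders), a validating webhook registered against HTTPProxy writes might be wired up something like this:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: httpproxy-validation
webhooks:
- name: httpproxy.validation.projectcontour.io   # hypothetical webhook name
  rules:
  - apiGroups: ["projectcontour.io"]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["httpproxies"]
  clientConfig:
    service:
      name: contour-validation      # placeholder Service running the validator
      namespace: projectcontour
      path: /validate
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Ignore             # don't block API writes if the validator itself is down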

stevesloka avatar Dec 09 '19 20:12 stevesloka

There are two separable issues here:

  1. Should HTTPProxy have all-or-nothing semantics?
  2. If a HTTPProxy update has an error, what happens to the previous config?

For (1), I think that we have, and should have, all-or-nothing semantics. That is, we should never partially update routes on a HTTPProxy object.

For (2), if a HTTPProxy change results in an invalid config, then we should leave the previous config in place. This is tricky, both in implementation and in communicating an appropriate status to users.

I think that (1) is what Dave is describing above, and (2) is the behavior that Steve wants :)

jpeach avatar Dec 09 '19 21:12 jpeach

Perhaps a solution is to change Contour to always assume the service/port combination is valid. The linkage between RDS and CDS is by name alone, so if it turns out the service isn't available, the route will be answered with a 504 by Envoy at runtime. This isn't much different in practice from a cluster going to zero active endpoints as the request arrives.
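Roughly, the shape of the Envoy config this implies (field names follow Envoy's RDS route API; the cluster name is only illustrative of the name-based linkage):

virtual_hosts:
- name: containersteve.com
  domains: ["containersteve.com"]
  routes:
  - match:
      prefix: /get
    route:
      cluster: projectcontour-roots/external/80   # name-only reference to a CDS cluster

If no such cluster (or no endpoint) exists when a request arrives, Envoy fails that one request at runtime instead of Contour rejecting the whole HTTPProxy up front.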

davecheney avatar Dec 09 '19 21:12 davecheney

I believe I encountered the same issue, which brought down routing for all ingress traffic. My case was a little different because one of the included HTTPProxy resources didn't exist, but as far as I know it boils down to the same issue as an invalid route.

This issue isn't prioritized on the Contour project board, so I wonder if there is any sense of when work to resolve it is planned?

niels-s avatar Sep 23 '20 13:09 niels-s

This work is probably also blocked behind #2495, since with that we'll be able to add warnings to a HTTPProxy instead of just marking it invalid.

I agree that a single error in a single route should not render the whole HTTPProxy invalid, but the semantics of using includes make this a little tricky to solve in practice. Hopefully, being able to add warnings should be helpful.

@niels-s, as a user, how would you expect the HTTPProxy in @stevesloka's example above to work? I have some ideas, but I'd like to hear from you without influencing you first. :)

youngnick avatar Sep 24 '20 00:09 youngnick

The ultimate goal from an operational standpoint is to make sure we don't interrupt traffic because of an invalid configuration.

In that sense, I agree with @jpeach's approach. Applying all-or-nothing semantics is probably the most logical approach; the tricky part would be clearly communicating the invalid status to users.

I guess everyone would choose to keep an "old" valid routing config in place instead of using a partially valid one, because that could also negatively impact your traffic.

niels-s avatar Sep 24 '20 09:09 niels-s

I ran into this as well, but differently from the experiences described here. Everything discussed so far seems to revolve around applying a "bad" HTTPProxy. I found that it's possible to bring down an entire HTTPProxy virtual host when a backend Service is deleted. Once the Service is deleted, the HTTPProxy enters the invalid status as the OP mentioned, and all routing for it is stopped.

shopeonarope avatar Nov 19 '20 16:11 shopeonarope

I'd like to get this fixed up so we don't run into these types of problems as easily.

The problem with storing the "last good config" is where do you store the config? I guess the CRD has the last

  1. An admission controller was the original idea for implementing this, which would allow invalid configs to be rejected before they are committed to the API server. Doing that requires a DAG build in the controller or a callback to Contour.
  2. Another approach is to reject the invalid bits of the object but allow the others to pass, exposing the warnings in Conditions. This wouldn't stop a bad config from taking down a portion of the routing, but it would limit the impact to just that single path.

I think we could start with 2 as described (roughly along the lines of the sketch below), then investigate 1, which is a much more complex integration.
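As a rough illustration of option 2 (this status shape is hypothetical -- the condition type, reason, and fields are not an existing Contour API), the object could stay valid overall while surfacing the dropped route as a warning:

status:
  currentStatus: valid
  description: valid HTTPProxy (1 route ignored, see conditions)
  conditions:
  - type: Warning                      # hypothetical condition type
    status: "True"
    reason: ServiceResolutionError     # hypothetical reason
    message: route /get ignored; Service [external:80] is invalid or missing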

@skriss @jpeach @youngnick thoughts?

stevesloka avatar Nov 19 '20 16:11 stevesloka

One idea I had was that we could use something like the Deployment -> ReplicaSet model -- each time the Deployment's spec changes, a new ReplicaSet, owned by the Deployment, is created to instantiate the new version of the spec. So for Contour, each time the HTTPProxy spec changed, a new HTTPProxyInstance (name TBD) could be created representing the new version of the spec. This could be helpful for supporting some of these versioning ideas, since you could keep the older HTTPProxyInstances around in case the new one was invalid, and serve traffic based on them until the newer spec was fixed.
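Purely as an illustration of the idea (HTTPProxyInstance doesn't exist; the kind, apiVersion, and naming here are all made up), it could look something like:

apiVersion: projectcontour.io/v1alpha1    # hypothetical
kind: HTTPProxyInstance                   # hypothetical kind, name TBD
metadata:
  name: root-7b9f4c                       # one instance per spec revision
  namespace: projectcontour-roots
  ownerReferences:                        # owned by the HTTPProxy (uid omitted for brevity)
  - apiVersion: projectcontour.io/v1
    kind: HTTPProxy
    name: root
spec:
  # frozen snapshot of the owning HTTPProxy's spec at the time of the change;
  # an older, still-valid instance could keep serving traffic while a newer,
  # invalid one is being fixed
  virtualhost:
    fqdn: containersteve.com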

skriss avatar Nov 19 '20 18:11 skriss

If the non-root proxy is invalid or missing, we can keep the route but emit a 404. If the service isn't there, we can emit a 503.

jpeach avatar Nov 19 '20 18:11 jpeach

If the non-root proxy is invalid or missing, we can keep the route but emit a 404. If the service isn't there, we can emit a 503.

Fulfilling the intended path is probably a good idea. Since we're only doing prefix matches, if we simply threw out the route, another path would get the request, which may not be what users want.
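A minimal sketch of how that could be programmed, assuming Envoy's direct_response route action (the prefixes are taken from the example at the top of this issue):

routes:
- match:
    prefix: /blog          # include points at a missing/invalid HTTPProxy
  direct_response:
    status: 404
- match:
    prefix: /get           # route references a missing Service
  direct_response:
    status: 503

That way the matched prefix is answered in place instead of falling through to a different route.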

stevesloka avatar Nov 19 '20 18:11 stevesloka

xref #3071 #3039

skriss avatar Nov 19 '20 18:11 skriss

use something like the Deployment -> ReplicaSet model

Neat idea @skriss! I like the analogy to k8s types. =)

stevesloka avatar Nov 19 '20 18:11 stevesloka

I started with the 503 approach in #3071 but got stalled. It began with just programming a 503 route for missing backend services, but I quickly realized that is just one small part of what can go wrong. Basically, everything that is an error today became a warning, and if no clusters were left for the vhost, that would then become an error. I think it can make troubleshooting a bit more difficult in some situations.

Unfortunately I got stalled due to other work, but I will push what I have now to #3071 in just a minute. And by the way, I'm not offended if that work is abandoned :)

tsaarni avatar Nov 19 '20 18:11 tsaarni

I'm currently a -1 on the idea of having an intermediate resource representing the configuration, mainly because it feels to me like too much indirection. I agree that's how Deployments work, but that flow can be opaque if you're trying to figure out that something went wrong.

I could be convinced that the extra complexity is worth it, but I'd need to hear strong user feedback that serving 503s or 404s in the event a bad change is made is insufficient.

To put it another way, I worry that adding another CRD that kind of represents the status of the currently configured HTTPProxy is a lot of complexity, and we have lower-hanging fruit we could hit first. (For example, I don't think that a HTTPProxyInstance is useful unless we can be confident that configuration has actually been accepted into Envoy, which would entail solving the NACK problem, #1176).

youngnick avatar Nov 25 '20 22:11 youngnick

we have lower-hanging fruit we could hit first

Agree that we should start with low-hanging fruit. Was throwing it out there as more of a long-term idea, for us to discuss if it'd be useful or not.

skriss avatar Dec 08 '20 17:12 skriss