
Envoy 1.21.0 broke the ability to perform plaintext healthchecks against upstream TLS clusters

Open jamesmulcahy opened this issue 3 years ago • 10 comments

Our standard configuration for an upstream cluster is to configure TLS (and ALPN) but also include a transport_socket_match which we use to perform plain-text healthchecks against upstream clusters. TLS healthchecks are very expensive and not viable for us. We follow a scheme which is similar to what's documented here.

This behavior broke following release 1.21.0. The release notes stated:

"auto_config: auto_config now verifies that any transport sockets configured via transport_socket_matches support ALPN. This behavioral change can be temporarily reverted by setting runtime guard envoy.reloadable_features.correctly_validate_alpn to false."

Is there another way of achieving what we need here? As it stands, this change seems overly aggressive -- if you want to use ALPN at all, all of your transport socket options must also do so -- i.e. no plaintext healthchecks are possible.
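For anyone hitting this in the interim: per the release notes quoted above, the old behavior can be temporarily restored by forcing the runtime guard off. A minimal sketch of that override as a static runtime layer in the bootstrap (YAML shown for the sake of inline comments; field names follow the Envoy bootstrap API):

```yaml
# Bootstrap fragment: a static runtime layer that flips the guard
# named in the 1.21.0 release notes, disabling the new ALPN validation.
# Caveat: runtime guards like this are usually deleted a few releases
# after introduction, so this is a stopgap rather than a fix.
layered_runtime:
  layers:
  - name: static
    static_layer:
      envoy.reloadable_features.correctly_validate_alpn: false
```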

jamesmulcahy avatar Aug 29 '22 16:08 jamesmulcahy

Here is a cut-down example configuration which demonstrates this issue. Envoy exits with:

ALPN configured for cluster meshcontrol_xds which has a non-ALPN transport socket matcher : name: "meshcontrol_xds"

{
  "node":  {
    "id":  "proxyd-e578490b-cb02-46c7-94f8-99818f7caa60",
    "cluster":  "mesh-snfortio",
    "metadata":  {
      "region":  "us-east-1"
    },
    "locality":  {
      "region":  "us-east-1",
      "zone":  "us-east-1e"
    }
  },
  "staticResources":  {
    "listeners":  [
    ],
    "clusters":  [
      {
        "transportSocketMatches":  [
          {
            "name":  "disableTLS",
            "match":  {
              "disableTLS":  true
            },
            "transportSocket":  {
              "name":  "envoy.transport_sockets.raw_buffer"
            }
          }
        ],
        "name":  "meshcontrol_xds",
        "type":  "STRICT_DNS",
        "connectTimeout":  "1s",
        "loadAssignment":  {
          "clusterName":  "meshcontrol_xds",
          "endpoints":  [
            {
              "lbEndpoints":  [
                {
                  "endpoint":  {
                    "address":  {
                      "socketAddress":  {
                        "address":  "redacted.net",
                        "portValue":  8980
                      }
                    }
                  }
                }
              ]
            }
          ]
        },
        "healthChecks":  [
          {
            "timeout":  "10s",
            "interval":  "60s",
            "unhealthyThreshold":  10,
            "healthyThreshold":  10,
            "tcpHealthCheck":  {},
            "unhealthyInterval":  "2s",
            "eventLogPath":  "/run/proxyd/healthchecks.sock",
            "transportSocketMatchCriteria":  {
              "disableTLS":  true
            }
          }
        ],
        "typedExtensionProtocolOptions":  {
          "envoy.extensions.upstreams.http.v3.HttpProtocolOptions":  {
            "@type":  "type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions",
            "autoConfig":  {}
          }
        },
        "commonLbConfig":  {
          "healthyPanicThreshold":  {}
        },
        "transportSocket":  {
          "name":  "envoy.transport_sockets.tls",
          "typedConfig":  {
            "@type":  "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext",
            "commonTlsContext":  {
              "tlsCertificateSdsSecretConfigs":  [
                {
                  "name":  "metatron_client_cert",
                  "sdsConfig":  {
                    "path":  "/run/proxyd/secrets/metatron_client_cert.json",
                    "resourceApiVersion":  "V3"
                  }
                }
              ],
              "combinedValidationContext":  {
                "defaultValidationContext":  {},
                "validationContextSdsSecretConfig":  {
                  "name":  "metatron_client_ca",
                  "sdsConfig":  {
                    "path":  "/run/proxyd/secrets/metatron_client_ca.json",
                    "resourceApiVersion":  "V3"
                  }
                }
              },
              "alpnProtocols":  [
                "h2",
                "http/1.1"
              ]
            }
          }
        },
        "upstreamConnectionOptions":  {
          "tcpKeepalive":  {
            "keepaliveProbes":  3,
            "keepaliveTime":  90,
            "keepaliveInterval":  10
          }
        },
      }
    ]
  },
  "dynamicResources":  {
    "ldsConfig":  {
      "ads":  {},
      "resourceApiVersion":  "V3"
    },
    "cdsConfig":  {
      "ads":  {},
      "resourceApiVersion":  "V3"
    },
    "adsConfig":  {
      "apiType":  "DELTA_GRPC",
      "transportApiVersion":  "V3",
      "grpcServices":  [
        {
          "envoyGrpc":  {
            "clusterName":  "meshcontrol_xds"
          }
        }
      ]
    }
  },
  "layeredRuntime":  {
    "layers":  [
      {
        "name":  "static",
        "staticLayer":  {
          "envoy.reloadable_features.correctly_validate_alpn":  true
        }
      },
      {
        "name":  "admin",
        "adminLayer":  {}
      },
      {
        "name":  "rtds-dynamic",
        "rtdsLayer":  {
          "name":  "rtds-dynamic",
          "rtdsConfig":  {
            "ads":  {},
            "resourceApiVersion":  "V3"
          }
        }
      }
    ]
  }
}

jamesmulcahy avatar Aug 29 '22 16:08 jamesmulcahy

Yeah this seems like a regression. IIRC @alyssawilk added the auto_config code. Alyssa any thoughts on this?

mattklein123 avatar Aug 30 '22 14:08 mattklein123

Those checks were added because auto_config is flat out not expected to work correctly with plaintext sockets. You can't do ALPN without ALPN support. I think it's largely luck that it worked, and I have concerns about trying to support it. We could add an (undocumented) bootstrap workaround to avoid the config checks for you, but I'm concerned whatever code path caused this to work will fail at some point in the future.

alyssawilk avatar Aug 31 '22 14:08 alyssawilk

I don't profess to be an expert on the code paths involved here, but for our use case of TCP healthchecks, I'm surprised that the presence of auto_config (and in turn ALPN/H2 negotiation) would matter?

It looks like the USE_ALPN flag is checked only in the ClusterImplBase (as part of the validation which is failing), and in one other location -- ClusterInfoImpl::upstreamHttpProtocol. But the latter surely wouldn't come into play for a plaintext TCP healthcheck socket?

It seems like there would have to be some sort of layering violation for there to be a codepath where this really ended up being an issue -- but I trust you're the expert here, @alyssawilk. I just wanted to be explicit that our interest is in plaintext TCP -- for a plaintext HTTP check, I can see how this might be more problematic.

As is, we're stuck on an older version of envoy due to this. If an undocumented bootstrap flag is the path forward, we'll happily push a PR for that - but I wonder if relaxing the constraint more generally makes sense?

jamesmulcahy avatar Aug 31 '22 15:08 jamesmulcahy

cc @snowp @mattklein123 @RyanTheOptimist thoughts? summary:

The mixed connection pool is designed for TLS ALPN. It "fails open" and does HTTP/1.1 if ALPN is not configured: https://github.com/envoyproxy/envoy/blob/main/source/common/http/mixed_conn_pool.cc#L45

The only intended use case was for ALPN over TLS https://github.com/envoyproxy/envoy/blob/main/api/envoy/extensions/upstreams/http/v3/http_protocol_options.proto#L96

At some point we realized the validation for "make sure if you configure ALPN you're using transport sockets that do ALPN" was broken and fixed it, which is causing an issue for @jamesmulcahy and co, who are doing mixed TLS and cleartext communication and want ALPN iff TLS, else HTTP/1.1.

Do we want to

  1. allow this by default (configure ALPN via AutoHttpConfig but only have it work if you set up transport sockets correctly)
  2. allow it if explicitly configured (have a boolean for "allow non-alpn transport sockets" for folks who want mixed traffic)
  3. only allow ALPN to do ALPN (and have a hidden runtime override for legacy use)

I proposed (3) above as doing (2) mildly offends me, but given @jamesmulcahy seems to have a legit working use case, I'm leaning towards stifling my inner pedant and doing (2) :-P Would definitely like others' thoughts first.

alyssawilk avatar Aug 31 '22 16:08 alyssawilk

It certainly feels ... dodgy ... to have a socket pool that is designed for TLS ALPN also be able to do cleartext. That makes me inclined to do option 3. But if we have a use case which we can't accomplish without that behavior, then I guess option 2 makes sense. If there's any way to avoid option 2, though, I think that would be ideal.

RyanTheOptimist avatar Aug 31 '22 17:08 RyanTheOptimist

In reading through this thread in detail, I guess my first question is why are you using auto config at all? Is the issue that this is a simplified example and in your real example you truly don't know whether the upstream supports HTTP/1 or HTTP/2 and need to use ALPN? Can you just hard code HTTP/1 or HTTP/2 for your use cases?

Aside from that, I do feel that using plain text health check is a legitimate use case that we should support in some way. I suppose one way would be to have an explicit transport socket and protocol options for health checking that is disjoint from all of the transport socket matching stuff. This is probably the cleanest and most proper solution, but this would be a huge change that I'm sure no one has appetite for.

Barring that, I think we have to do (1) or (2) above. Regarding (1), it actually doesn't seem too terrible to me to have some kind of special case for plain text which basically says that if you select plain text we ignore any ALPN requirement and just "auto" on HTTP/1 or something like that, and have it be clearly documented. In the above example HTTP health checks are not even being used, just TCP health checks, so none of the protocol stuff even comes into play.

Mostly just brainstorming here but I think what I would really be after is some way of making HC settings more disjoint from the rest of the cluster settings if the user wants, which seems like a reasonable request.

mattklein123 avatar Sep 01 '22 14:09 mattklein123

I want to reiterate that my use case here is for TCP healthchecks. We are not using plaintext HTTP healthchecks. It's been mentioned a couple of times that we're looking for (ALPN+TLS or Plaintext+HTTP) -- but that's not our use case. I dare say someone else may have that use case -- it would have worked before -- but it's not the use case I'm representing here.

Given that context, I think the answer to @mattklein123's question is more clear -- We're using auto_config because we don't know for sure if the upstream talks H1 or H2, and we want to use H2 for real traffic where possible. We're using the raw_buffer transport for plaintext TCP healthchecks against that endpoint, because we cannot afford the cost of TLS healthchecks.

All "real" traffic is H1/H2 TLS. All healthcheck traffic is plaintext TCP.

Regarding (1), it actually doesn't seem too terrible to me to have some kind of special case for plain text which basically says that if you select plain text we ignore any ALPN requirement and just "auto" on HTTP/1 or something like that, and have it be clearly documented

This is essentially how things worked before the correctly_validate_alpn check was added. A half-step back towards that would be to just ignore any raw_buffer transports as part of that check, and document the exclusion as @mattklein123 suggests.

jamesmulcahy avatar Sep 01 '22 17:09 jamesmulcahy

I want to reiterate that my use case here is for TCP healthchecks. We are not using plaintext HTTP healthchecks. It's been mentioned a couple of times that we're looking for (ALPN+TLS or Plaintext+HTTP) -- but that's not our use case. I dare say someone else may have that use case -- it would have worked before -- but it's not the use case I'm representing here.

Oh, thanks for the clarification. I think I completely missed that detail earlier. Sorry for derailing.

RyanTheOptimist avatar Sep 01 '22 18:09 RyanTheOptimist

Ok I don't think there's consensus here so I'm going to lean harder on (2) and say folks can add an option for do_not_enforce_alpn which will fail open to HTTP/1.1

@jamesmulcahy sorry for the delay thus far. You said you were up for the change so I'll leave the ball in your court but as I introduced the breaking change lmk if you want me to pick it up and I will ASAP!
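For reference, a sketch of what that proposal might look like in configuration. Note that `do_not_enforce_alpn` is only the field name proposed in this thread -- it does not exist in Envoy at the time of writing -- and its placement inside `auto_config` is likewise an assumption:

```yaml
# HYPOTHETICAL sketch only: do_not_enforce_alpn is the option proposed
# above, not an existing Envoy field. Everything else mirrors the
# auto_config setup from the reporter's cluster.
typed_extension_protocol_options:
  envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
    "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
    auto_config:
      do_not_enforce_alpn: true  # skip the ALPN check; non-ALPN sockets fail open to HTTP/1.1
```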

alyssawilk avatar Sep 19 '22 18:09 alyssawilk

Hi @alyssawilk. Our priorities have changed a bit since the last time we updated this thread, but is it feasible to address this soon? Currently we are going ahead with the upgrades by omitting this commit: https://github.com/envoyproxy/envoy/commit/be87cd4399433216848731021f99210547889738 -- but I think this might cause different issues for us down the road.

I see that you are suggesting we go with "allow it if explicitly configured (have a boolean for 'allow non-alpn transport sockets' for folks who want mixed traffic)" as the solution, so if that's the case, is the fix as simple as re-introducing a flag like the one above?

cancecen avatar Sep 05 '23 19:09 cancecen

Given no one else has complained (so I think this is a pretty narrow use case), would you be OK with me going with (3), a hidden runtime option y'all could set?

alyssawilk avatar Sep 06 '23 17:09 alyssawilk

Yes, thank you that works. I assume we can rely on that hidden runtime option for the foreseeable future, because we don't have plans on removing the TCP healthchecks or doing a TLS handshake in our healthchecks as of now.

cancecen avatar Sep 06 '23 17:09 cancecen

I was asked to review the PR, and just read this issue and have a few thoughts.

  1. Yes, this is a corner case
  2. I wouldn't have expected that setting http options on the cluster would impact non-http uses of the cluster. I believe in most cases, the same cluster can be used for both http and tcp connections, and the http options have only ever affected the http connections. So it's surprising that setting an http option here (auto_config) disallows another option that is only used by non-http connections (tcp healthchecks)
  3. Not having this validation (i.e. allowing this transport socket match in all cases) is a worse experience overall for most use cases

I think ideally I'd rather see a boolean option in the http options (right near where auto_config is set) for this toggle. Putting it nearby means users are more likely to notice it in the docs, so someone else who runs into this error is more likely to find the solution.

But that being said, this is a corner case that few people are likely to hit, so I'd be ok just merging the runtime-key fix as-is.

ggreenway avatar Sep 07 '23 20:09 ggreenway

The problem with this is we can't clearly designate some matchers as for TCP and others for HTTP. Largely I think if you want to use ALPN, you should dup your cluster for TCP traffic rather than try to have mixed HTTP relying on a feature that may or may not exist if you have a bug in your matchers. I don't think there's a strict advantage to reusing the cluster for connections that won't be pooled together, and there's a definite opportunity for foot-guns, which is why I opted for the undocumented option. I think it should be discouraged and the default production set-up should be: don't mix ALPN HTTP traffic with raw TCP, and we keep this around hidden to support legacy use cases.
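A rough sketch of the duplicated-cluster approach described here, assuming illustrative names (`backend_http`, `backend_tcp`); endpoints are omitted and any healthcheck wiring between the two clusters is out of scope of the sketch:

```yaml
# Sketch: separate clusters for ALPN HTTP traffic and raw TCP, instead
# of mixing both behind transport_socket_matches on one cluster.
# Cluster names are illustrative, not from the original configuration.
clusters:
- name: backend_http            # all real traffic: TLS with ALPN (h2 or http/1.1)
  type: STRICT_DNS
  connect_timeout: 1s
  typed_extension_protocol_options:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
      auto_config: {}
  transport_socket:
    name: envoy.transport_sockets.tls
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
      common_tls_context:
        alpn_protocols: ["h2", "http/1.1"]
- name: backend_tcp             # raw TCP only: no ALPN configured, so no validation conflict
  type: STRICT_DNS
  connect_timeout: 1s
  transport_socket:
    name: envoy.transport_sockets.raw_buffer
```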

alyssawilk avatar Sep 11 '23 12:09 alyssawilk