envoy-mobile icon indicating copy to clipboard operation
envoy-mobile copied to clipboard

proposal: remove the h2 raw feature

Open Augustyniak opened this issue 2 years ago • 4 comments

Proposal: Remove the addH2RawDomains feature from Envoy engine builders.

Envoy engine builders allow its users to specify the list of h2 raw domains. For requests that go to domains defined as 'h2 raw domains' Envoy Mobile skips the ALPN negotiation and goes straight to enforcing h2 connections.

https://github.com/envoyproxy/envoy-mobile/blob/3b761c6ac51e04d6b645cb8de85eb5abd51a86bc/library/swift/EngineBuilder.swift#L290

https://github.com/envoyproxy/envoy-mobile/blob/3b761c6ac51e04d6b645cb8de85eb5abd51a86bc/library/swift/EngineBuilder.swift#L290

The feature was introduced originally back when Lyft was trying to address performance issues related to a potentially unlimited number of connections being created in response to application layer starting an unlimited number of requests to a new domain in a short period of time. Enforcing HTTP/2 for domains that we had a full control over allowed us to ensure that the number of connections was capped and kept at a reasonably low number. Soon after we added support for h2 raw domains @alyssawilk introduced an option for us to specify the maximum number of connections on per host basis. Envoy Mobile quickly adopted this feature as most apps either do not use HTTP/2 domains exclusively or do not have control over all domains their applications talks to.

TL;DR If an experiment - mentioned below in cons section - confirms that h2 raw can be disabled at Lyft without degrading any of our business metrics we will disable it. Since there are no other companies that currently depend on the feature or are planning to use it in the near future we are going to remove the feature from the engine to not have to support it anymore (it's not free) and to make it easier for ourselves to start using stats coming from Envoy's virtual clusters (more about this in pros section).

Detailed discussion of the pros and cons of the removal of h2 raw domains feature Pros:

  • Simpler code, less code is simpler code. The maintenance of EM's public API becomes cheaper.
  • Simpler setup of Envoy's virtual clusters. In fact, the https://github.com/envoyproxy/envoy-mobile/pull/2088 that introduced the h2 raw domains feature broken (disabled) the emission of stats coming from Envoy Mobile virtual clusters. That's because it renamed the virtual host from api to another name and increase the number of virtual clusters from 1 to 2 without updating the list of enabled EM stats in https://github.com/envoyproxy/envoy-mobile/blob/main/library/common/config/config.cc#L486-L487.
  • Simpler set up of stats coming from Envoy virtual clusters. Lyft is trying to benefit from having access to Envoy's virtual clusters and the complexity in the setup of virtual clusters that was added as the result of h2 raw domain work makes it harder for us to accomplish out goals. That's because stat names coming from virtual clusters include the name of the virtual hosts and we have 2 virtual hosts (instead of 1) to support h2 raw domains. That means that a person that wants to display stats for a given request needs to know whether this request hits an h2 raw domain or not in order to find out what's the full name of the stats that they are looking for. Envoy docs on the names of stats coming from virtual clusters may be found at https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/router_filter#config-http-filters-router-vcluster-stats.
  • a simpler story for Envoy Mobile users - there would be no (optional) need to configure any domains with the use of addH2RawDomains and instead all requests would go through a well defined ALPN codepath which is an industry standard.

Cons:

  • Potentially reduced performance when the first request to a new domain is started and Envoy Mobile needs to perform an ALPN negotiation to figure out whether to use HTTP/1 or HTTP/2. Over time this con will become almost non-existent as we will be able to cache the result of ALPN negations between app launches with the help of on a disk storage. Lyft is going to run an experiment which will verify whether the removal of h2 raw leads to a degradation in app's performance.

Augustyniak avatar Sep 07 '22 17:09 Augustyniak

let's do it! It AFIK shouldn't reduce performance at all, as ALPN doesn't add lantecy.

When we do this can we get rid of the base_http2 cluster entirely?

alyssawilk avatar Sep 07 '22 18:09 alyssawilk

It AFIK shouldn't reduce performance at all, as ALPN doesn't add lantecy.

That's good to know 👍 I guess one thing that I would be a little bit afraid of is that when you start a lot of (let's say 100) requests to a given host for the first time you will end up creating up to the maximum number of connections allowed on per-host basis (defaults to 7). With enforced HTTP/2 the number of connections would be equal to 1. So there is a - I think small - chance that the creation of addition connections will lead to perf degradation. What do you think?

When we do this can we get rid of the base_http2 cluster entirely?

It'a required although not a sufficient step for us to be able to remove base_h2 cluster I think. We would need to remove the functionality that allows Envoy users to enforce the use of HTTP/2 with the use of HTTP headers. Not sure how useful this functionality is in a word where we have a working ALPN - especially if we are planning to add caching for the results of ALPN negotiation.

That's implemented in here https://github.com/envoyproxy/envoy-mobile/blob/3b761c6ac51e04d6b645cb8de85eb5abd51a86bc/library/common/http/client.cc#L699

Augustyniak avatar Sep 07 '22 19:09 Augustyniak

oh, I "fixed" that a while back

https://www.envoyproxy.io/docs/envoy/latest/version_history/v1.23/v1.23.0#minor-behavior-changes

changed HTTP/2 connection pooling and the ALPN pool to remember the number of streams allowed by the endpoint and cap multiplexed streams for subsequent connections based on that. With that working, defaulted the ALPN pool to assume HTTP/2 will work, as it will only incur a latency hit once until the TLS handshake is complete, and then will cache that the effective stream limit is 1. This behavioral change can be revered by setting envoy.reloadable_features.allow_concurrency_for_alpn_pool to false.

So in the new world, we default to assuming HTTP/2. So in the absence of persisted ALPN I guess there's a latency hit for HTTP/1.1 where on cold start you assume HTTP/2 works, fetch one connection, and don't start the TCP/TLS handshakes for connections 2-7 until you get far enough along in the TLS handshake to know that HTTP/2 doesn't work.

alyssawilk avatar Sep 07 '22 19:09 alyssawilk

oh, I "fixed" that a while back

This is amazing. Thank you for a detailed explanation. At Lyft we care mostly about HTTP/2 only since most of our requests go to our domain and it does support HTTP/2 only. At this point I am not sure whether we even need to experiment with this on our side - to me it sounds that the removal of the h2 raw feature is the way to go and it should not negatively impact us at Lyft.

  • For HTTP/2 domains ALPN implementation should work at least as well as h2 raw implementation. No additional latency should be introduced. The number of connections should be the same in both cases too.
  • For HTTP/1 domains ALPN implementation works while h2 raw does not work.
  • ALPN implementation does not require any prior configuration on Envoy user side while h2 raw implementation does require any explicit list of h2 raw domains (which ends up being hardcoded on the client side which is not great).

Augustyniak avatar Sep 07 '22 20:09 Augustyniak