envoy icon indicating copy to clipboard operation
envoy copied to clipboard

Improve client side TLS session reuse

Open lambdai opened this issue 1 year ago • 9 comments
trafficstars

We are running envoy as host proxy to enforce mTLS. We find that Envoy could be hammer by TLS handshake. We enabled max_session_keys in UpstreamTlsContext and see signicant CPU utilization drop in experiment.

However, in production the max_session_keys is not very pratical, because the session key pool is shared by all the endpoint in the cluster and there is no hint to select the available session.

For example, if the cluster has two endpoints, 10.0.0.1:80 and 10.0.0.2:80. When 10.0.0.1:80 is selected by the load balancer, ssl socket may pick the saved session that are created with 10.0.0.2:80. See https://github.com/envoyproxy/envoy/pull/4791

In the worse secanrio, if the two endpoints are selected in turn, the session reuse rate is 0. In realworld, the session reuse rate drop to 1/N where N is the endpoints in the cluster.

My proposal is to add hint when selecting the ssl session. A typical hint is the upstream endpoint address.

The hint can be stored in the transport socket options which is already attached to the SSL_SESSION, and replace the per ClientContext std::deque<bssl::UniquePtr<SSL_SESSION>> session_keys_ by std::unordered_map<SessionHint, std::deque<bssl::UniquePtr<SSL_SESSION>>> session_key_map_

  1. when a SSL_SESSION is created with SSL find the hint of the SSL and cache the session in the session_key_map_
  2. when a SSL is created via ClientContextImpl::newSsl(const Network::TransportSocketOptionsConstSharedPtr&) use the transport socket option to find hint, set session to the SSL if there is available a SSL_SESSION

The alternative method is spread the session among TLS servers, but this method not only expand the compromise size, it also introduce shared storage as a depedencies. The extra dependency does not work when Envoy is close to the lowest layer.

lambdai avatar Apr 12 '24 18:04 lambdai

Dumb question: why aren't session keys stored per socket? why are they shared between multiple SSL sockets?

kyessenov avatar Apr 12 '24 18:04 kyessenov

Dumb question: why aren't session keys stored per socket? why are they shared between multiple SSL sockets?

SSL socket stands for a TLS connection. Without reusing the stored SSL_SESSION, a new SSL_SESSION must be built via the expensive asymmetric cryptography.

The session_key refers to the result of the asymmetric cryptography method. At the client side, the session_key is shared by new SSL socket(s) and wish that the server could restore the session for the following ssl connections.

Do I answer your question?

lambdai avatar Apr 12 '24 18:04 lambdai

@lambdai Yes, I understand the session keys allow skipping the handshake. I'm just surprised the keys are not associated with an endpoint already, and asking why it was done this way.

kyessenov avatar Apr 12 '24 19:04 kyessenov

The most recent session reuse efforts was done by @PiotrSikora in https://github.com/envoyproxy/envoy/blame/main/source/common/tls/context_impl.cc#L774

Let me quickly add a PR to explain my proposed interface change

lambdai avatar Apr 12 '24 22:04 lambdai

I prefer TransportSocketOptions because we can populate tag value out of TLS transport socket for example, hash proxy protocol values in the tag (or not).

the tls socket config should config whether to override the external tag, ignore the external tag, or add endpoint address only if external tag is not provided

lambdai avatar Apr 12 '24 23:04 lambdai

@kyessenov Do you think https://github.com/envoyproxy/envoy/pull/33517 is a good start?

I have some early code proving that there is no memory access issue and much better session use.

lambdai avatar Apr 15 '24 17:04 lambdai

@lambdai Yes, I understand the session keys allow skipping the handshake. I'm just surprised the keys are not associated with an endpoint already, and asking why it was done this way.

This is similar (or the same) as a very old issue: #5073. The only reason this hasn't been done, is because nobody put in the time to do it.

ggreenway avatar Apr 18 '24 00:04 ggreenway

@ggreenway Thank you for the pointer and review! I add my comment to the interface ish PR. Please let me know if my comments make sense.

lambdai avatar Apr 18 '24 19:04 lambdai

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar May 18 '24 20:05 github-actions[bot]

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

github-actions[bot] avatar May 26 '24 00:05 github-actions[bot]