
Thread Keep-Alive configuration for gRPC transport

Status: Open · Haijuncao opened this issue 7 years ago · 7 comments

Problem: I am currently experimenting with yarpc. We use the gRPC transport and depend on its internal connection management implementation.

We implement our own custom chooser (our cluster is stateful/sharded). What is the recommended way to release a peer? Ideally, the chooser should release it when the request finishes (in the onFinish callback), since that peer may not be picked again. However, the gRPC connection management implementation eagerly closes the connection the moment the refcount drops to 0, forcing the chooser to keep holding the peer after the request finishes.

It is entirely possible my understanding is wrong (as I am new to yarpc), so feel free to correct me if this is not the behavior.

Proposal: Instead of forcing the chooser to keep holding the peer beyond the end of the request, could the transport's (gRPC) connection management expose a policy such as an idle timeout, and only close the connection once that timeout has elapsed since the refcount dropped to 0? This way the chooser can release the peer as soon as the request finishes and is not burdened with tracking connection state. If another request arrives right away, the same connection is reused; if the request pattern is spiky (or, in our case, requests simply move to another part of our stateful/sharded cluster) and no request targets the peer within the timeout, connection management can close the connection.
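
For concreteness, here is a rough sketch of the per-request pattern I have in mind, assuming YARPC's api/peer interfaces; the chooser type and the hardcoded address are made up, and lifecycle methods, error handling, and the real shard routing are elided:

```go
import (
	"context"

	"go.uber.org/yarpc/api/peer"
	"go.uber.org/yarpc/api/transport"
	"go.uber.org/yarpc/peer/hostport"
)

// shardChooser is a stand-in for our custom chooser.
type shardChooser struct {
	trans peer.Transport  // e.g. the gRPC transport
	sub   peer.Subscriber // receives peer status notifications
}

func (c *shardChooser) Choose(ctx context.Context, req *transport.Request) (peer.Peer, func(error), error) {
	// Resolve the request's shard/routing key to a transport identity;
	// hardcoded here for illustration.
	id := hostport.PeerIdentifier("127.0.0.1:8080")

	p, err := c.trans.RetainPeer(id, c.sub)
	if err != nil {
		return nil, nil, err
	}
	p.StartRequest()

	onFinish := func(error) {
		p.EndRequest()
		// Ideally the chooser can release here and forget about the peer.
		// Today this drops the refcount to zero and the gRPC transport
		// closes the connection right away, which is the problem above.
		_ = c.trans.ReleasePeer(id, c.sub)
	}
	return p, onFinish, nil
}
```

With an idle timeout in the transport, this chooser would not need to change at all: the connection would simply linger for the timeout after the final release.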

Beyond: an idle timeout is just one aspect of connection management policy; I expect there are others, so it would be great if the design (interface) took that into account.

Alternatives:

  1. Build the idle timeout into the chooser (which is fully customizable). However, from a separation-of-concerns perspective, the chooser's role is to resolve a request routing key to a transport identity; a policy such as a connection idle timeout does not seem to belong there.
  2. Keep having the chooser hold the peer, but expose a transport-level keep-alive parameter (a gRPC connection option); the chooser would release the peer once it receives a push notification that the connection is closed. Frankly, this is a very roundabout way to handle an idle timeout, and not ideal.

Haijuncao · Jan 09 '18

I can speak to our design philosophy.

The HTTP transport implements keep-alive as you describe. I suspect that gRPC’s HTTP/2 layer does as well, but we have not exposed configuration for it.

Connection management is orthogonal to peer management. It’s the transport’s responsibility to hint at what peers are "available", and available can mean "connected" or "previously connected with no reason to suspect it can’t reconnect". The peer list may disregard availability notifications and send traffic to "unavailable" peers. The transport is then responsible for attempting to carry the request with a new connection.

The HTTP and gRPC transports have a connection management loop in the background, with a healthy and unhealthy mode. In the healthy mode, they are idle. In the unhealthy mode, they periodically attempt to establish a TCP connection. When they succeed, they move back to the healthy mode and discard the probe TCP connection. This is all done in YARPC, in the wrapper around the underlying transport, to avoid (but not entirely eliminate the possibility of) sending requests to unavailable peers, all in pursuit of more nines.
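
In sketch form, the unhealthy mode looks roughly like this; treat it as an illustration with invented names, not the actual yarpc-go code:

```go
import (
	"net"
	"time"
)

// probeLoop illustrates the unhealthy mode described above: keep attempting
// a bare TCP dial, and once one succeeds, discard the probe connection and
// signal that the peer is available again. The real loop uses the
// configured exponential backoff and ties into peer status notifications.
func probeLoop(addr string, backoff time.Duration, available chan<- struct{}, stop <-chan struct{}) {
	for {
		conn, err := net.DialTimeout("tcp", addr, time.Second)
		if err == nil {
			conn.Close() // the probe connection is thrown away
			available <- struct{}{}
			return
		}
		select {
		case <-stop:
			return
		case <-time.After(backoff):
		}
	}
}
```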

For HTTP, connections are closed if they idle longer than the Keep-Alive config. We may need to thread keep-alive configuration through HTTP/2. I’m not sure what the default behavior is.

So I think concretely, we need to pursue your second proposal, and thread Keep-Alive through YARPC’s gRPC transport configuration down to the HTTP/2 implementation.

cc @peter-edge

kriskowal · Jan 18 '18

Kris, thanks for the response.

Can you define "peer management"? To me, the chooser does service resolution; is that the same as "peer management"? My assumption is yes.

I am fine with going with alternative 2. However, does it mean the chooser still needs to hold on to the peer instance beyond the end of the request (due to the refcount behavior)? In that case, the chooser (peer management, or service resolution) is essentially coupled with connection management.

Just a thought: if the chooser interface merely returned a peer identity (e.g. a hostport), there would be no chance of coupling.
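
Something like this hypothetical interface, assuming peer.Identifier is the existing identity type in api/peer; the rest is invented:

```go
import (
	"context"

	"go.uber.org/yarpc/api/peer"
	"go.uber.org/yarpc/api/transport"
)

// IdentityChooser is a hypothetical alternative to the current chooser
// contract: it only resolves a request to a peer identity. Retaining,
// releasing, and connection lifetime would stay entirely inside the
// transport's connection management.
type IdentityChooser interface {
	Choose(ctx context.Context, req *transport.Request) (peer.Identifier, error)
}
```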

Haijuncao · Jan 20 '18

Any update? Any concerns with exposing keep-alive via transport config?

Once keep-alive is exposed, maybe it no longer makes sense to immediately close the connection when the refcount drops to 0?

Haijuncao · Feb 23 '18

Any update? May I work on this issue?

zhenyu-sha · Oct 21 '19

Exposing keepAlive on the grpc OutboundConfig, mirroring the property that exists on the HTTP OutboundConfig, would be a welcome change! Do you have time to propose a change?
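
For example, something along these lines; treat the key names and placement as a sketch to be checked against the actual config structs, and note that the grpc keepAlive key does not exist yet:

```yaml
outbounds:
  myservice:
    http:
      url: http://127.0.0.1:8080
      keepAlive: 30s   # exists today on the HTTP outbound config
  othersvc:
    grpc:
      address: 127.0.0.1:8081
      keepAlive: 30s   # proposed: mirror the HTTP property on the gRPC outbound
```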

It should be sufficient, in my opinion, to thread this through to the underlying connection management logic. There are a couple of levels of abstraction in play here, so this can be confusing. The gRPC notion of keepalive refers to connections. YARPC can retain a peer from gRPC, but gRPC decides whether to keep one or possibly more connections open to the peer. gRPC manages these connections and informs YARPC when the connection status changes.

But, you are right that YARPC “closes” its gRPC “client connection” (possibly multiple real connections beneath this) when no further peer lists refer to the peer.

YARPC’s peer list API is designed for load balancing, for which it’s generally desirable for every client to be fully connected. For sharding, as in the hashring peer chooser, we still try to keep the client fully connected.

It is entirely reasonable to drive RetainPeer and ReleasePeer on an individual request basis. If the desired outcome is for connections to remain alive between Choose/onFinish cycles, you have a couple of options: either implement your own "keep alive" in the peer list, essentially postponing ReleasePeer after onFinish by some duration, or implement another kind of peer keep-alive in the gRPC peer monitoring loop. I would favor the former, since the latter already has more than enough complications.
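
A minimal sketch of the former, assuming the peer list already calls RetainPeer in Choose; delayRelease is an invented helper, not YARPC API, and would be called from onFinish in place of an immediate ReleasePeer:

```go
import (
	"time"

	"go.uber.org/yarpc/api/peer"
)

// delayRelease postpones the ReleasePeer that would otherwise happen in
// onFinish. Every earlier RetainPeer is still matched by exactly one
// ReleasePeer, just deferred: if another request retains the same peer
// within the idle window, the refcount never reaches zero and the gRPC
// client connection stays open.
func delayRelease(t peer.Transport, id peer.Identifier, sub peer.Subscriber, idle time.Duration) {
	time.AfterFunc(idle, func() {
		_ = t.ReleasePeer(id, sub)
	})
}
```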

kriskowal · Oct 21 '19

https://godoc.org/google.golang.org/grpc/keepalive
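
For reference, plain grpc-go usage of that package looks like this (values are arbitrary); these are the knobs a YARPC keepAlive option would ultimately be setting:

```go
import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// dial shows the client-side keepalive options from the linked package,
// outside of YARPC.
func dial(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithInsecure(), // illustration only; use real credentials
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping if the connection is idle this long
			Timeout:             10 * time.Second, // close if the ping is not acked in time
			PermitWithoutStream: true,             // ping even with no active RPCs
		}),
	)
}
```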

kriskowal · Oct 21 '19

@kriskowal I want to enable the Inbound server to configure keep-alive like https://github.com/grpc/grpc-go/blob/master/examples/features/keepalive/server/main.go#L43. I plan to add a few fields to TransportConfig and pass them all the way down to newInbound:

```go
// TransportConfig configures a gRPC Transport. This is shared
// between all gRPC inbounds and outbounds of a Dispatcher.
//
// transports:
//   grpc:
//     serverMaxConnectionIdle: 15s
//     serverMaxConnectionAge: 30s
//     serverMaxConnectionAgeGrace: 5s
//     serverKeepaliveTime: 5s
//     serverKeepaliveTimeout: 1s
//     backoff:
//       exponential:
//         first: 10ms
//         max: 30s
//
// All parameters of TransportConfig are optional. This section
// may be omitted in the transports section.
type TransportConfig struct {
	ServerMaxRecvMsgSize        int                 `config:"serverMaxRecvMsgSize"`
	ServerMaxSendMsgSize        int                 `config:"serverMaxSendMsgSize"`
	ClientMaxRecvMsgSize        int                 `config:"clientMaxRecvMsgSize"`
	ClientMaxSendMsgSize        int                 `config:"clientMaxSendMsgSize"`
	ServerMaxConnectionIdle     time.Duration       `config:"serverMaxConnectionIdle"`
	ServerMaxConnectionAge      time.Duration       `config:"serverMaxConnectionAge"`
	ServerMaxConnectionAgeGrace time.Duration       `config:"serverMaxConnectionAgeGrace"`
	ServerKeepaliveTime         time.Duration       `config:"serverKeepaliveTime"`
	ServerKeepaliveTimeout      time.Duration       `config:"serverKeepaliveTimeout"`
	Backoff                     yarpcconfig.Backoff `config:"backoff"`
}
```

Does this make sense? If so, I am working on it. By the way, should we allow overriding the transport config on an individual Inbound or Outbound?
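
Roughly, those new fields would translate into grpc-go server options when YARPC builds the inbound's grpc.Server. The grpc-go calls below are real API; the wiring of TransportConfig into newInbound is the part I would add, so treat this as a sketch:

```go
import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// serverKeepaliveOptions maps the proposed TransportConfig fields (see the
// struct above) onto grpc-go's server-side keepalive options.
func serverKeepaliveOptions(c TransportConfig) []grpc.ServerOption {
	return []grpc.ServerOption{
		grpc.KeepaliveParams(keepalive.ServerParameters{
			MaxConnectionIdle:     c.ServerMaxConnectionIdle,
			MaxConnectionAge:      c.ServerMaxConnectionAge,
			MaxConnectionAgeGrace: c.ServerMaxConnectionAgeGrace,
			Time:                  c.ServerKeepaliveTime,
			Timeout:               c.ServerKeepaliveTimeout,
		}),
	}
}
```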

zhenyu-sha · Oct 22 '19