dskit icon indicating copy to clipboard operation
dskit copied to clipboard

Proposal: make memberlist TCP Transport non blocking

Open pracucci opened this issue 2 years ago • 10 comments

This proposal is based on a theory I have, but haven't proved if it can significantly improve the messages propagation yet.

The dskit's memberlist client propagate changes (e.g. hash ring changes) using broadcast messages: https://github.com/grafana/dskit/blob/ead3f9308bb7b413ce997182dd4d7c6e038bc68f/kv/memberlist/memberlist_client.go#L1042-L1058

Messages are push to a queue, along with messages received from other nodes and enqueue for re-broadcasting, and then a single goroutine periodically pulls messages from the queue and send them to some other (random) nodes: https://github.com/grafana/memberlist/blob/09ffed8adbbed3ee3ace7de977c5e1cade90da87/state.go#L604-L626

The gossip() function, responsible to pick up messages from the queue and send them to some random nodes, is called at a regular interval, configued by -memberlist.gossip-interval (200ms by default): https://github.com/grafana/memberlist/blob/09ffed8adbbed3ee3ace7de977c5e1cade90da87/state.go#L131-L135

For each gossip interval, the gossip() function picks up messages up to UDPBufferSize, which is hardcoded to 10MB in dskit (because dskit uses a custom transport based on TCP): https://github.com/grafana/dskit/blob/ead3f9308bb7b413ce997182dd4d7c6e038bc68f/kv/memberlist/memberlist_client.go#L420-L422

We assume that we broadcast new messages and re-broadcast received message every -memberlist.gossip-interval=200ms but that's not guaranteed: the gossip() function make take significantly longer, because the dskit's custom TCP transport blocks on each call to rawSendMsgPacket(). In particular, the call to rawSendMsgPacket() blocks for the whole duration it takes to the TCP transport to create a new TCP connection, write the data and close it: https://github.com/grafana/dskit/blob/ead3f9308bb7b413ce997182dd4d7c6e038bc68f/kv/memberlist/tcp_transport.go#L438-L439

For example, if it can't reach a node and the connection times out the gossiping is blocked for 5 seconds (when running with the default config -memberlist.packet-dial-timeout=5s).

Proposal

The gossip() in memberlist library assumes an UDP transport, where sending packet is a "fire and forget", and basically doesn't block for a long time. On the other side, the dskit's TCP transport blocks even if it swallows any error: https://github.com/grafana/dskit/blob/ead3f9308bb7b413ce997182dd4d7c6e038bc68f/kv/memberlist/tcp_transport.go#L430-L432

In pratice, it's currently the worst of the two worlds: with TCP we can detect more delivery errors at the cost of blocking the call, but we never return the error.

I propose to make TCP transport's WriteTo() non-blocking. It preserves the current "fire and forget" behaviour without blocking the gossiping of messages to other nodes if a node is unresponsive.

pracucci avatar Jul 18 '22 12:07 pracucci