raft
Add read handler to uv_send to detect remote socket close faster
While using the library in a Docker Swarm or Kubernetes environment, we ran into a simple three-node cluster staying in leader election for about 20 minutes. The cause is the combination of how the raft library uses TCP streams and how Docker Swarm or Kubernetes handle the network sockets of a service that gets restarted.

The raft library currently uses two different TCP streams between each pair of nodes in the cluster. Each node sends data to the other node over the TCP stream it opened as a client, and reads data from the other node over the TCP stream it accepted as a server. As a consequence, all TCP streams in the raft library are used unidirectionally.

If a service container is restarted after a crash in a Docker Swarm or Kubernetes environment, the TCP endpoint of the old container is closed and a new service container with a new private IP is spawned, potentially on a different Docker node. Each remaining node detects the remote close on the TCP stream it accepted as a server on its next read. On the client stream used for sending, however, it can still add messages to the TCP output buffer without detecting the remote close. The remote close is only detected once the node's TCP output buffer fills up.

In a very simple test case with a three-node cluster, we kill the current leader A and, once a new leader B is elected, kill B as well. The remaining third node C is then stuck with remotely closed outgoing connections to the killed A and B, and A is stuck with a remotely closed outgoing connection to B. Traffic on these outgoing connections is then limited to leader election messages, and the cluster stays stuck in leader election for about 20 minutes.
For the current fix we simply added a uv_read_start on the outgoing TCP streams as well. Since we don't expect any incoming data on these streams, any invocation of the read callback is treated as a remote close, and we close the TCP stream and socket. The existing reconnect code handles the socket close and reconnects to the remote node.
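
The idea can be illustrated outside of libuv with plain POSIX sockets. The sketch below is not the code from this change. The names peer_has_closed and demo_detects_close are hypothetical, and a Unix-domain socketpair stands in for the TCP connection. It only shows the principle: a read on a send-only stream reports the remote close right away, and anything readable on such a stream is treated as a close. In libuv the same signal would arrive through the read callback registered with uv_read_start, with a negative nread such as UV_EOF.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: non-blocking probe of a send-only stream.
 * Returns 1 if the stream should be treated as closed by the peer,
 * 0 if nothing was readable (peer still considered connected). */
static int peer_has_closed(int fd)
{
    char buf[64];
    ssize_t n = recv(fd, buf, sizeof buf, MSG_DONTWAIT);

    if (n == 0) {
        return 1; /* EOF: the peer closed its end */
    }
    if (n > 0) {
        return 1; /* Unexpected data on a send-only stream: treat as close */
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR) {
        return 0; /* Nothing to read right now: connection looks alive */
    }
    return 1; /* Any other error (e.g. ECONNRESET): treat as close */
}

/* Hypothetical demo: a socketpair stands in for the outgoing connection.
 * Returns 0 if the close was detected as expected, -1 otherwise. */
static int demo_detects_close(void)
{
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    (void)rc;

    int before = peer_has_closed(sv[0]); /* peer still open */
    close(sv[1]);                        /* simulate the remote node dying */
    int after = peer_has_closed(sv[0]);  /* read side now sees EOF */
    close(sv[0]);

    return (before == 0 && after == 1) ? 0 : -1;
}
```

Calling demo_detects_close() returns 0 when the probe reports an open peer before the close and a closed peer afterwards. In an event loop the probe would not be polled manually; the read handler fires as soon as the socket becomes readable, which is what lets the reconnect logic kick in without waiting for the output buffer to fill up.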