
Error handling and recovery for DC failures

Open • peterzeller opened this issue 4 years ago • 6 comments

When a DC temporarily fails for about 1 minute, the other DCs also fail to communicate with each other. After the failed DC has restarted, the DCs need to be joined again manually.

This was reported by Matthew on Slack; the full report is below. I have not yet tried to reproduce it on my machine.

Hi, we have antidote running on three machines, all directly connected to each other via a dedicated interface (so 2 interfaces per machine, with 3 total wires). It behaves correctly in the absence of failures; however, there is an issue when we test bringing down interfaces between the nodes.

With 3 nodes in a cluster, bringing down the interfaces of Node 1 one at a time causes an asymmetric connection between the other two connected nodes. The behavior we are witnessing is that updates are not replicated in both directions: Node 2 can send updates that are replicated to Node 3, but not vice versa. Neither Node 2 nor Node 3 has had its interfaces touched, and their dedicated link remains healthy. If we take down the interfaces on Node 1 all at once, the cluster stays healthy.

We are thinking that this could be because, between when Node 1 loses connection to Node 2 and when Node 1 loses connection to Node 3, Node 1 is reporting Node 2's “failure” to Node 3, causing Nodes 1 and 3 to believe they are a majority partition. Then, when Node 1 loses connection to Node 3, Node 3 believes it is alone. What is surprising to us is that Node 2's updates continue to reach Node 3 in this scenario, but not the reverse. Have we hit upon the correct diagnosis for our strange behavior? If we have, do you folks know how we can resolve this network state?

We're bringing down the connection within a single datacenter and restoring it after about 1 minute (just long enough for timeouts to fire). We do need to manually resubscribe to the restored node when it returns if we keep it down long enough, but that's not what worries us. What worries us is that two unrelated nodes experience a communication interruption after we take one node down. All nodes are in the same DC. We had assumed that the only possible effect of restricting communication to a single node in the DC is that the remaining healthy members would unsubscribe from that node. We did not anticipate that this could cause healthy nodes to unsubscribe from each other.
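For context, the manual re-subscribe step after the failed DC comes back looks roughly like the sketch below. This is only an illustration: it assumes the DCs exchange connection descriptors via the antidote_dc_manager module (get_connection_descriptor/0, subscribe_updates_from/1), and the node names are placeholders rather than the actual deployment.

```erlang
%% Illustrative sketch only (assumes the antidote_dc_manager API;
%% node names are placeholders for the actual deployment).
Restored = 'antidote@node1',
Healthy  = ['antidote@node2', 'antidote@node3'],

%% Fetch a fresh connection descriptor from the restored DC ...
{ok, RestoredDesc} = rpc:call(Restored, antidote_dc_manager,
                              get_connection_descriptor, []),

%% ... and have the healthy DCs subscribe to its updates again.
[rpc:call(N, antidote_dc_manager, subscribe_updates_from, [[RestoredDesc]])
 || N <- Healthy],

%% The restored DC likely also needs to re-subscribe to the healthy DCs.
HealthyDescs = [element(2, rpc:call(N, antidote_dc_manager,
                                    get_connection_descriptor, []))
                || N <- Healthy],
rpc:call(Restored, antidote_dc_manager, subscribe_updates_from,
         [HealthyDescs]).
```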

peterzeller commented on May 27 '20, 10:05

@shamouda Since you've been working on adding redundancy and fault tolerance to DCs, maybe you can comment on whether you have also observed this kind of problem and whether your work will fix it.

peterzeller commented on May 27 '20, 10:05

Peter,

I read Matthew’s description differently. He talks about replicas, which makes me think that each node is emulating a full DC. The three nodes represent three DCs. Doesn’t this look a lot like the known issue with the lack of anti-entropy?

		Marc

On 27 May 2020 at 12:02, Peter Zeller [email protected] wrote:

When a node within a DC temporarily fails for about 1 minute, the DC does not recover automatically after the node comes back up.

The cluster needs to be joined again manually. One would expect that with one node down the DC would still be able to serve requests for keys not stored on the failed node. Apparently that is not the case. This was reported by Matthew on Slack (full report quoted above). I have not yet tried to reproduce it on my machine.


marc-shapiro commented on May 27 '20, 11:05

Actually, I had gotten confused myself about our setup. Marc is right: we have each node representing its own DC.

mpmilano commented on May 27 '20, 22:05

We are using the native (Erlang) API to connect each DC via RPC, not one of the clients that Peter asked about on Slack that use createDc or connectDcs.
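For reference, that style of setup typically looks something like the sketch below. This is a rough illustration under the assumption that the RPC calls target the antidote_dc_manager module (create_dc/1, get_connection_descriptor/0, subscribe_updates_from/1); the node names are made up.

```erlang
%% Rough sketch of connecting three single-node DCs over RPC
%% (assumes the antidote_dc_manager API; node names are made up).
Nodes = ['antidote@dc1', 'antidote@dc2', 'antidote@dc3'],

%% Turn each node into its own DC.
[rpc:call(N, antidote_dc_manager, create_dc, [[N]]) || N <- Nodes],

%% Collect a connection descriptor from every DC ...
NodeDescs = [{N, element(2, rpc:call(N, antidote_dc_manager,
                                     get_connection_descriptor, []))}
             || N <- Nodes],

%% ... and subscribe each DC to updates from all the others.
[rpc:call(N, antidote_dc_manager, subscribe_updates_from,
          [[D || {M, D} <- NodeDescs, M =/= N]])
 || {N, _} <- NodeDescs].
```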

Mrhea commented on May 27 '20, 22:05

Thanks for the clarification.

Maybe you can check whether the problem is fixed with the changes from pull request https://github.com/AntidoteDB/antidote/pull/421.

You can either compile the branch yourself or use the Docker image peterzel/antidote:interdc_log.

peterzeller commented on May 28 '20, 09:05

Unfortunately those changes did not change the behavior we are seeing.

Mrhea commented on Jun 02 '20, 19:06