rapid icon indicating copy to clipboard operation
rapid copied to clipboard

Partition detection

Open manuelbernhardt opened this issue 4 years ago • 3 comments

Summarizing the discussion I've had with @lalithsuresh per e-mail on this topic so far:

In the event of a 2-way network partition, consensus can be reached on the majority but not on the minority side. Note that in order to do so reliable, the reinforcement mechanism described in the last paragraph of section 4.2 of the paper needs to be implemented.

On the minority side, no consensus can be reached. I currently have two ideas that would allow to make progress (where progress means informing the application that we are out).

Consensus timeout

After fast-paxos times out and paxos times out, conclude that we can't reach consensus fast enough and leave. Note that I am not sure yet that even with the reinforcement mechanism as described in the paper there'd be a way to reach a cut detection under those circumstances.

Extend the reinforcement mechanism to include partition detection

I think it may be possible to evolve the reinforcement mechanisms so that:

  • the timeout starts to count down only once there have not been any new alerts received as a whole (and not on a per-subject basis)
  • all reinforced-REMOVEs are broadcasted at once. This broadcast retries alerts and/or has a higher timeout (i.e. we really would like to get a good sense of who else is there with us)
  • when it broadcasts the reinforced-REMOVEs, it keeps a count of how many responses it gets / how many requests made it through
  • if a majority of nodes can be reached, proceed as usual
  • if only a minority can be reached, infer / detect that there has been a partition and that we are in the minority side

I think there could be a way for the system to detect that there has been a partition and go into "partition consensus" mode. For example I think that by counting the reinforced-REMOVEs it would be possible to confirm that there's only as many nodes left in the same side of the partition as inferred by the broadcast count. I.e. if I broadcast to m nodes and I also receive at most m reinforced-REMOVE alerts then that should give me a high confidence that we're the only ones left on that side. If I receive more than m, then a two-way partition cannot be assumed. (note that I think that reinforced-REMOVE needs to become its own alert type to be able to distinguish late REMOVEs from the reinforced ones).

Once a minority partition has been detected the minority side can proceed to shut down.

I have not yet thought this through for 3-way partitions.

manuelbernhardt avatar May 08 '20 09:05 manuelbernhardt

I'm leaning towards the core idea behind option 1 (the node leaving), but in a different way: I'd much rather the failure detector implementation have its own custom logic for a node to leave based on some heuristics. Mainly because two-way partition detection seems to be a very specific kind of failure to detect -- so it shouldn't be part of the core protocol.

The model for Rapid only assumes that messages between correct processes will eventually be delivered. Nodes in a minority partition are by definition not correct processes. We therefore can't assume the minority partition will be able to send or receive any messages at all (alerts, Fast Paxos, fallback Paxos messages or any additional messages at all). This makes it so that we can't rely on a timeout around the consensus steps failing (what if the node never even gets to the consensus step?), nor expect messages that minority partition nodes can send/receive to determine which partition they are in (because every node in the minority partition could be isolated into its own island).

The extensions to the reinforcement mechanism (like observing the number of reinforced alerts) -- IIUC -- has the same drawback: it still requires that the minority nodes are are able to receive broadcasts or other messages.

Back to leaving based on heuristics: I think we should extend the failure detector API with the ability to define custom "should I Leave" logic. As you say, this logic should be a singleton for each node (and not in separate edge failure detectors). Rapid users can experiment with any heuristic they want in there. A simple starting point for any implementation would be to do something like "I haven't heard from my observers nor subjects in a while, so maybe i'll probe some other nodes and if the situation looks bad** -- i'll voluntarily leave" -- ** for some application-specific idea of bad.

lalithsuresh avatar May 08 '20 16:05 lalithsuresh

Yes, having that mechanism be pluggable definitely is something I had in mind as well. The reason I want to involve the multi-node cut detector is because I like the fact that it acts as a qualitative filter to instability. As I see it, using the state of the multi-node cut detector as a trigger to the "should I leave" logic would be a superior approach to simply relying on the single-node view of observers and subjects which may be subject to glitches at times.

From there on, the "should I leave" logic could be provided with:

  • which observers have pinged the node
  • which subjects have been responsive

in the last 2*T (where T is a failure-detector interval).

Based on that input, the custom logic could apply its own filter to decide whether to probe further.

This could be triggered as soon as a subject has been in unstable reporting mode for some timeout (for example, the same timeout as for starting reinforcement).

manuelbernhardt avatar May 11 '20 08:05 manuelbernhardt

Sounds good. One way to do this is to make the "should I leave logic" get a notification when an alert modifies the cut-detector (the notification can be parameterized by the alert message itself, the configuration ID and a read-only view of the cut-detector).

lalithsuresh avatar May 11 '20 15:05 lalithsuresh