
Non-voting members for stand-by backups

benoitc opened this issue Dec 27 '18

For a service, I would like to have nodes that follow all events on the cluster without participating in the election. The main idea behind that is to create hot backups, or simply nodes acting as readers.

It seems that members of a cluster can already receive any events using the state_enter/2 callback, so I'm thinking of implementing it by adding a watcher role to a node when it is added to the cluster. The node would then receive all events from the replication but couldn't participate in the election.

The API would be similar to add_member, with the addition of three functions: add_watcher/{2,3}, remove_watcher/{2,3} and watchers/{1,2}.
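For illustration, the proposed surface could look something like the sketch below. Everything here is hypothetical (it mirrors the shape of add_member/{2,3}); server_id() stands in for Ra's {Name, Node} server identifier.

%% Hypothetical sketch of the proposed watcher API; not an implemented
%% Ra interface. server_id() stands in for Ra's {Name, Node} pair.
-type server_id() :: {atom(), node()}.

-spec add_watcher(ClusterMember :: server_id(), Watcher :: server_id()) ->
          ok | {error, term()}.
-spec add_watcher(ClusterMember :: server_id(), Watcher :: server_id(),
                  Timeout :: timeout()) -> ok | {error, term()}.
-spec remove_watcher(ClusterMember :: server_id(), Watcher :: server_id()) ->
          ok | {error, term()}.
-spec watchers(ClusterMember :: server_id()) -> [server_id()].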

Thoughts?

benoitc avatar Dec 27 '18 10:12 benoitc

So in other words such nodes would permanently be followers?

I remember our team discussed something similar at some point. I don’t think RabbitMQ has any immediate uses for it but this can be useful.

michaelklishin avatar Dec 27 '18 10:12 michaelklishin

So in other words such nodes would permanently be followers?

exactly ^^

I will try to come up with an implementation ASAP :)

benoitc avatar Dec 27 '18 11:12 benoitc

Yes, this is a feature we have discussed and would have uses for. :) There are a few things I would consider around a pure "watcher" member feature:

  • Only committed entries are replicated.
  • Replication may not be done by the leader (as the leader is a performance bottleneck)
  • They may not run the state machine logic (for pure backup processes this isn't necessary), or they may run a different kind of state machine that calculates a different state for some specific purpose.
  • They may not even be full ra servers - say for example you just need a sink process that batches entries and then replicates them over WAN.

kjnilsson avatar Dec 27 '18 11:12 kjnilsson

@kjnilsson I was thinking about a non-voting node that could potentially be promoted to a voter node. This has the following advantages:

  • reuse the current replication logic, but the non-voting follower would not declare itself as a candidate (by not triggering the election timeout)
  • because it uses the whole replication path, the latency will be the same
  • eventually, if needed, a node could be promoted to a voter later, for example when some followers are down, when we want to increase the number of replicated nodes, or when we want to switch some machines.

The change seems straightforward, but I may be missing some parts: it seems the election timeout is handled only in the ra_server_proc module, and the change may simply be done by modifying the state record and checking for it in maybe_set_election_timeout/2 and election_timeout_action/2.
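As a rough illustration of that idea, the change could look like the sketch below: a flag on the server state that suppresses arming the election timer. The non_voter field, the function shapes and the return values here are all hypothetical; the real ra_server_proc internals may differ.

%% Hypothetical sketch: a non-voting follower never arms the election
%% timer, so it can never time out into a candidate. Field and return
%% shapes are made up for illustration.
maybe_set_election_timeout(_Timeout, #state{non_voter = true} = State) ->
    {State, []};
maybe_set_election_timeout(Timeout, State) ->
    {State, [election_timeout_action(Timeout, State)]}.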

But your proposal sounds more performant (less noise on the network). Maybe promoting a node to a full ra server can also be done by simplifying the logic and doing it on demand? It would only require the whole log to be replicated. Like you're suggesting, they may not be full ra servers but only processes that subscribe to the log and let one of the followers actually send it, using the following process:

1. The subscription is sent to the leader.
2. The leader picks one of the followers and forwards the subscription to it.
3. The follower monitors the watcher.
4. The follower acknowledges the subscription.
5. The follower's ID is returned to the subscriber so it can monitor it.

Once ACKed, the follower will start to send the logs and the watcher enters a recovery phase. If the subscription fails (because the follower went down or maybe refused it), the watcher will get an error back and may retry the subscription, so as not to block the leader.
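A minimal sketch of that handshake from the watcher's side might look as follows, with hypothetical message names and error handling elided:

%% Hypothetical sketch of the subscription handshake described above.
subscribe(Leader, Watcher) ->
    %% steps 1-2: the subscription goes to the leader, which picks a
    %% follower and forwards it there
    case gen_statem:call(Leader, {subscribe_watcher, Watcher}) of
        {ok, FollowerId} ->
            %% step 5: the follower's id comes back so the watcher can
            %% monitor it
            MRef = erlang:monitor(process, FollowerId),
            {ok, FollowerId, MRef};
        {error, _} = Err ->
            %% the watcher may retry later without blocking the leader
            Err
    end.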

Thoughts?

Questions:

  • what happens when a new member is added? How does it get the latest state?

benoitc avatar Dec 27 '18 16:12 benoitc

Log reinstallation would have to happen from the leader since it's something that Raft covers and Ra already does.

I vote for a minimalistic but possibly pluggable interface which followers would use to ship log entries to watchers. For some scenarios, using large batches, extra compression and so on would be perfectly acceptable.
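To make that concrete, one possible shape for such a pluggable interface is an Erlang behaviour along these lines. This is entirely hypothetical; none of these names exist in Ra.

%% Hypothetical behaviour for shipping committed entries to a watcher.
-module(ra_watcher_sink).

-type log_entry() :: {Index :: non_neg_integer(),
                      Term :: non_neg_integer(),
                      Command :: term()}.

-callback init(Config :: map()) -> {ok, State :: term()}.
%% ship a batch of committed entries; a follower is free to batch and
%% compress before calling this
-callback ship(Entries :: [log_entry()], State :: term()) ->
          {ok, State :: term()}.
-callback terminate(Reason :: term(), State :: term()) -> ok.

A backup sink, a WAN replicator or a metrics tap would then just be different implementations of the same behaviour.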

michaelklishin avatar Dec 27 '18 18:12 michaelklishin

@michaelklishin what do you mean by a pluggable interface, something through which a follower would forward the messages like I described? How would you reinstall the logs then?

benoitc avatar Dec 27 '18 19:12 benoitc

@benoitc the interface would be used to transfer log chunks. Reinstallation perhaps should still happen from the leader (as it does right now). My point is that the way the leader does replication and the way followers would replicate to watchers do not have to be the same, as data safety expectations are often different for those cases.

michaelklishin avatar Dec 27 '18 19:12 michaelklishin

The Raft literature already has the concept of a non-voting member (§4.2.1 in the thesis), and I think that is pretty much what is being proposed here. It is a reasonable feature IMO, and one that could potentially be useful in other contexts, although I feel a more pluggable approach may be more flexible.

In order to avoid availability gaps, Raft introduces an additional phase before the configuration change, in which a new server joins the cluster as a non-voting member. The leader replicates log entries to it, but it is not yet counted towards majorities for voting or commitment purposes. Once the new server has caught up with the rest of the cluster, the reconfiguration can proceed as described above.

I'd like to see §4.2.1 in the thesis (excerpt above) implemented first, which can then be extended to support non-voting members that never become full members.

kjnilsson avatar Dec 27 '18 19:12 kjnilsson

I will take care of it next week as it's useful in my case. Implementing that part of the thesis should be easy.

benoitc avatar Dec 28 '18 12:12 benoitc

I've been thinking about a design for this that would include replication outside of the immediate distributed Erlang mesh (to a different data centre, for example). I don't want to make too many changes inside Ra to support this feature, so it should ideally be a separate, optional component.

The first step would be to implement non-voting members like this issue is proposing. This would be a change in Ra.

That would allow us to join arbitrary non-voting processes to the cluster and begin to receive AppendEntries RPCs and snapshots. To support replication over, say, a separate TCP connection or UDP, we'd make this non-voting member a proxy process that forwards traffic to another proxy process, which in turn forwards it to the actual, remote Ra non-voting process. This proxy process could be multiplexing for several Ra clusters or be per remote link.

One challenge of this proxy process is that it should ideally be on the same node as the leader, to avoid a potential network hop before the proxying. This does add some complexity, so it could be solved at a later point.

With this design, this could be a separate application providing the proxying support and some convenient API to drive it.
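A very rough sketch of the proxy process described above, assuming Ra traffic is relayed over a dedicated TCP link; the module name and message shapes are hypothetical:

%% Hypothetical sketch: relays the Ra traffic it receives to a peer
%% proxy in the remote DC, which feeds the actual non-voting member.
-module(ra_wan_proxy).
-behaviour(gen_server).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

init({RemoteHost, RemotePort}) ->
    {ok, Sock} = gen_tcp:connect(RemoteHost, RemotePort,
                                 [binary, {packet, 4}]),
    {ok, #{socket => Sock}}.

%% append entries / snapshot traffic arrives as ordinary Erlang
%% messages and is forwarded over the wire
handle_info(Rpc, #{socket := Sock} = State) ->
    ok = gen_tcp:send(Sock, term_to_binary(Rpc)),
    {noreply, State}.

handle_call(_Msg, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.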

Finally, we'd need a means of promoting a non-voting member to a full member or, in the case of remote replication, of turning the non-voting member into the seed of a new cluster (in a separate DC, for example).

kjnilsson avatar Jun 05 '19 11:06 kjnilsson

One challenge of this proxy process is that it should ideally be on the same node as the leader, to avoid a potential network hop before the proxying. This does add some complexity, so it could be solved at a later point.

What kind of complexity are we talking about?

benoitc avatar Jun 13 '19 08:06 benoitc

@benoitc node migration to where the leader currently is, for example.

michaelklishin avatar Jun 13 '19 11:06 michaelklishin

Spurred by rabbitmq/rabbitmq-server#7547, I'd be up for taking a stab at this in the pure §4.2.1 form. If that's okay, let's extract Raft "federation" and permanent watchers into separate issues?

illotum avatar Apr 19 '23 04:04 illotum

@illotum sure, no objections to a larger number of more focussed issues.

michaelklishin avatar Apr 19 '23 04:04 michaelklishin

It took me some time to catch up on the Ra internals and the Raft thesis. Being new to the library, @kjnilsson, is this what you had in mind?

The implementation will:

  1. Add a new command, $ra_standby, semantically similar to $ra_join; the leader will maintain a corresponding list of ServerIds. $ra_join will accept a ServerId from the standby list, and vice versa (a sketch follows this list);
  2. Make the leader measure replication rounds to standby nodes. $ra_join will error if the standby node does not fulfil the timeliness condition. It will not fail for non-standby (new) servers.
  3. Add a flag to the $ra_standby command and the state to optionally "auto-join" standby servers.
  4. Sprinkle all this with the API and corresponding statem changes.
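For point 1, here is a hypothetical sketch of how $ra_standby could sit alongside $ra_join in the leader's configuration-change path. The cluster map and return shapes are illustrative only; the real Ra internals differ.

%% Hypothetical sketch only, not Ra's actual config-change code.
apply_config_change({'$ra_join', ServerId}, #{cluster := C} = State) ->
    %% adds a new voting member, or promotes one from the standby list
    {State#{cluster := maps:put(ServerId, #{voter => true}, C)}, ok};
apply_config_change({'$ra_standby', ServerId}, #{cluster := C} = State) ->
    %% adds a new standby member, or demotes an existing voting member
    {State#{cluster := maps:put(ServerId, #{voter => false}, C)}, ok}.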

illotum avatar Apr 26 '23 00:04 illotum

I don't understand the idea in 2. Making the leader replicate to such standby (witness) nodes is fine, but they have to be excluded from elections. It's a good question whether they should be counted towards the quorum; if the idea is to use them for backups, then likely no.

But what would be the end goal of such a node? It will have a Raft log but… how does one recover it? By promoting such a node to a regular one?

Witness replicas are used differently in different systems. I don't think the Raft paper talks about them, does the thesis?

michaelklishin avatar Apr 26 '23 16:04 michaelklishin

The thesis is fairly clear on the high-level design:

  1. Nodes in catch-up mode will not count towards the majority for election or commit purposes. They will, of course, still respond to the AppendEntries RPC.
  2. The leader will split replication to those nodes into rounds and measure the time it takes to complete a round. If that time is within the election timeout (150-300 ms), the node is considered eligible to become a regular member; in short, the node is syncing data quickly enough (see the sketch after this list). This check is only a suggestion and can be replaced with any other eligibility criteria.
  3. The thesis describes automatic one-way promotion, and only briefly mentions that it's possible to use the catch-up state for other cases (generic witness nodes).
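A small sketch of the round check from point 2 (all names hypothetical): each round replicates up to the index the leader had when the round started, and promotion is allowed once a round completes within one election timeout.

%% Hypothetical sketch of the thesis' round-based catch-up check.
round_check(RoundStartMs, TargetIdx, MatchIdx, ElectionTimeoutMs) ->
    Elapsed = erlang:monotonic_time(millisecond) - RoundStartMs,
    case MatchIdx >= TargetIdx of
        true when Elapsed =< ElectionTimeoutMs ->
            %% the last round completed within one election timeout:
            %% the node is syncing fast enough to be promoted
            promote;
        true ->
            %% round finished, but too slowly: start another round
            %% targeting the leader's current last index
            {new_round, MatchIdx};
        false ->
            %% round still in progress
            wait
    end.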

So this is how it may look in Ra:

  • $ra_join is an existing command to add regular members. I'm proposing to overload it: in addition to its existing function, it will transition standby nodes to regular status. In both cases, the node becomes part of the Raft configuration.
  • $ra_standby is a new command to add standby members. Overloaded as well: in addition to adding new nodes, it will also demote regular cluster members to standby mode.

This gives us clear semantics and full (manual) control over the node status.

On top of those changes, I'm proposing to add an opt-in call to apply $ra_join automatically, whether upon the condition described in the thesis (point 2 above) or something else; we can make it pluggable. This seems to be the trickiest part, and is likely a separate command in the log.

illotum avatar Apr 26 '23 17:04 illotum

Reshuffling my proposal a bit, here it is depicted with automatic promotion à la the hashicorp/raft version, in two flavours.

   ┌───────────┐      ┌ ─ ─ ─ ─ ─ ┐                         
───▶  staging  │──────◈ follower                            
   └────▲──────┘      │           │                         
        │  │            candidate                           
   ┌───────▽───┐      │           │                         
───▷  standby  ◁──────   leader                             
   └───────────┘      └ ─ ─ ─ ─ ─ ┘                         
▶ $ra_join                     ▶ $ra_attempt_join           
▷ $ra_standby                  ▷ $ra_standby                
◈ $ra_participate, automatic   ◈ $ra_join, automatic                                    

The two flavours (see legend) imply different changes in statem and code, but if I understand it right both are compatible with existing logs.

State-wise, we need to hold a "voting state" for every ServerId. I originally wanted to just add another list of non-voters, but since we now have three states, a map is perhaps more appropriate.

-type voter_state() :: staging | standby | voter.

An optional ra_machine callback can specify the promotion condition function (defaulting to immediate?).
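A hypothetical shape for that callback; the status map keys are made up for illustration and this is not an existing ra_machine API:

-callback promotion_condition(PeerStatus :: #{match_index := non_neg_integer(),
                                              round_duration_ms := non_neg_integer()},
                              MacState :: term()) -> promote | hold.

The "immediate" default would then be an implementation that always returns promote.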

illotum avatar Apr 26 '23 20:04 illotum

@illotum thank you for elaborating. That sounds reasonable. Cool ASCII diagram, too. Did you hand-craft it, or is there a tool that can draw them given some inputs? :)

michaelklishin avatar Apr 27 '23 06:04 michaelklishin

Ah, my secret weapon, Monodraw!

illotum avatar Apr 27 '23 17:04 illotum

Hi,

We are also working on the same problem and wanted to share our current state. Maybe it will be useful.

We think there are some open questions:

  1. What should happen if the old Active Leader tries to rejoin the Now-Active set? In our opinion, both sides should stop and the operator should resolve the inconsistency by deleting the Old Active side. We are not sure, though, how to enforce this at the moment.
  2. We think it would be useful to introduce a new state to allow cluster changes. This would avoid unnecessarily triggering votes from the Now-Active set, though it should work without that.

Let us know if you have any suggestions or comments.

luos avatar Apr 28 '23 13:04 luos

That's great news! Your implementation and terminology are a departure from my mental model, so forgive me if I miss something.

  1. Shouldn't the old leader be on the old term by then?
  2. Agree about the new state, assuming you mean "I am passive" state or similar.

illotum avatar May 01 '23 21:05 illotum

We've tried to respond to all of your questions in https://github.com/rabbitmq/ra/pull/367. :-)

luos avatar May 02 '23 14:05 luos

This has been implemented in #375

kjnilsson avatar Mar 01 '24 17:03 kjnilsson