
Suitable for Geo-Replicated Clusters?

Open zicklag opened this issue 2 years ago • 9 comments

Hey there!

I'm looking for a raft solution as a component for a distributed application that I intend to be suitable for geo-replicated deployments.

I've noticed that the lol raft implementation supports something they call a Phi Accrual Failure Detector, which is supposed to help with geo-replicated deployments, and I was wondering whether openraft has a similar mechanism, or simply doesn't need one.

I would rather use openraft at this point because I want to implement storage and networking myself, but geo-replication is a high priority for me.

I appreciate any insight, thanks!

zicklag avatar Mar 06 '22 22:03 zicklag

👋 Thanks for opening this issue!

Get help or engage by:

  • /help : to print help messages.
  • /assignme : to assign this issue to you.

github-actions[bot] avatar Mar 06 '22 22:03 github-actions[bot]

@zicklag I'm contributing stuff to the openraft repo, since we want to use it in our project as well (and want it to be rock-solid and fast). It will also be used there in geo-replicated settings, but only in a later phase of the project. So by then at the latest, we'll have to invest in building any required features that aren't there yet 🙂. But don't hold your breath, that's at least one year away.

Maybe the current maintainers have a roadmap for the geo-replication support?

Or maybe you can contribute it yourself?

schreter avatar Mar 06 '22 22:03 schreter

Thanks for the reply @schreter. As long as the possibility of geo-replication isn't totally closed off for the future, I should be fine.

My project is still ultra-experimental and everything is in flux so it's fine if geo-replication isn't available immediately.

zicklag avatar Mar 06 '22 23:03 zicklag

A failure detector such as the Phi Accrual Failure Detector should not be hard to integrate into a consensus protocol. I'll dig a bit deeper into it to figure out how.

BTW, to me, geo-replication is quite a big topic. :-O.

It's about how to commit a command in one RTT (the traditional 2-RTT path, i.e., client → leader → follower → leader → client, is too expensive in geo-replication), e.g., with a multi-leader protocol; about the tradeoff between consistency and delay (such as CRDT or half-CRDT protocols); about how to reduce conflicts between commands (Raft assumes every two commands conflict with each other and are not interchangeable); and about how to synchronize clocks across a continent.
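As a rough, illustrative example (the numbers are mine, not from this thread): with a ~150 ms RTT between regions, the client → leader → follower → leader → client path needs two cross-region round trips per commit, i.e., roughly 300 ms, whereas a one-RTT commit protocol would bring that down to roughly 150 ms.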

drmingdrmer avatar Mar 07 '22 12:03 drmingdrmer

BTW, to me, geo-replication is quite a big topic.

That's why we are not going for it in the first shot :-).

Anyway, that's also why we chose the QUIC protocol (it's UDP-based and better suited to being configured differently for local and remote settings).

In fact, I wanted to look at the Network APIs later, since currently Raft simply re-sends, after a timeout, all the log that was not yet applied on the remote side. But, in fact, a timeout is not the right trigger. We need to detect the connection failure, i.e., the APIs should be changed to detect connection failure and re-send only if the connection is gone. In the simplest case, of course, this degenerates to the traditional Raft timeout, which can be used with stateless protocols (like sending the logs over a one-shot HTTP request). But for more elaborate use cases (and geo-replication is one of them), we should probably look in this direction. Of course, it's not plain Raft afterwards, but who cares :-).

schreter avatar Mar 07 '22 12:03 schreter

But, in fact, timeout is not the right thing.

Agree. The replication timeout error should be removed. A leader should always keep sending logs until an explicit network error occurs or the leadership is taken by another node.
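To illustrate the shape of that, here is a hypothetical sketch (with made-up types, not openraft's actual Network API): the leader keeps streaming entries, and only an explicit connection error, rather than a timer, causes a reconnect and resend; a higher term observed from the follower ends the stream.

```rust
struct Entry {
    index: u64,
    payload: Vec<u8>,
}

enum SendError {
    /// The transport reports the connection is gone; reconnect and resume.
    ConnectionLost,
    /// The follower has seen a newer leader; stop this replication stream.
    HigherTerm(u64),
}

/// Hypothetical transport handle to one follower.
trait FollowerConn {
    /// Send entries; on success return the highest index the follower accepted.
    fn append_entries(&mut self, entries: &[Entry]) -> Result<u64, SendError>;
    fn reconnect(&mut self);
}

/// Keep pushing log entries from `next` onward. There is no send timeout:
/// a resend happens only after an explicit connection failure.
fn replicate(conn: &mut dyn FollowerConn, log: &[Entry], mut next: u64) -> Result<(), u64> {
    loop {
        let from = log.iter().position(|e| e.index == next).unwrap_or(log.len());
        let batch = &log[from..];
        if batch.is_empty() {
            return Ok(()); // nothing more to send in this sketch
        }
        match conn.append_entries(batch) {
            Ok(matched) => next = matched + 1,
            // Only an explicit connection failure triggers a reconnect + resend.
            Err(SendError::ConnectionLost) => conn.reconnect(),
            // Leadership was taken by another node: give up instead of retrying.
            Err(SendError::HigherTerm(t)) => return Err(t),
        }
    }
}
```

With a stateless one-shot transport (plain HTTP), `ConnectionLost` would simply be reported after the traditional Raft timeout, so the old behavior falls out as a special case.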

drmingdrmer avatar Mar 07 '22 15:03 drmingdrmer

Of course, it's not a plain Raft afterwards, but who cares :-).

From the README it sounds like improving upon Raft is part of the motivation for the library, so this seems to be within the vision of the project. :)

BTW, to me, geo-replication is quite a big topic. :-O.

Fair enough. I'm not 100% savvy about distributed-system implementation yet, so I wasn't sure whether Raft was stable enough with higher latencies, but it sounds like there's some work needed to make it feasible.

or the tradeoff between consistency and delay(such as CRDT or half CRDT protocol)

That's actually the direction I'm leaning towards now. I'm thinking about using a custom CRDT replication algorithm instead of Raft, one that uses an asynchronous majority vote of some sort to create consistent "snapshots" of the state that can be used when a consistent read is required. I'm not sure where I'll land yet; I'm still in the research and discovery phase, so everything's up in the air.

Thanks for your guys's comments!

zicklag avatar Mar 08 '22 01:03 zicklag

Fair enough. I'm not 100% savvy about distributed-system implementation yet, so I wasn't sure whether Raft was stable enough with higher latencies, but it sounds like there's some work needed to make it feasible.

AFAIK, the original Raft with pre-vote is stable with high latency. Since Raft always guarantees data safety, "stable" here means the leader won't change frequently due to unstable latency.

Without pre-vote, Raft leader stickiness isn't guaranteed.

openraft is trying to solve the leader-stickiness problem in another way, so for the time being it is not stable with high latency.

The idea in short is to use the log as a measure of time, and not to increase the term unless one has enough logs: https://github.com/datafuselabs/openraft/discussions/15
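A toy illustration of that idea (my own reading of the linked discussion, not openraft's actual election code): a node may only bump its term and campaign if its own log is at least as up-to-date as the greatest log id it has seen, so a node that is merely behind on a slow link cannot disturb a healthy leader.

```rust
// Log ids order by (term, index), as in Raft; derived Ord gives exactly that.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct LogId {
    term: u64,
    index: u64,
}

/// "Use the log as a measure of time": only a node whose log is no shorter
/// than what it has observed elsewhere may increase the term and campaign.
fn may_start_election(my_last_log: LogId, greatest_seen_log: LogId) -> bool {
    my_last_log >= greatest_seen_log
}
```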

The first step is:

  • #154

drmingdrmer avatar Mar 08 '22 02:03 drmingdrmer

After digging into lol raft, I found that the failure detector is only used to determine when to start the election.

In openraft, it's done with a timer: https://github.com/datafuselabs/openraft/blob/70689b76c92b5f9cab8ac179eda49953241311d0/openraft/src/core/mod.rs#L1074-L1090

In lol, it periodically (every 100 ms) calculates the failure rank φ from the time (x) since the last heartbeat was received. If φ is big enough, it starts an election.

Periodic checking is a little bit clumsy. To use the phi-detector, it's better to determine the time interval (x) to sleep for from the max acceptable failure rank φ, i.e., the inverse of the process used in lol raft.

Calculating φ from x is given as:

$$\phi(x) = -\log_{10}\big(P_{\text{later}}(x)\big)$$

$$P_{\text{later}}(x) = 1 - F(x)$$

$$F(x) = \int_{-\infty}^{x} f(t)\,dt$$

$$f(t) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(t-\mu)^{2}}{2\sigma^{2}}}$$

where μ and σ are the mean and standard deviation of the recent heartbeat inter-arrival times.

The inverse should not be difficult.

Another thing to do is to feed the heartbeat event into the phi-detector.
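To make both directions concrete, here is a minimal sketch of such a detector (not openraft's code, and the names are made up). To keep it short and avoid an inverse normal CDF, it uses the exponential-distribution simplification that some failure detectors use instead of the normal model above: P_later(x) = e^(−x/μ), so φ(x) = x / (μ · ln 10) and the inverse is x = φ · μ · ln 10.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Minimal phi accrual sketch over a sliding window of heartbeat intervals.
struct PhiDetector {
    last_heartbeat: Option<Instant>,
    intervals: VecDeque<f64>, // recent inter-arrival times, in seconds
    window: usize,
}

impl PhiDetector {
    fn new(window: usize) -> Self {
        Self { last_heartbeat: None, intervals: VecDeque::new(), window }
    }

    /// Feed a heartbeat event into the detector.
    fn on_heartbeat(&mut self, now: Instant) {
        if let Some(prev) = self.last_heartbeat {
            self.intervals.push_back((now - prev).as_secs_f64());
            if self.intervals.len() > self.window {
                self.intervals.pop_front();
            }
        }
        self.last_heartbeat = Some(now);
    }

    fn mean_interval(&self) -> f64 {
        if self.intervals.is_empty() {
            return 1.0; // arbitrary default before any samples exist
        }
        self.intervals.iter().sum::<f64>() / self.intervals.len() as f64
    }

    /// φ as a function of the time elapsed since the last heartbeat:
    /// phi(x) = x / (mu * ln 10) under the exponential simplification.
    fn phi(&self, now: Instant) -> f64 {
        let x = self
            .last_heartbeat
            .map(|t| (now - t).as_secs_f64())
            .unwrap_or(0.0);
        x / (self.mean_interval() * std::f64::consts::LN_10)
    }

    /// The inverse: how long to sleep before the failure rank reaches
    /// `phi_threshold`, i.e., when the election timer should fire.
    fn sleep_until_phi(&self, phi_threshold: f64) -> Duration {
        Duration::from_secs_f64(phi_threshold * self.mean_interval() * std::f64::consts::LN_10)
    }
}
```

Each received heartbeat (or AppendEntries from the leader) is fed in via `on_heartbeat()`, and the election task then sleeps for `sleep_until_phi(threshold)` instead of re-checking φ every 100 ms.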

drmingdrmer avatar Mar 10 '22 12:03 drmingdrmer