zraft_lib

Should a light session be tolerant to failures?

Open cmeiklejohn opened this issue 6 years ago • 3 comments

I'm reading and writing using a light_session -- the leader is node_1 and everything is zipping along fine, both reads and writes. However, if node_1 is partitioned and then crashes, I see the leader transition to node_3, but writes using the light_session do not seem to complete. When I debug the messages being sent, no matter where I issue the request from, I always see it go to node_1, which was the last known leader and is no longer online.

The documentation seems to indicate that, with the light session configured to know about all peers, a request that times out would be rerouted to another node in the cluster. Is this how it should work?

cmeiklejohn May 06 '19 21:05

Seems like not a lot of issues get responded to in this repository (some have been open since 2015). Maybe you'd be better off trying https://github.com/rabbitmq/ra?

bryanhuntesl May 07 '19 08:05

We're testing a variety of software as part of a research project, and this was one of the libraries we selected.

cmeiklejohn May 07 '19 14:05

@cmeiklejohn Yes. light_session should tolerate a node failure. It knows about all peers and should try the other nodes to find the new leader.

dreyk May 07 '19 15:05
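
For readers following the thread, here is a minimal sketch of the failover behaviour described above: a client session that remembers its last known leader but falls back to the other peers when that node does not answer. It is written in Python for illustration only and is not zraft_lib's API. The names `LightSession`, `NodeDown`, `make_cluster`, and the `("ok", ...)` / `("not_leader", ...)` reply shapes are all hypothetical.

```python
class NodeDown(Exception):
    """Raised by the hypothetical transport when a peer does not answer."""


class LightSession:
    """Toy client session: a leader guess plus the full peer list."""

    def __init__(self, peers, leader=None):
        self.peers = list(peers)
        self.leader = leader if leader is not None else self.peers[0]

    def request(self, send, payload):
        # Try the last known leader first, then every other peer in order.
        candidates = [self.leader] + [p for p in self.peers if p != self.leader]
        for peer in candidates:
            try:
                status, value = send(peer, payload)
            except NodeDown:
                continue  # peer crashed or is partitioned: try the next one
            if status == "ok":
                self.leader = peer  # remember the node that served the request
                return value
            # Any other status (e.g. "not_leader") also means: try the next peer.
        raise RuntimeError("no peer accepted the request")


def make_cluster(leader, down):
    """Build a fake transport: `down` peers time out, only `leader` accepts."""
    def send(peer, payload):
        if peer in down:
            raise NodeDown(peer)
        if peer != leader:
            return ("not_leader", leader)
        return ("ok", f"{payload} applied on {peer}")
    return send


# Scenario from this issue: node_1 was the leader, crashed, node_3 took over.
session = LightSession(["node_1", "node_2", "node_3"], leader="node_1")
send = make_cluster(leader="node_3", down={"node_1"})
result = session.request(send, "write x=1")
print(result)          # the write lands on node_3
print(session.leader)  # the session now points at node_3
```

The point of the sketch is that the leader guess is only a starting hint. A session that keeps resending to node_1, as reported in the issue, never reaches the fallback loop.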