
How to improve high availability while waitTillLeaderIsReadyOrStepDown is waiting

Open jackjoesh opened this issue 3 years ago • 10 comments

To be truly ready, the leader must successfully apply all business data after the election. During this period the leader cannot actually serve requests; in Gringofts, the method waitTillLeaderIsReadyOrStepDown blocks while waiting. If a lot of data needs to be applied, the wait can be long. Does Gringofts have any high-availability solutions that minimize the time the service is unavailable? Thank you.

jackjoesh avatar Apr 01 '22 15:04 jackjoesh

Hi, thanks for your interest. This step waits for two things: first, the leader coming on stage, and second, applying the latest log.

  • The first one is very quick if the Raft cluster is healthy.
  • For the second one,
    • if the instance has just started up, it might take time to recover state from rocksdb and apply the remaining logs to reach the latest state.
    • If it is a follower-to-leader transition, we have a little trick. We have two single-threaded loops: one is the CPL (command processing loop), which decodes commands into events and then commits the events to the Raft layer; the other is the EAL (event applying loop), which applies committed events from the Raft log to maintain the latest state. The leader has both, while followers have only the EAL. When a follower becomes the new leader, it can directly use the EAL-built state machine as its latest state, so the switch is also fast.

huyumars avatar Apr 01 '22 16:04 huyumars


Thank you for your quick reply. Yes, I'm talking about the second one. Suppose we restart a follower instance for a release, and at that moment there happens to be a problem with the leader, so the just-restarted follower becomes the leader by voting. This new leader may take a long time to load state from the rocksdb snapshot and apply the remaining logs to reach the latest state. In this situation, what should we do?

jackjoesh avatar Apr 02 '22 02:04 jackjoesh

Our state machine takes advantage of rocksdb. The EAL only writes to rocksdb, while the CPL reads from an in-memory cache plus the same shared rocksdb. You can think of the rocksdb state as the latest committed state, which is also the EAL state, and the CPL state as the rocksdb state plus the uncommitted command state in memory. So when a follower becomes the new leader, the EAL/CPL swap just means the CPL starts writing data in memory on top of the latest committed state in rocksdb, while the EAL keeps writing to the same rocksdb. Both are up to date, and recovery is very fast.

huyumars avatar Apr 02 '22 09:04 huyumars


Thank you, I get your design point. Because we want to store idempotency and original request data (one update may store N fund items) in rocksdb, the data is very large. I think very large data will affect the frequency of snapshots, resulting in a very large number of events that need to be applied. Then, on a follower-to-leader transition, the EAL apply would also be very slow. Does your actual business store idempotency and original request data in the EAL rocksdb? How do you solve this problem? Thank you!

jackjoesh avatar Apr 05 '22 15:04 jackjoesh

In practice, there are no extra EAL events to apply, since the EAL should already have caught up with the committed index. Yes, we have PROD systems using Gringofts; that's why we open-sourced it and maintain it. This design has already been proven in our systems.

huyumars avatar Apr 07 '22 12:04 huyumars

Thank you, I get it.

jackjoesh avatar Apr 11 '22 08:04 jackjoesh


One last question: what is the max TPS a single Gringofts group can support (just a single group, not multiple groups)? Thank you.

jackjoesh avatar Apr 12 '22 03:04 jackjoesh

> Thank you, I get your design point. Because we want to store idempotency and original request data (one update may store N fund items) in rocksdb, the data is very large. I think very large data will affect the frequency of snapshots, resulting in a very large number of events that need to be applied. Then, on a follower-to-leader transition, the EAL apply would also be very slow. Does your actual business store idempotency and original request data in the EAL rocksdb? How do you solve this problem? Thank you!

The apply function should execute quickly, and it can, since most of the heavy work is handled in the process function; apply just applies the processed result.

jackyjia avatar Apr 15 '22 01:04 jackyjia

> Thank you, I get it.

Just curious, which industry are you in and what's your use case?

jackyjia avatar Apr 15 '22 01:04 jackyjia

> One last question: what is the max TPS a single Gringofts group can support (just a single group, not multiple groups)? Thank you.

It depends on several factors:

  1. how complicated the process logic is
  2. request payload
  3. network bandwidth
  4. cluster setup: usually the more instances in the cluster, the lower the throughput

In our production, TPS is around 8K.

jackyjia avatar Apr 15 '22 01:04 jackyjia