coordinator: fix too many cursors caused by ensemble breaking

Open mattisonchao opened this issue 1 year ago • 1 comments

Motivation

failed test: https://github.com/streamnative/oxia/actions/runs/9489462225/job/26150775945

In the old logic, we count removed nodes as part of the ensemble, which causes some issues.

  1. The fencingQuorum no longer matches the expected quorum size.
  2. More followers are assigned to the shared leader (since we are waiting for quorumFencingGracePeriod), which causes too many cursor exceptions and makes the leader election fail constantly.
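
To illustrate the two points above, here is a minimal sketch, not the actual coordinator code. It assumes the quorum is a simple majority (`n/2 + 1`) and that every fenced node other than the leader becomes a follower; `majority`, `fencingSet`, and the node names are hypothetical and only serve the example.

```go
package main

import "fmt"

// majority returns the smallest node count that is a strict majority of n.
// Assumption: the fencing quorum is a plain majority.
func majority(n int) int {
	return n/2 + 1
}

// fencingSet is a hypothetical helper that builds the set of nodes
// considered during fencing. With includeRemoved=true it mimics the old
// logic, where removed nodes are counted together with the ensemble.
func fencingSet(ensemble, removed []string, includeRemoved bool) []string {
	out := append([]string{}, ensemble...)
	if includeRemoved {
		out = append(out, removed...)
	}
	return out
}

func main() {
	// Ensemble after swapping s1 out for s4 (illustrative names).
	ensemble := []string{"s2", "s3", "s4"}
	removed := []string{"s1"}

	old := fencingSet(ensemble, removed, true)
	fixed := fencingSet(ensemble, removed, false)

	// Followers are all fenced nodes except the leader, so each one
	// needs a follow cursor on the leader.
	fmt.Printf("old:   nodes=%d quorum=%d followers=%d\n",
		len(old), majority(len(old)), len(old)-1)
	fmt.Printf("fixed: nodes=%d quorum=%d followers=%d\n",
		len(fixed), majority(len(fixed)), len(fixed)-1)
}
```

Under these assumptions, a replication factor of 3 with one removed node is treated as 4 nodes, so both the quorum and the number of followers attached to the leader are one higher than expected.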

Alternative

I am unsure why we must add RemovedNodes to the quorum. If there is a reason to keep them, we can filter them out in the quorumFencingGracePeriod logic. :)

mattisonchao, Jun 13 '24 12:06

Is the effect that we have many repeated entries in the cluster status? I think I've seen that problem at some point though I was unable to replicate again.

Could you please clarify what you mean by repeated entries in the cluster status? I am unsure which part you are referring to.

I tested this several times and found that our leader election has not been stable since the node swap function was introduced; it can trigger concurrent leader elections. To reproduce, simply close server 1 after you change the configuration here: https://github.com/streamnative/oxia/blob/9696873987158b171c9ffe5d825571c00fff271f/coordinator/impl/coordinator_e2e_test.go#L570

Some issues might show up when you run it many times.

The root cause should be related to concurrent leader elections in the shard controller.

  1. Even though we cancel the old context, we do not interrupt the old goroutine, so the behaviour is undefined. On top of that, our context handling is not accurate, which can make things worse.

e.g.: https://github.com/streamnative/oxia/blob/9696873987158b171c9ffe5d825571c00fff271f/coordinator/impl/shard_controller.go#L289 If we use this context to control concurrency, we might need to pass it to all the sub-operations in the election. However, I suggest using a channel to sequence the operations and keep it simple.

  2. If the removed node has crashed, the leader election will never finish (caused by the delete shards error): https://github.com/streamnative/oxia/blob/9696873987158b171c9ffe5d825571c00fff271f/coordinator/impl/shard_controller.go#L341

etc...
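
For the channel suggestion in point 1, a rough sketch of what sequencing elections through a channel could look like. This is not the shard controller code: `electionRequest`, `electionLoop`, `runElection`, and `step` are hypothetical names, and the election body is simulated with short sleeps. The idea is that a single goroutine owns all elections, and the same context is passed to every sub-operation so cancellation actually stops the work.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// electionRequest is a hypothetical message asking for one election.
type electionRequest struct {
	reason string
	done   chan error
}

// step simulates one sub-operation of an election and returns early
// if ctx is cancelled while it is waiting.
func step(ctx context.Context, d time.Duration) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(d):
		return nil
	}
}

// runElection passes the same ctx to every sub-operation.
func runElection(ctx context.Context, reason string, log *[]string) error {
	*log = append(*log, "start "+reason)
	for i := 0; i < 3; i++ {
		if err := step(ctx, time.Millisecond); err != nil {
			return err
		}
	}
	*log = append(*log, "end "+reason)
	return nil
}

// electionLoop is the only goroutine that runs elections, so two
// elections for the same shard can never overlap.
func electionLoop(ctx context.Context, requests <-chan electionRequest, log *[]string) {
	for {
		select {
		case <-ctx.Done():
			return
		case req := <-requests:
			req.done <- runElection(ctx, req.reason, log)
		}
	}
}

// runDemo submits the requests from concurrent goroutines and returns
// the order in which the elections started and ended.
func runDemo(reasons []string) []string {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var log []string
	requests := make(chan electionRequest)
	go electionLoop(ctx, requests, &log)

	dones := make([]chan error, len(reasons))
	for i, r := range reasons {
		dones[i] = make(chan error, 1)
		go func(req electionRequest) { requests <- req }(electionRequest{reason: r, done: dones[i]})
	}
	for _, d := range dones {
		<-d // wait until every election has finished
	}
	return log
}

func main() {
	for _, line := range runDemo([]string{"node-swap", "node-failure"}) {
		fmt.Println(line)
	}
}
```

Whichever request is picked up first, its "end" entry is logged before the next "start", which is the property the concurrent elections are currently missing.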

Sorry, I have just brought up some issues I found but have not had time to fix them yet. Also, they are not very common or urgent, so I guess we can fix them later.

mattisonchao, Jun 15 '24 14:06