coordinator: fix too many cursors caused by ensemble breaking
Motivation
failed test: https://github.com/streamnative/oxia/actions/runs/9489462225/job/26150775945
In the old logic, we counted removed nodes as part of the ensemble, which caused two issues:
- The fencingQuorum no longer matches the expected quorum number.
- More followers are assigned to the shared leader (since we are waiting for quorumFencingGracePeriod), which causes "too many cursor" exceptions and makes the leader election fail constantly.
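
To illustrate the quorum arithmetic, here is a minimal sketch, not the coordinator's actual code. It assumes a simple majority quorum, and `activeEnsemble`, `majority`, and the node names are hypothetical names introduced only for this example.

```go
package main

import "fmt"

// activeEnsemble returns the nodes that are not marked as removed.
// Hypothetical helper: it stands in for filtering removed nodes out
// before the quorum size is computed.
func activeEnsemble(nodes, removed []string) []string {
	skip := make(map[string]struct{}, len(removed))
	for _, n := range removed {
		skip[n] = struct{}{}
	}
	active := make([]string, 0, len(nodes))
	for _, n := range nodes {
		if _, gone := skip[n]; !gone {
			active = append(active, n)
		}
	}
	return active
}

// majority returns the smallest number of nodes that forms a majority of n.
// Assumption: the quorum is a simple majority.
func majority(n int) int {
	return n/2 + 1
}

func main() {
	// Three ensemble members plus one node that is being swapped out.
	nodes := []string{"s1", "s2", "s3", "s4"}
	removed := []string{"s4"}

	fmt.Println(majority(len(nodes)))                          // 3: removed node counted
	fmt.Println(majority(len(activeEnsemble(nodes, removed)))) // 2: removed node excluded
}
```

With the removed node counted, the election waits for one more response than a three-node ensemble should need.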
Alternative
I am unsure why we must add RemovedNodes to the quorum. If there is a reason to keep them, we can filter them out in the quorumFencingGracePeriod logic instead. :)
Is the effect that we have many repeated entries in the cluster status? I think I've seen that problem at some point, though I was unable to reproduce it again.
Could you please clarify what you mean by repeated entries in the cluster status? I am unsure which part you are referring to.
I ran the test several times and found that our leader election has not been stable since the node swap function was introduced, because it can cause concurrent leader elections. To reproduce, change the configuration here and then simply close server 1: https://github.com/streamnative/oxia/blob/9696873987158b171c9ffe5d825571c00fff271f/coordinator/impl/coordinator_e2e_test.go#L570
Some issues may show up when you run it many times.
The root cause should be related to concurrent leader elections in the shard controller.
- Even though we cancel the old context, we don't interrupt the old goroutine, so the behaviour is undefined. On top of that, our use of the context is not accurate, which can make things worse.
e.g.: https://github.com/streamnative/oxia/blob/9696873987158b171c9ffe5d825571c00fff271f/coordinator/impl/shard_controller.go#L289
If we use this context to control concurrency, we might need to pass it to all the sub-operations in the election. However, I suggest using a channel to sequence the operations and keep things simple.
- If the removed node has crashed, the leader election will never finish (caused by the delete shards error). https://github.com/streamnative/oxia/blob/9696873987158b171c9ffe5d825571c00fff271f/coordinator/impl/shard_controller.go#L341
etc...
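
The channel suggestion above could look roughly like the sketch below. It is only an illustration under my own assumptions, not the shard controller's code: `electionRequest`, `runElections`, and `electAll` are hypothetical names, and the election itself is replaced by a stub callback. A single worker goroutine receives requests from a channel, so two elections can never overlap, and the context is passed into every election.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// electionRequest is a hypothetical request type. The worker closes done
// once the election for this request has finished.
type electionRequest struct {
	reason string
	done   chan struct{}
}

// runElections handles requests strictly one at a time until ctx is
// cancelled or the requests channel is closed.
func runElections(ctx context.Context, requests <-chan electionRequest,
	elect func(ctx context.Context, reason string)) {
	for {
		select {
		case <-ctx.Done():
			return
		case req, ok := <-requests:
			if !ok {
				return
			}
			elect(ctx, req.reason) // the same ctx reaches every sub-operation
			close(req.done)
		}
	}
}

// electAll submits all reasons concurrently and returns them in the order
// in which the worker actually handled them.
func electAll(reasons []string) []string {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // stops the worker when we are done

	requests := make(chan electionRequest)
	var handled []string
	go runElections(ctx, requests, func(_ context.Context, reason string) {
		handled = append(handled, reason) // only the worker goroutine writes here
	})

	var wg sync.WaitGroup
	for _, r := range reasons {
		wg.Add(1)
		go func(reason string) {
			defer wg.Done()
			req := electionRequest{reason: reason, done: make(chan struct{})}
			requests <- req
			<-req.done // wait until the worker has finished this election
		}(r)
	}
	wg.Wait()
	return handled
}

func main() {
	handled := electAll([]string{"node-failure", "node-swap", "manual"})
	fmt.Println(len(handled), "elections ran one at a time")
}
```

The order in which concurrent callers get served is not fixed, but each election runs to completion before the next one starts, so no old goroutine has to be interrupted.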
Sorry, I have only brought up some issues I found and still have no time to fix them. Also, they are not very common or urgent, so I guess we can fix them later.