replication: `healthy_count` is outdated / lagging
replicaset.healthy_count sometimes seems to be updated incorrectly, or too late. Here is a test:
--
-- Instance 1
--
-- Step 1
--
fiber = require('fiber')
log = require('log')
box.cfg{
listen = 3313,
replication = {3313, 3314, 3315},
replication_synchro_timeout = 1000,
election_mode = 'candidate',
replication_synchro_quorum = 2,
replication_timeout = 1000,
replication_reconnect_timeout = 1,
}
box.ctl.promote()
box.ctl.wait_rw()
box.schema.user.grant('guest', 'super')
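--
-- (Aside, not part of the repro: at this point the instance is expected
-- to have won the elections. A quick sanity check via the regular
-- box.info.election API:)
--
assert(box.info.election.state == 'leader')
assert(not box.info.ro)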
--
-- Step 4
--
box.cfg{replication = {}}
--
-- Instance 2
--
-- Step 2
--
fiber = require('fiber')
log = require('log')
box.cfg{
listen = 3314,
replication = {3313, 3314, 3315},
replication_synchro_timeout = 1000,
election_mode = 'voter',
replication_synchro_quorum = 2,
replication_timeout = 1000,
replication_reconnect_timeout = 1,
}
--
-- Step 5
--
box.cfg{replication = {3314, 3315}}
--
-- Instance 3
--
-- Step 3
--
fiber = require('fiber')
log = require('log')
box.cfg{
listen = 3315,
replication = {3313, 3314, 3315},
replication_synchro_timeout = 1000,
election_mode = 'voter',
election_timeout = 1000,
replication_synchro_quorum = 2,
replication_timeout = 1000,
replication_reconnect_timeout = 1,
}
--
-- Step 6
--
box.cfg{replication = {3314, 3315}}
box.cfg{election_mode = 'candidate'}
box.ctl.promote()
At Step 6, Instance-3 and Instance-2 are perfectly connected to each other. And yet the promote fails with
---
- error: 'Not enough peers connected to start elections: 1 out of minimal required
2'
...
which is clearly not true.
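For the record, "perfectly connected" can be verified on Instance-3 right before the promote via the regular box.info.replication API. Assuming the bootstrap order above gives Instance-2 the replica id 2, both directions of the connection report 'follow':
r = box.info.replication[2]
r.upstream.status   -- 'follow'
r.downstream.status -- 'follow'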
At the same time, the problem miraculously goes away if Instance-1 performs 2 transactions after the bootstrap is complete, but before it gets disconnected. The example below works fine:
--
-- Instance 1
--
-- Step 1
--
fiber = require('fiber')
log = require('log')
box.cfg{
listen = 3313,
replication = {3313, 3314, 3315},
replication_synchro_timeout = 1000,
election_mode = 'candidate',
replication_synchro_quorum = 2,
replication_timeout = 1000,
replication_reconnect_timeout = 1,
}
box.ctl.promote()
box.ctl.wait_rw()
box.schema.user.grant('guest', 'super')
--
-- Step 4
--
s = box.schema.create_space('test')
-- Wait a bit.
fiber.sleep(1)
_ = s:create_index('pk')
-- Wait a bit.
fiber.sleep(1)
box.cfg{replication = {}}
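--
-- (Aside, not part of the repro: the fiber.sleep(1) calls above are a
-- crude way to let the rows reach the replicas. A less timing-dependent
-- alternative would poll the downstream vclocks instead. wait_delivery()
-- below is a hypothetical helper sketched for illustration, not an
-- existing API:)
--
function wait_delivery()
    local self_id = box.info.id
    local target = box.info.lsn
    while true do
        local done = true
        for _, r in pairs(box.info.replication) do
            local d = r.downstream
            -- A peer with a downstream connection counts as done only
            -- once it has acked everything this instance wrote.
            if d ~= nil and (d.vclock == nil or
               (d.vclock[self_id] or 0) < target) then
                done = false
            end
        end
        if done then
            return
        end
        fiber.sleep(0.01)
    end
end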
--
-- Instance 2
--
-- Step 2
--
fiber = require('fiber')
log = require('log')
box.cfg{
listen = 3314,
replication = {3313, 3314, 3315},
replication_synchro_timeout = 1000,
election_mode = 'voter',
replication_synchro_quorum = 2,
replication_timeout = 1000,
replication_reconnect_timeout = 1,
}
--
-- Step 5
--
box.cfg{replication = {3314, 3315}}
--
-- Instance 3
--
-- Step 3
--
fiber = require('fiber')
log = require('log')
box.cfg{
listen = 3315,
replication = {3313, 3314, 3315},
replication_synchro_timeout = 1000,
election_mode = 'voter',
election_timeout = 1000,
replication_synchro_quorum = 2,
replication_timeout = 1000,
replication_reconnect_timeout = 1,
}
--
-- Step 6
--
box.cfg{replication = {3314, 3315}}
box.cfg{election_mode = 'candidate'}
box.ctl.promote()
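--
-- (Aside: in this variant the promote goes through. A quick check that
-- Instance-3 actually took over, using the same box.info.election API
-- as above:)
--
box.ctl.wait_rw()
assert(box.info.election.state == 'leader')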
The only difference is that Instance-1 performs 2 random transactions, slowly, giving each of them time to get delivered, and only then breaks replication with the others.
This doesn't look correct.
Not necessarily a solution, but it feels like the attempt to optimize the replicaset.healthy_count calculation in replica_update_applier_health() and replica_update_relay_health() by updating the count incrementally doesn't pay off. It looks complex.
It might be easier to start with a refactoring where all these incremental changes are removed and replaced with a single big replication_update_states() function, which would recalculate all these counts, states, and whatever else from scratch in one go by full-scanning the replicas array/hashtable.
It can't be that expensive. Replica states change rarely and are always accompanied by much bigger overheads like TCP reconnects, which would dwarf any counter bumps and full scans of an array of at most 32 items. And I bet the logic in the code would become much simpler.
As "all other states" I mean healthy_count, anon_count, registered_count, replicaset.applier.connected, replicaset.applier.loading, replicaset.applier.synced, replicaset.applier.total. Maybe more. All of these are updated individually and on myriads of various events. It is very easy to mess some counters up.