
replication: `healthy_count` is outdated / lagging

Open · Gerold103 opened this issue 1 month ago · 1 comment

replicaset.healthy_count sometimes seems to be updated incorrectly, or too late. Here is a test:

--
-- Instance 1
--
-- Step 1
--
fiber = require('fiber')
log = require('log')
box.cfg{
    listen = 3313,
    replication = {3313, 3314, 3315},
    replication_synchro_timeout = 1000,
    election_mode = 'candidate',
    replication_synchro_quorum = 2,
    replication_timeout = 1000,
    replication_reconnect_timeout = 1,
}
box.ctl.promote()
box.ctl.wait_rw()
box.schema.user.grant('guest', 'super')
--
-- Step 4
--
box.cfg{replication = {}}




--
-- Instance 2
--
-- Step 2
--
fiber = require('fiber')
log = require('log')
box.cfg{
    listen = 3314,
    replication = {3313, 3314, 3315},
    replication_synchro_timeout = 1000,
    election_mode = 'voter',
    replication_synchro_quorum = 2,
    replication_timeout = 1000,
    replication_reconnect_timeout = 1,
}
--
-- Step 5
--
box.cfg{replication = {3314, 3315}}



--
-- Instance 3
--
-- Step 3
--
fiber = require('fiber')
log = require('log')
box.cfg{
    listen = 3315,
    replication = {3313, 3314, 3315},
    replication_synchro_timeout = 1000,
    election_mode = 'voter',
    election_timeout = 1000,
    replication_synchro_quorum = 2,
    replication_timeout = 1000,
    replication_reconnect_timeout = 1,
}
--
-- Step 6
--
box.cfg{replication = {3314, 3315}}
box.cfg{election_mode = 'candidate'}
box.ctl.promote()

At Step 6, Instance-3 and Instance-2 are perfectly connected to each other. And yet this promote fails with

---
- error: 'Not enough peers connected to start elections: 1 out of minimal required
    2'
...

which is clearly not true.
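
For what it's worth, the connectivity itself is easy to double-check from the console right before the promote at Step 6. Here is a hypothetical diagnostic snippet (not part of the reproducer, and relying only on the public box.info output); presumably it reports both directions towards Instance-2 as alive while the election code still refuses to start:

-- Run on Instance-3 just before the Step-6 promote. For every known
-- peer print the applier (upstream) and relay (downstream) status;
-- the entry for the instance itself has neither.
for _, r in pairs(box.info.replication) do
    local up = r.upstream and r.upstream.status or 'none'
    local down = r.downstream and r.downstream.status or 'none'
    log.info('replica id=%d uuid=%s upstream=%s downstream=%s',
             r.id, r.uuid, up, down)
end
log.info('election state=%s term=%s',
         box.info.election.state, tostring(box.info.election.term))

So the refusal does not come from what the instances report about each other, but from the internal healthy_count lagging behind.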

At the same time, the problem miraculously goes away if Instance-1 performs 2 transactions after the cluster is bootstrapped, but before it gets disconnected. The example below works fine:

--
-- Instance 1
--
-- Step 1
--
fiber = require('fiber')
log = require('log')
box.cfg{
    listen = 3313,
    replication = {3313, 3314, 3315},
    replication_synchro_timeout = 1000,
    election_mode = 'candidate',
    replication_synchro_quorum = 2,
    replication_timeout = 1000,
    replication_reconnect_timeout = 1,
}
box.ctl.promote()
box.ctl.wait_rw()
box.schema.user.grant('guest', 'super')
--
-- Step 4
--
s = box.schema.create_space('test')
-- Wait a bit.
fiber.sleep(1)
_ = s:create_index('pk')
-- Wait a bit.
fiber.sleep(1)
box.cfg{replication = {}}




--
-- Instance 2
--
-- Step 2
--
fiber = require('fiber')
log = require('log')
box.cfg{
    listen = 3314,
    replication = {3313, 3314, 3315},
    replication_synchro_timeout = 1000,
    election_mode = 'voter',
    replication_synchro_quorum = 2,
    replication_timeout = 1000,
    replication_reconnect_timeout = 1,
}
--
-- Step 5
--
box.cfg{replication = {3314, 3315}}



--
-- Instance 3
--
-- Step 3
--
fiber = require('fiber')
log = require('log')
box.cfg{
    listen = 3315,
    replication = {3313, 3314, 3315},
    replication_synchro_timeout = 1000,
    election_mode = 'voter',
    election_timeout = 1000,
    replication_synchro_quorum = 2,
    replication_timeout = 1000,
    replication_reconnect_timeout = 1,
}
--
-- Step 6
--
box.cfg{replication = {3314, 3315}}
box.cfg{election_mode = 'candidate'}
box.ctl.promote()

The only difference is that Instance-1 performs 2 arbitrary transactions, slowly, giving each of them time to be delivered, and only then breaks replication with the others.

This doesn't look correct.

Gerold103 · Nov 10 '25, 22:11

Not necessarily a solution, but it feels like the attempt to optimize the replicaset.healthy_count calculation in replica_update_applier_health() and replica_update_relay_health() by updating the count incrementally doesn't pay off. It looks complex.

It might be easier to start with a refactoring where all these incremental updates are removed and replaced with a single big replication_update_states() function, which would recalculate all these counts and states and whatever else from scratch, in one go, by full-scanning the replicas array/hashtable.

It can't be that expensive. Replica states change rarely, and the changes are always accompanied by much bigger overheads such as TCP reconnects, which would dwarf any counter bumps and full scans of an array of at most 32 items. But I bet the logic in the code would become much simpler.

As "all other states" I mean healthy_count, anon_count, registered_count, replicaset.applier.connected, replicaset.applier.loading, replicaset.applier.synced, replicaset.applier.total. Maybe more. All of these are updated individually and on myriads of various events. It is very easy to mess some counters up.

Gerold103 · Nov 10 '25, 22:11