rocksplicator Add Upstream Validator

Upstream Validation by Followers

Once in a while, we have noticed that followers do not get the right upstream set up. We believe this happens because messages from the Controller are missed due to race conditions. We will identify and fix those issues separately.

Solution

If a follower's upstream differs from the one specified in the Cluster Map, set it explicitly. For now, do this only for followers whose sequence number does not change.

Design

UpstreamValidator will be a member of RocksDBReplicator
UpstreamValidator will run a validation task once every N seconds (say N = 60)
Track the LatestSequenceNumber when the validator runs
Identify dormant followers (no change in sequence number since the previous run)
Check whether the upstream of each dormant follower matches the cluster map
Keep track of the run count of the thread
Reset the upstream address and re-init the client pool if necessary
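
The design steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FollowerState`, `UpstreamValidatorSketch`, `runOnce`, and the `ResetFn` callback are hypothetical names, the cluster map is modeled as a plain map from DB name to expected upstream address, and the periodic scheduling (once every N seconds) and client pool re-initialization are left to the caller.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical snapshot of one follower DB (not rocksplicator's real type).
struct FollowerState {
  uint64_t latest_seq_no;  // LatestSequenceNumber observed at this run
  std::string upstream;    // upstream address currently configured
};

class UpstreamValidatorSketch {
 public:
  // Callback that resets the upstream for a DB; a real implementation
  // would also re-init the client pool here (assumption).
  using ResetFn =
      std::function<void(const std::string& db, const std::string& addr)>;

  explicit UpstreamValidatorSketch(ResetFn reset) : reset_(std::move(reset)) {}

  // One validation pass. Returns the names of the DBs that were reset.
  std::vector<std::string> runOnce(
      const std::map<std::string, FollowerState>& followers,
      const std::map<std::string, std::string>& cluster_map) {
    ++run_count_;
    std::vector<std::string> reset_dbs;
    for (const auto& entry : followers) {
      const std::string& db = entry.first;
      const FollowerState& state = entry.second;

      // Dormant: seen in a previous run and the sequence number is unchanged.
      auto prev = last_seq_no_.find(db);
      const bool dormant =
          prev != last_seq_no_.end() && prev->second == state.latest_seq_no;
      last_seq_no_[db] = state.latest_seq_no;
      if (!dormant) continue;

      // Only act when the cluster map names a different upstream.
      auto expected = cluster_map.find(db);
      if (expected == cluster_map.end()) continue;
      if (expected->second != state.upstream) {
        reset_(db, expected->second);
        reset_dbs.push_back(db);
      }
    }
    return reset_dbs;
  }

  uint64_t runCount() const { return run_count_; }

 private:
  ResetFn reset_;
  std::map<std::string, uint64_t> last_seq_no_;  // seq no seen at last run
  uint64_t run_count_ = 0;
};
```

In this sketch a follower is never treated as dormant on the first run that sees it, since there is no earlier sequence number to compare against. A follower that is actively replicating is skipped even if its upstream differs from the cluster map, which matches the "dormant followers only" scope described above.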

Dec 03 '21 23:12 premkumr

@premkumr can you please also share the doc for this feature?

What would be the consequence if the shard map content is stale or incorrect for some reason?

Dec 06 '21 22:12 newpoo

Doc ..

We only check and reset followers that are not replicating. So if the shardmap is stale, we will try to reset only the erroneous followers, and once the correct shardmap is dispatched, it will be re-applied if the errors persist.

Dec 06 '21 23:12 premkumr

I left a couple of questions in the doc just for clarification. It would be good to explain more about the current scheme for setting an upstream and the problem with it.

On this particular method, it seems we already have a resetUpstream method (though it seems unused) which checks the leader by querying Helix directly, but the new upstream validator introduced in this PR checks the shard map file instead (which is another dependency to deal with, and can possibly conflict with the Helix state). Do we need two approaches to checking where the leader is?

Dec 07 '21 20:12 jaricftw

which checks the leader by querying Helix directly, but the new upstream validator introduced in this PR checks the shard map file instead ..

The shardmap is actually generated and stored in ZK and dispatched to all the hosts. The Helix API queries ZK directly and could overload it. Even in the current resetUpstream, we have taken care not to overload ZK (that path can also be hit too often).

Loading from the local disk is simpler and safer. We will also add a comparison between the shardmap returned from the Helix APIs and the local one in the next versions. We also have a tool (shardmaptool) that traces the history of the shardmaps that were copied to the local disks over time.

Dec 09 '21 03:12 premkumr
