rocksplicator

Add Upstream Validator

Open premkumr opened this issue 3 years ago • 4 comments

Upstream Validation by Followers

Once in a while, we have noticed that followers do not get the right upstream set up. We believe this happens because messages from the Controller are missed due to race conditions. We will identify and fix those issues separately.

Solution

If a follower's upstream differs from the one specified in the Cluster Map, set it explicitly. For now, do this only for followers whose sequence number does not change.

Design

  • UpstreamValidator will be a member of RocksDBReplicator
  • UpstreamValidator will run a validation task once every N seconds (say, N=60)
  • Track the LatestSequenceNumber when the validator runs
  • Identify dormant followers (no change in sequence number since the previous run)
  • Check whether the upstream of each dormant follower matches the cluster map
  • Keep track of the run count of the thread
  • Reset the upstream address & re-init the client pool if necessary
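
The steps above can be sketched roughly as follows. This is only an illustrative Python sketch of the logic, not the actual RocksDBReplicator code: the `Follower` fields, the `cluster_map_leader` lookup, and the `reset_upstream` callback are all hypothetical names introduced here, and re-initializing the client pool is assumed to happen inside that callback.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Follower:
    """Per-follower state the validator looks at (hypothetical shape)."""
    upstream: str                            # address currently replicated from
    latest_seq_no: int                       # stand-in for LatestSequenceNumber
    last_seen_seq_no: Optional[int] = None   # value recorded on the previous run


class UpstreamValidator:
    def __init__(
        self,
        followers: Dict[str, Follower],
        cluster_map_leader: Callable[[str], Optional[str]],  # assumed lookup
        reset_upstream: Callable[[str, str], None],          # assumed callback
        interval_secs: float = 60.0,                         # "N" in the design
    ) -> None:
        self._followers = followers
        self._leader_of = cluster_map_leader
        self._reset = reset_upstream
        self._interval = interval_secs
        self._stop = threading.Event()
        self.run_count = 0

    def validate_once(self) -> List[str]:
        """Run one validation pass; return the names of followers that were reset."""
        self.run_count += 1
        fixed = []
        for name, follower in self._followers.items():
            # Dormant: the sequence number has not moved since the previous run.
            dormant = follower.last_seen_seq_no == follower.latest_seq_no
            follower.last_seen_seq_no = follower.latest_seq_no
            if not dormant:
                continue  # actively replicating followers are left alone
            expected = self._leader_of(name)
            if expected is not None and expected != follower.upstream:
                self._reset(name, expected)  # assumed to also re-init the client pool
                follower.upstream = expected
                fixed.append(name)
        return fixed

    def run_forever(self) -> None:
        # wait() returns True once stop() is called, which ends the loop.
        while not self._stop.wait(self._interval):
            self.validate_once()

    def stop(self) -> None:
        self._stop.set()
```

Note that in this sketch the first pass only records sequence numbers, so a follower can be reset on the second pass at the earliest. A follower that is dormant simply because its leader has no writes is checked but not touched, since its upstream already matches the cluster map.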

premkumr avatar Dec 03 '21 23:12 premkumr

@premkumr can you please also share the doc for this feature?

What would be the consequence if the shard map content is stale or incorrect for some reason?

newpoo avatar Dec 06 '21 22:12 newpoo

Doc ..

We only check / reset the followers that are not replicating. So if the shardmap is stale, we will try to reset only the erroneous followers, and once the correct shardmap is dispatched, it will be re-applied if the errors persist.

premkumr avatar Dec 06 '21 23:12 premkumr

I left a couple of questions in the doc for clarification. It would be good to explain more about the current scheme for setting an upstream and the problem with it.

On this particular method: it seems we already have a resetUpstream method (though it seems unused) that checks the leader by querying Helix directly, but the new upstream validator introduced in this PR checks the shard map file instead (which is another dependency to deal with, and can possibly conflict with the Helix state). Do we need two approaches for checking where the leader is?

jaricftw avatar Dec 07 '21 20:12 jaricftw

which checks the leader by querying Helix directly, but the new upstream validator introduced in this PR checks the shard map file instead ..

The shardmap is actually generated and stored in ZK and dispatched to all the hosts. The Helix API queries ZK directly and could overload it. Even in the current resetUpstream, we have taken care not to overload ZK (also, that path can be hit too often).

Loading from the local disk is simpler and safer. We will also add a comparison between the shardmap returned from the Helix APIs and the local one in the next versions. Also, we have a tool (shardmaptool) that traces the history of the shardmaps that were copied to the local disks over time.
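
To give a rough idea of what such a comparison could look like, here is a minimal Python sketch. It assumes a simplified, hypothetical layout in which each shardmap has already been reduced to a `{partition: leader address}` mapping stored as JSON; the real shardmap file format and the Helix API calls are not shown, and the function names are made up for illustration.

```python
import json
from typing import Dict, List, Optional, Tuple

# Hypothetical simplified view of a shardmap: partition name -> leader address.
LeaderMap = Dict[str, str]
Mismatch = Tuple[str, Optional[str], Optional[str]]


def load_local_leader_map(path: str) -> LeaderMap:
    """Read the locally dispatched map from disk (assumed to be a flat JSON object)."""
    with open(path, encoding="utf-8") as f:
        return dict(json.load(f))


def diff_leader_maps(local: LeaderMap, remote: LeaderMap) -> List[Mismatch]:
    """Return (partition, local leader, remote leader) for every disagreement.

    A partition missing from one side is reported with None for that side.
    """
    mismatches = []
    for partition in sorted(set(local) | set(remote)):
        local_leader = local.get(partition)
        remote_leader = remote.get(partition)
        if local_leader != remote_leader:
            mismatches.append((partition, local_leader, remote_leader))
    return mismatches
```

An empty result would mean the local copy agrees with what Helix reports. A non-empty one could be logged or used to skip the reset for the affected partitions until the two sources converge.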

premkumr avatar Dec 09 '21 03:12 premkumr