rocksplicator
Add Upstream Validator
Upstream Validation by Followers
Once in a while, we have noticed that followers do not get the right upstream set up. We believe this happens because messages from the Controller are missed due to race conditions. We will identify and fix those issues separately.
Solution
If a follower's upstream is different from the one specified in the Cluster Map, then set it explicitly. For now, do this only for the followers whose sequence number does not change.
Design
- UpstreamValidator will be a member of RocksDBReplicator
- UpstreamValidator will run a validation task once every N seconds (say N=60)
- Track the LatestSequenceNumber when the validator runs
- Identify dormant followers (no change in sequence number since the previous run)
- Check whether the upstream of each dormant follower matches the cluster map
- Keep track of the run count of the thread
- Reset the upstream address and re-init the client pool if necessary
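The validation pass described by the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FollowerState`, `UpstreamValidatorSketch`, `RunOnce`, and the `ResetFn` callback are hypothetical names, the cluster map is reduced to a plain db-to-upstream lookup, and the periodic scheduling (once every N seconds) is left to the caller.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical snapshot of one follower, as seen by the validator.
struct FollowerState {
  uint64_t latest_seq_no;     // LatestSequenceNumber at the time of the run
  std::string upstream_addr;  // upstream the follower currently pulls from
};

class UpstreamValidatorSketch {
 public:
  // Callback that resets the upstream address (and, in the real system,
  // would also re-init the client pool). Assumed interface.
  using ResetFn =
      std::function<void(const std::string& db, const std::string& upstream)>;

  explicit UpstreamValidatorSketch(ResetFn reset) : reset_(std::move(reset)) {
    assert(reset_ && "a reset callback is required");
  }

  // One validation pass. Returns the dbs whose upstream was reset.
  std::vector<std::string> RunOnce(
      const std::unordered_map<std::string, FollowerState>& followers,
      const std::unordered_map<std::string, std::string>& cluster_map) {
    ++run_count_;
    std::vector<std::string> reset_dbs;
    for (const auto& entry : followers) {
      const std::string& db = entry.first;
      const FollowerState& state = entry.second;

      // Dormant: seen in an earlier run and the sequence number is unchanged.
      const auto prev = last_seq_no_.find(db);
      const bool dormant =
          prev != last_seq_no_.end() && prev->second == state.latest_seq_no;
      last_seq_no_[db] = state.latest_seq_no;
      if (!dormant) continue;

      // Only act when the cluster map names a different upstream.
      const auto expected = cluster_map.find(db);
      if (expected == cluster_map.end()) continue;
      if (expected->second != state.upstream_addr) {
        reset_(db, expected->second);
        reset_dbs.push_back(db);
      }
    }
    return reset_dbs;
  }

  uint64_t run_count() const { return run_count_; }

 private:
  ResetFn reset_;
  std::unordered_map<std::string, uint64_t> last_seq_no_;
  uint64_t run_count_ = 0;
};
```

In this sketch a follower is never reset on the first pass, because there is no earlier sequence number to compare against. A follower that is actively replicating is also left alone even if its upstream disagrees with the cluster map, which mirrors the "only dormant followers" restriction above.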
@premkumr can you please also share the doc for this feature?
What would be the consequence if the shard map content is stale or incorrect for some reason?
Doc ..
We only check and reset the followers that are not replicating. So if the shardmap is stale, we will try to reset only the erroneous followers, and once the correct shardmap is dispatched, it will be re-applied if the errors persist.
I left a couple of questions in the doc just for clarification. It would be good to explain more about the current scheme for setting an upstream and the problem with it.
On this particular method, it seems we already have a resetUpstream method (though it seems unused) which checks the leader by querying Helix directly, but the new upstream validator introduced in this PR checks the shard map file instead (which is another dependency to deal with, and can possibly conflict with the Helix state). Do we need two approaches for checking where the leader is?
which checks the leader by querying Helix directly, but the new upstream validator introduced in this PR checks the shard map file instead ..
The shardmap is actually generated and stored in ZK and dispatched to all the hosts. The Helix API queries ZK directly and could overload it. Even in the current resetUpstream, we have taken care not to overload ZK (also, that path can be hit too often).
Loading from the local disk is simpler and safer. In the next versions, we will also add a comparison between the shardmap returned from the Helix APIs and the local one. We also have a tool (shardmaptool) that traces the history of the shardmaps that were copied to the local disks over time.
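As a rough idea of what the planned comparison between the local shardmap and the Helix view could look like, here is a small sketch. It assumes both sources have already been parsed into a shard-to-leader lookup. `LeaderMap`, `LeaderMismatch`, and `DiffLeaderMaps` are hypothetical names used for illustration only, and no real shardmap file format or Helix API is implied.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical view of a shardmap: shard name -> leader address.
using LeaderMap = std::map<std::string, std::string>;

// One disagreement between the two views. An empty string means the
// shard is missing from that side.
struct LeaderMismatch {
  std::string shard;
  std::string local_leader;
  std::string helix_leader;
};

// Returns every shard whose leader differs between the local shardmap
// and the Helix view, including shards present on only one side.
std::vector<LeaderMismatch> DiffLeaderMaps(const LeaderMap& local,
                                           const LeaderMap& helix) {
  std::vector<LeaderMismatch> out;
  for (const auto& kv : local) {
    const auto it = helix.find(kv.first);
    const std::string other = (it == helix.end()) ? "" : it->second;
    if (other != kv.second) {
      out.push_back({kv.first, kv.second, other});
    }
  }
  for (const auto& kv : helix) {
    if (local.find(kv.first) == local.end()) {
      out.push_back({kv.first, "", kv.second});
    }
  }
  return out;
}
```

A non-empty result would flag the local shardmap as possibly stale. Running such a check infrequently would keep the extra ZK load small.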