
Improve FDB reliability when localities are misconfigured and later corrected

Open xumengpanda opened this issue 5 years ago • 3 comments

If a storage server (SS) lacks a valid locality because of misconfiguration, the replication policy can select replicas that do not satisfy the policy. For example, in three_data_hall mode, if a server is not configured with a data_hall locality, selectReplicas() may create a server team whose size does not equal the replication factor.

Another issue is that an SS may be chosen as the preferred server and fed into selectReplicas() even though no valid team can include it. For example, in three_data_center mode, if a DC has only one server and that server is chosen as the must-have member of a team, selectReplicas() cannot build such a team, and addTeamsBest() may get stuck.
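The stuck case can be illustrated with a toy model. This is only a sketch, not FDB's selectReplicas(): it assumes a simplified policy that needs two servers in each of three DCs, and `can_build_team`, `dcs_needed`, and `per_dc` are hypothetical names introduced for illustration.

```python
from collections import defaultdict


def can_build_team(must_have, servers, dcs_needed=3, per_dc=2):
    """Return True if a team containing `must_have` could satisfy the toy policy.

    `servers` maps a server id to its DC id (None if the locality is missing).
    The policy (an assumption) needs `per_dc` servers in each of `dcs_needed` DCs.
    """
    by_dc = defaultdict(set)
    for sid, dc in servers.items():
        if dc is not None:
            by_dc[dc].add(sid)

    home = servers.get(must_have)
    # The must-have server's own DC has to supply `per_dc` servers.
    if home is None or len(by_dc[home]) < per_dc:
        return False

    # Count the DCs (home DC included) that can supply `per_dc` servers.
    usable = sum(1 for members in by_dc.values() if len(members) >= per_dc)
    return usable >= dcs_needed


servers = {
    "a1": "dc1", "a2": "dc1",
    "b1": "dc2", "b2": "dc2",
    "c1": "dc3", "c2": "dc3",
    "d1": "dc4",  # the only server in its DC
}
seedable = [sid for sid in servers if can_build_team(sid, servers)]
```

Here `d1` is never a viable seed, so a team builder that keeps retrying with `d1` as the must-have server would make no progress, while every other server remains usable.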

Although this happens only under misconfiguration, data distribution (DD) should protect itself against the problem as follows:

  1. If an SS does not have a valid locality configuration under the current replication policy, it should not be used to build teams; it should be treated as always unhealthy. DD should also emit a trace event to notify the system operator.
  2. We should add simulation test cases that cover these situations in both three_data_hall mode and three_data_center mode.
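The first item could look roughly like the sketch below. It is not FDB code: the mapping from mode to required locality keys, the function name `split_by_locality`, and the event fields are assumptions made for illustration, and the "trace event" is modeled as a plain dictionary.

```python
# Assumed mapping from redundancy mode to the locality keys it depends on.
REQUIRED_KEYS = {
    "three_data_hall": ("data_hall",),
    "three_data_center": ("dcid",),
}


def split_by_locality(servers, mode):
    """Return (usable, events) for team building under `mode`.

    `servers` maps a server id to its locality dict. A server that is missing
    a required key is excluded from `usable` (treated as always unhealthy) and
    produces one event describing what is missing.
    """
    required = REQUIRED_KEYS[mode]
    usable, events = [], []
    for sid, locality in servers.items():
        missing = [key for key in required if not locality.get(key)]
        if missing:
            # Hypothetical event shape, standing in for an operator-facing trace event.
            events.append({"Type": "InvalidLocality", "Server": sid, "Missing": missing})
        else:
            usable.append(sid)
    return usable, events


cluster = {
    "ss1": {"data_hall": "hall_a", "dcid": "dc1"},
    "ss2": {"dcid": "dc1"},  # data_hall was never configured
    "ss3": {"data_hall": "hall_b", "dcid": "dc1"},
}
usable, events = split_by_locality(cluster, "three_data_hall")
```

Only the servers in `usable` would be offered to team building. Once the operator corrects the locality of `ss2`, rerunning the check returns it to the usable set.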

xumengpanda, Sep 12 '19 23:09