foundationdb
Improve FDB reliability when localities are misconfigured and later corrected
If a storage server (SS) does not have a valid locality due to misconfiguration, the replication policy can select replicas that do not satisfy the policy. For example, in `three_data_hall` mode, if a server is not configured with a `data_hall` locality, `selectReplicas()` may create a server team whose size is not equal to the replication factor.
Another issue is that an SS may be chosen as the preferred server and passed to `selectReplicas()` even though no valid team can ever include it. For example, in `three_data_center` mode, if a DC has only one server and that server is chosen as the must-have member of a team, `selectReplicas()` will not be able to create such a team, and `addTeamsBest()` may get stuck.
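The stuck case can be illustrated with a toy model. This is only a sketch, not FoundationDB code: the `Server` struct, `serversPerDc()`, and `canAnchorTeam()` are hypothetical names, and the model assumes, purely for illustration, a policy that needs `perDc` servers from every data center that contributes to a team.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified server record; not a FoundationDB type.
struct Server {
    std::string id;
    std::string dc; // data center the server reports in its locality
};

// Counts how many servers each data center has.
inline std::map<std::string, int> serversPerDc(const std::vector<Server>& servers) {
    std::map<std::string, int> counts;
    for (const auto& s : servers) {
        ++counts[s.dc];
    }
    return counts;
}

// Returns true if a team containing `seed` could exist under the toy policy,
// i.e. the seed's data center has at least `perDc` servers.
// Assumption: `seed` is itself one of `servers`.
inline bool canAnchorTeam(const Server& seed, const std::vector<Server>& servers, int perDc) {
    const auto counts = serversPerDc(servers);
    const auto it = counts.find(seed.dc);
    return it != counts.end() && it->second >= perDc;
}
```

With `perDc = 2`, a server that is alone in its data center fails this check, so a team builder that insists on including it would retry without ever succeeding.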
Although this happens only under misconfiguration, DD should protect itself against the problem as follows:
- If an SS does not have a valid locality configuration under the current replication policy, it should not be used in building teams; it should be treated as always unhealthy. A trace event should also be emitted to notify the system operator;
- We should add test cases in simulation to cover these situations in `three_data_hall` mode and `three_data_center` mode.