bookkeeper
[fix] Remove entries from address2Region when a bookie leaves, to get correct rack info
Motivation
We use RegionAwareEnsemblePlacementPolicy in our Pulsar cluster and encountered some unexpected issues in certain situations, e.g. when a broker and a bookie restart concurrently:
- Bookie X joins the cluster for the first time, hits an exception while its region is being resolved, and address2Region records X's region as default-region.
- Bookie X leaves the cluster and is removed from knownBookies, but address2Region retains the entry for bookie X.
- Bookie X's rack info is updated, but onBookieRackChange only updates address2Region for addresses present in knownBookies; therefore, bookie X's region info is not updated.
- Bookie X joins the cluster again. Since address2Region still contains the previous default-region entry, getRegion uses the cached data directly, resulting in an incorrect region.

The incorrect region may cause traffic skew in ensemble selection, causing the bookie's disk to fill up quickly.
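
The sequence above can be reproduced with a small stand-alone model. This is only a sketch under assumptions: the class `StaleRegionDemo`, the `resolver` map, and the use of plain strings for bookie addresses are hypothetical simplifications, and only the names address2Region, knownBookies, getRegion, onBookieRackChange, handleBookiesThatJoined and handleBookiesThatLeft are taken from this description.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the caching behavior described above.
// It is NOT the BookKeeper implementation.
class StaleRegionDemo {
    static final String DEFAULT_REGION = "default-region";

    final Set<String> knownBookies = new HashSet<>();
    final Map<String, String> address2Region = new HashMap<>();
    // Stand-in for the rack/region resolver; a missing entry models a failed lookup.
    final Map<String, String> resolver = new HashMap<>();

    String getRegion(String addr) {
        // A cached value always wins, even if it is stale.
        return address2Region.computeIfAbsent(
                addr, a -> resolver.getOrDefault(a, DEFAULT_REGION));
    }

    void handleBookiesThatJoined(Set<String> joined) {
        for (String addr : joined) {
            knownBookies.add(addr);
            getRegion(addr); // populates address2Region on first sight
        }
    }

    // Behavior before this change: the region cache is left untouched.
    void handleBookiesThatLeft(Set<String> left) {
        knownBookies.removeAll(left);
    }

    void onBookieRackChange(Set<String> changed) {
        for (String addr : changed) {
            // Only bookies that are currently known get their region refreshed.
            if (knownBookies.contains(addr)) {
                address2Region.put(addr, resolver.getOrDefault(addr, DEFAULT_REGION));
            }
        }
    }

    static String reproduce() {
        StaleRegionDemo policy = new StaleRegionDemo();
        String x = "bookie-x:3181";
        policy.handleBookiesThatJoined(Set.of(x)); // 1. lookup fails, default-region cached
        policy.handleBookiesThatLeft(Set.of(x));   // 2. X leaves, cache entry survives
        policy.resolver.put(x, "region-a");        // 3. rack info becomes available ...
        policy.onBookieRackChange(Set.of(x));      //    ... but X is not in knownBookies
        policy.handleBookiesThatJoined(Set.of(x)); // 4. X rejoins, stale entry is reused
        return policy.getRegion(x);
    }

    public static void main(String[] args) {
        System.out.println(reproduce()); // prints "default-region"
    }
}
```

In this model the bookie ends up in default-region even though the resolver already knows its real region, which matches the symptom reported here.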
Changes
We should ensure that when a bookie leaves the cluster, we also clean up the corresponding region information for that bookie in address2Region, so that the correct region can be recorded for the bookie during onBookieRackChange and handleBookiesThatJoined.
To do this, call leftBookies.forEach(address2Region::remove) in handleBookiesThatLeft.
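
A minimal sketch of the effect of this change, using a simplified stand-alone model rather than the real policy class. The class `RegionCacheFixDemo`, the `resolver` map, and the string addresses are assumptions made for illustration; only the leftBookies.forEach(address2Region::remove) call mirrors the actual change.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model; not the BookKeeper implementation.
class RegionCacheFixDemo {
    static final String DEFAULT_REGION = "default-region";

    final Set<String> knownBookies = new HashSet<>();
    final Map<String, String> address2Region = new HashMap<>();
    // Stand-in for the rack/region resolver; a missing entry models a failed lookup.
    final Map<String, String> resolver = new HashMap<>();

    String getRegion(String addr) {
        return address2Region.computeIfAbsent(
                addr, a -> resolver.getOrDefault(a, DEFAULT_REGION));
    }

    void handleBookiesThatJoined(Set<String> joined) {
        for (String addr : joined) {
            knownBookies.add(addr);
            getRegion(addr);
        }
    }

    void handleBookiesThatLeft(Set<String> leftBookies) {
        knownBookies.removeAll(leftBookies);
        // The change proposed here: drop cached regions of departed bookies
        // so that a later join resolves the region again.
        leftBookies.forEach(address2Region::remove);
    }

    static String rejoinAfterFix() {
        RegionCacheFixDemo policy = new RegionCacheFixDemo();
        String x = "bookie-x:3181";
        policy.handleBookiesThatJoined(Set.of(x)); // lookup fails, default-region cached
        policy.handleBookiesThatLeft(Set.of(x));   // cache entry for X is removed
        policy.resolver.put(x, "region-a");        // rack info becomes available
        policy.handleBookiesThatJoined(Set.of(x)); // region is resolved afresh
        return policy.getRegion(x);
    }

    public static void main(String[] args) {
        System.out.println(rejoinAfterFix()); // prints "region-a"
    }
}
```

With the cache entry removed on departure, the rejoining bookie is resolved again instead of inheriting the earlier default-region value.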