[fix] Remove bookie from address2Region when it leaves the cluster, to get correct rack info

Open ethqunzhong opened this issue 5 months ago • 0 comments

Motivation

We use RegionAwareEnsemblePlacementPolicy in our Pulsar cluster and encountered an unexpected issue in some situations (e.g., when a broker and a bookie restart concurrently):

  1. Bookie X joins the cluster for the first time and encounters a region exception, so address2Region records X's region as default-region.
  2. Bookie X leaves the cluster and is removed from knownBookies, but address2Region still retains the entry for bookie X.
  3. Bookie X's rack info is updated. The resulting onBookieRackChange call only updates address2Region for addresses present in knownBookies, so bookie X's region info is not updated.
  4. Bookie X joins the cluster again. Because address2Region still contains the earlier default-region entry, getRegion returns the cached value directly, resulting in an incorrect region.
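
The four steps above can be replayed against a toy model of the cache. This is only a sketch to illustrate the sequence, not the actual BookKeeper code: the class `StaleRegionDemo`, the `resolver` map (a stand-in for whatever resolves a bookie's region), and the address `bookie-x:3181` are assumptions made for illustration. Only the names address2Region, knownBookies, getRegion, onBookieRackChange, handleBookiesThatJoined and handleBookiesThatLeft come from the description.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the region cache described in the steps; NOT the real placement policy. */
public class StaleRegionDemo {
    static final String DEFAULT_REGION = "default-region";

    final Set<String> knownBookies = new HashSet<>();
    final Map<String, String> address2Region = new HashMap<>();
    // Hypothetical stand-in for the rack/region resolver; a missing entry models the failure in step 1.
    final Map<String, String> resolver = new HashMap<>();

    /** Returns the cached region if present; otherwise resolves it and caches the result. */
    String getRegion(String addr) {
        return address2Region.computeIfAbsent(addr, a -> resolver.getOrDefault(a, DEFAULT_REGION));
    }

    void handleBookiesThatJoined(Set<String> joined) {
        for (String addr : joined) {
            knownBookies.add(addr);
            getRegion(addr); // populates the cache on first sight
        }
    }

    /** Behaviour described in step 2: the cache entry is NOT removed. */
    void handleBookiesThatLeft(Set<String> left) {
        knownBookies.removeAll(left);
    }

    /** Behaviour described in step 3: only bookies in knownBookies are refreshed. */
    void onBookieRackChange(Set<String> changed) {
        for (String addr : changed) {
            if (knownBookies.contains(addr)) {
                address2Region.put(addr, resolver.getOrDefault(addr, DEFAULT_REGION));
            }
        }
    }

    /** Replays steps 1 to 4 and returns the region finally reported for bookie X. */
    static String replay() {
        StaleRegionDemo policy = new StaleRegionDemo();
        String x = "bookie-x:3181"; // hypothetical address
        Set<String> onlyX = Collections.singleton(x);

        policy.handleBookiesThatJoined(onlyX); // step 1: resolution fails, default-region is cached
        policy.handleBookiesThatLeft(onlyX);   // step 2: removed from knownBookies only
        policy.resolver.put(x, "region-a");    // rack info is corrected
        policy.onBookieRackChange(onlyX);      // step 3: skipped, X is not in knownBookies
        policy.handleBookiesThatJoined(onlyX); // step 4: cached value is reused
        return policy.getRegion(x);
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints "default-region"
    }
}
```

In this model the corrected region `region-a` is never picked up, because the stale entry from step 1 survives the leave and rejoin.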

This can skew traffic in ensemble selection and cause the affected bookie's disk to fill up quickly.

Changes

When a bookie leaves the cluster, we should also remove that bookie's region information from address2Region, so that the correct region can be resolved for it during onBookieRackChange and handleBookiesThatJoined. Concretely: call leftBookies.forEach(address2Region::remove) in handleBookiesThatLeft.
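
A minimal sketch of the proposed cleanup, using a simplified model rather than the real RegionAwareEnsemblePlacementPolicy. The class `RegionCacheCleanupDemo`, the `resolver` map, and the address `bookie-x:3181` are hypothetical names introduced for illustration; the one line taken from the proposal is `leftBookies.forEach(address2Region::remove)`.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Simplified model of the region cache with the proposed cleanup; NOT the real policy class. */
public class RegionCacheCleanupDemo {
    static final String DEFAULT_REGION = "default-region";

    final Set<String> knownBookies = new HashSet<>();
    final Map<String, String> address2Region = new HashMap<>();
    // Hypothetical stand-in for the rack/region resolver.
    final Map<String, String> resolver = new HashMap<>();

    /** Returns the cached region if present; otherwise resolves it and caches the result. */
    String getRegion(String addr) {
        return address2Region.computeIfAbsent(addr, a -> resolver.getOrDefault(a, DEFAULT_REGION));
    }

    void handleBookiesThatJoined(Set<String> joined) {
        for (String addr : joined) {
            knownBookies.add(addr);
            getRegion(addr);
        }
    }

    void handleBookiesThatLeft(Set<String> leftBookies) {
        knownBookies.removeAll(leftBookies);
        // Proposed change: drop cached regions so a rejoin resolves the region again.
        leftBookies.forEach(address2Region::remove);
    }

    /** Joins, leaves, corrects the rack info, rejoins, and returns the region reported for bookie X. */
    static String replay() {
        RegionCacheCleanupDemo policy = new RegionCacheCleanupDemo();
        String x = "bookie-x:3181"; // hypothetical address
        Set<String> onlyX = Collections.singleton(x);

        policy.handleBookiesThatJoined(onlyX); // first join: default-region is cached
        policy.handleBookiesThatLeft(onlyX);   // leave: cache entry is removed as well
        policy.resolver.put(x, "region-a");    // rack info is corrected while X is away
        policy.handleBookiesThatJoined(onlyX); // rejoin: region is resolved from scratch
        return policy.getRegion(x);
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints "region-a"
    }
}
```

With the entry removed on leave, the rejoin no longer depends on onBookieRackChange having seen the bookie while it was absent.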

ethqunzhong, Sep 12 '24 04:09