
[improve] Back-off policy for decommissionBookie while waiting for ledgers to be replicated

Open Nicklee007 opened this issue 2 years ago • 2 comments

Master Issue: #3338

Motivation

When decommissioning a bookie, we always switch it to read-only first and then wait for some of its data to expire. Some ledgers (roughly 100 to 300) are always left over without being cleaned up, and each of them holds only a little data. When we then run bin/bookkeeper shell decommissionbookie -bookieid to decommission the bookie, the command always hangs in waitForLedgersToBeReplicated() for about 10 minutes without printing any log, even though the znode /ledgers/underreplication/ledgers is emptied within a few seconds and the ledgers finish re-replicating shortly after.

We found that the sleep time is defined as min(ledgers.size() * sleepTimePerLedger (10 s), maxSleepTimeInBetweenChecks (10 min)). With this definition the wait is almost always far too long, and the lack of any output confuses users.
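To make the arithmetic concrete, here is a minimal sketch of the sleep-time formula quoted above. It is not the BookKeeper source: the class, method, and constant names are made up for illustration, and only the two values (10 s per ledger, 10 min cap) come from this issue.

```java
import java.time.Duration;

public class DecommissionSleepEstimate {
    // Values quoted in the issue; the constant names are illustrative only.
    static final Duration SLEEP_PER_LEDGER = Duration.ofSeconds(10);
    static final Duration MAX_SLEEP_BETWEEN_CHECKS = Duration.ofMinutes(10);

    /** min(ledgerCount * sleepTimePerLedger, maxSleepTimeInBetweenChecks). */
    static Duration sleepFor(int ledgerCount) {
        Duration proportional = SLEEP_PER_LEDGER.multipliedBy(ledgerCount);
        return proportional.compareTo(MAX_SLEEP_BETWEEN_CHECKS) < 0
                ? proportional
                : MAX_SLEEP_BETWEEN_CHECKS;
    }

    public static void main(String[] args) {
        // Print the computed sleep for a few ledger counts.
        for (int n : new int[] {5, 60, 100, 300}) {
            System.out.println(n + " ledgers -> sleep " + sleepFor(n).getSeconds() + " s");
        }
    }
}
```

Under these assumptions the cap is reached at 60 ledgers, so with 100 to 300 leftover ledgers the command sleeps the full 10 minutes no matter how quickly replication actually finishes.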

Changes

When the bookie has a lot of data to re-replicate, we still need a back-off policy to protect the ZooKeeper server, so:

  1. Check /ledgers/underreplication/ledgers and /ledgers/underreplication/locks every 10 seconds to see whether the ledgers have finished replicating.
  2. If the auditor is running CheckAllLedgers or another time-consuming operation when we trigger the bookie audit, /ledgers/underreplication/ledgers and /ledgers/underreplication/locks may stay empty for a long time. In that case we need a back-off policy to avoid sending frequent requests to ZooKeeper from validateBookieIsNotPartOfEnsemble.
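The two steps above could be sketched roughly as follows. This is not the actual patch: the ZooKeeper reads and the ensemble check are replaced by plain suppliers, the sleeper is injected so the loop can run without real delays, and every name in the block (including the doubling back-off and its 10 min cap) is an assumption made for illustration.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

public class ReplicationWaitSketch {
    static final long POLL_INTERVAL_MS = 10_000L; // step 1: look every 10 s
    static final long MAX_BACKOFF_MS = 600_000L;  // assumed cap of 10 min

    /** Delay before the next expensive check: doubles per attempt, capped. */
    static long backoffMillis(int attempt) {
        int shift = Math.min(attempt, 16); // keep the shift small and safe
        return Math.min(POLL_INTERVAL_MS << shift, MAX_BACKOFF_MS);
    }

    /**
     * underReplicationEmpty: stand-in for "the underreplication ledgers and
     * locks znodes have no children". bookieStillInEnsembles: stand-in for
     * the expensive ensemble validation. Returns the total time slept in ms.
     */
    static long waitUntilReplicated(BooleanSupplier underReplicationEmpty,
                                    BooleanSupplier bookieStillInEnsembles,
                                    LongConsumer sleeper) {
        int attempt = 0;
        long waited = 0;
        while (true) {
            if (!underReplicationEmpty.getAsBoolean()) {
                // Replication is visibly in progress: cheap fixed-interval poll.
                attempt = 0;
                sleeper.accept(POLL_INTERVAL_MS);
                waited += POLL_INTERVAL_MS;
                continue;
            }
            // Nothing pending: run the expensive check, backing off on failure.
            if (!bookieStillInEnsembles.getAsBoolean()) {
                return waited;
            }
            long delay = backoffMillis(attempt++);
            sleeper.accept(delay);
            waited += delay;
        }
    }

    public static void main(String[] args) {
        // Print the back-off schedule in seconds for the first attempts.
        for (int i = 0; i < 8; i++) {
            System.out.println("attempt " + i + " -> " + backoffMillis(i) / 1000 + " s");
        }
    }
}
```

With these assumed constants the back-off grows as 10, 20, 40, 80, 160, 320 seconds and then stays at 600 seconds, so a busy auditor leads to progressively fewer expensive checks while the normal case is still detected within about 10 seconds.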

Nicklee007 avatar Jun 16 '22 03:06 Nicklee007

@eolivelli @dlg99 Could you help check whether the logic is correct?

Nicklee007 avatar Jun 27 '22 14:06 Nicklee007

Fix old workflow; please see #3455 for details.

StevenLuMT avatar Aug 24 '22 08:08 StevenLuMT

Closing this since it has been open for a long time without any updates. Feel free to reopen it if you want to continue.

zymap avatar Dec 04 '23 03:12 zymap