[improve] decommissionBookie waiting for ledgers to be replicated back off policy
Master Issue: #3338
Motivation
When decommissioning a bookie, we always switch the bookie to read-only first and then wait for some of its data to expire. Even so, some ledgers (roughly 100 to 300) are always left over without being cleaned up, and these leftover ledgers hold only a little data. When we then run bin/bookkeeper shell decommissionbookie -bookieid to decommission the bookie, the command always blocks in waitForLedgersToBeReplicated() for about 10 minutes without printing any log. Meanwhile, the znode /ledgers/underreplication/ledgers is emptied within a few seconds, which means the ledgers have already been re-replicated. The sleep time is defined as min(ledgers.size() * sleepTimePerLedger, maxSleepTimeInBetweenChecks), where sleepTimePerLedger is 10 s and maxSleepTimeInBetweenChecks is 10 min. With 100 or more ledgers, the first term already exceeds 10 min, so the command sleeps for the full 10 min. That wait is far too long, and the absence of any output confuses the user.
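To make the arithmetic concrete, here is a minimal sketch of the sleep-time formula quoted above. The class and method names are hypothetical and are not taken from the BookKeeper source; only the two constants (10 s per ledger, 10 min cap) come from the description.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper that mirrors the formula from the description:
// min(ledgers.size() * sleepTimePerLedger, maxSleepTimeInBetweenChecks)
class SleepTimeSketch {
    // 10 s per ledger, as stated in the description.
    static final long SLEEP_TIME_PER_LEDGER_MS = TimeUnit.SECONDS.toMillis(10);
    // 10 min upper bound, as stated in the description.
    static final long MAX_SLEEP_BETWEEN_CHECKS_MS = TimeUnit.MINUTES.toMillis(10);

    static long sleepMillis(int ledgerCount) {
        return Math.min(ledgerCount * SLEEP_TIME_PER_LEDGER_MS, MAX_SLEEP_BETWEEN_CHECKS_MS);
    }

    public static void main(String[] args) {
        // Print the resulting sleep time for a few ledger counts.
        for (int n : new int[] {10, 60, 100, 300}) {
            System.out.println(n + " ledgers -> sleep " + sleepMillis(n) / 1000 + " s");
        }
    }
}
```

Any ledger count of 60 or more hits the 10 min cap, which matches the roughly 10 minute wait observed with 100 to 300 leftover ledgers.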
Changes
When a bookie has a lot of data to re-replicate, we still need a back-off policy to protect the ZooKeeper server, so:
- Check /ledgers/underreplication/ledgers and /ledgers/underreplication/locks every 10 seconds to see whether ledger replication has completed.
- When we trigger the bookie audit, the auditor may already be running CheckAllLedgers or another time-consuming operation. In that case /ledgers/underreplication/ledgers and /ledgers/underreplication/locks stay empty for a long time, so we need a back-off policy to avoid sending frequent ZooKeeper requests for validateBookieIsNotPartOfEnsemble.
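The two points above could be combined roughly as follows. This is only a sketch under stated assumptions, not the code in this PR: the class, the method names, the doubling back-off, and the 10 min cap are all hypothetical, and the two suppliers stand in for the cheap znode check and the expensive validateBookieIsNotPartOfEnsemble call.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

// Hypothetical sketch: cheap poll every 10 s, expensive validation with back-off.
class DecommissionWaitSketch {
    static final long POLL_INTERVAL_MS = 10_000L;      // "every 10 sec" from the description
    static final long MAX_BACKOFF_MS = 10 * 60_000L;   // assumed cap, not from the PR

    // Doubles the delay between expensive validations, up to the cap.
    static long nextBackoff(long currentMs) {
        return Math.min(currentMs * 2, MAX_BACKOFF_MS);
    }

    // Returns how many expensive validations were needed before success.
    static int waitForDecommission(BooleanSupplier underReplicationEmpty,
                                   BooleanSupplier bookieNotInAnyEnsemble,
                                   LongConsumer sleeper) {
        long backoffMs = POLL_INTERVAL_MS;
        long sinceValidationMs = backoffMs; // allow a validation on the first pass
        int validations = 0;
        while (true) {
            if (!underReplicationEmpty.getAsBoolean()) {
                // Replication is still in progress: report it and reset the back-off.
                System.out.println("Ledgers are still being re-replicated, waiting...");
                backoffMs = POLL_INTERVAL_MS;
                sinceValidationMs = backoffMs;
            } else if (sinceValidationMs >= backoffMs) {
                // Queue looks empty: run the expensive check, backing off on failure.
                validations++;
                if (bookieNotInAnyEnsemble.getAsBoolean()) {
                    return validations;
                }
                backoffMs = nextBackoff(backoffMs);
                sinceValidationMs = 0;
            }
            sleeper.accept(POLL_INTERVAL_MS);
            sinceValidationMs += POLL_INTERVAL_MS;
        }
    }

    public static void main(String[] args) {
        // Simulated run: the queue is always empty, validation succeeds on the 3rd try.
        int[] calls = {0};
        long[] sleptMs = {0};
        int n = waitForDecommission(() -> true, () -> ++calls[0] >= 3, ms -> sleptMs[0] += ms);
        System.out.println(n + " validations, " + sleptMs[0] / 1000 + " s of simulated sleep");
    }
}
```

In a real tool the sleeper would call Thread.sleep. Passing it in keeps the sketch testable without waiting, and the progress message addresses the "no log printed" complaint from the motivation.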
@eolivelli @dlg99 Could you help check whether the logic is correct?
Fix old workflow; please see #3455 for details.
Closing this since it has been open for a long time without any updates. Feel free to reopen it if you want to continue.