
[improve] decommissionBookie always waits too long after ledgers have been re-replicated

Open Nicklee007 opened this issue 2 years ago • 2 comments

improve

When decommissioning a bookie, we always switch it to read-only first and then wait for most of its data to expire. Some ledgers (roughly 100 to 300) are always left over and not cleaned up, and each of them holds only a little data. When we then run bin/bookkeeper shell decommissionbookie -bookieid to decommission the bookie, the command always hangs for about 10 minutes without printing any log, even though the znode /ledgers/underreplication/ledgers is emptied within a few seconds, which means the ledgers have already been re-replicated.

To Reproduce

Steps to reproduce the behavior:

  1. Switch the bookie to read-only.
  2. Wait until most ledgers have expired.
  3. Stop the bookie and run bin/bookkeeper shell decommissionbookie -bookieid
  4. The command waits for about 10 minutes, even though the remaining ledgers, which hold little data, have already been re-replicated. After the log line Count of Ledgers which need to be rereplicated: is printed, nothing else is printed for those 10 minutes.

Expected behavior

The wait should not be this long, and the command should log what it is doing while it waits.

Nicklee007 avatar Jun 16 '22 03:06 Nicklee007

I think the wait is related to https://github.com/apache/bookkeeper/blob/eadbdd4b6bfeef9924a3ff2c59fc3718cf3dc06b/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/BookKeeperAdmin.java#L1623

You can make this time configurable via the decommission command flag / add logging.

dlg99 avatar Jun 21 '22 21:06 dlg99

> You can make this time configurable via the decommission command flag / add logging.

@dlg99 Yep, the wait time is determined by maxSleepTimeInBetweenChecks. If we add a command parameter to change maxSleepTimeInBetweenChecks, users may not be able to predict how long re-replication will take. There is a risk: if a user sets a small maxSleepTimeInBetweenChecks while the auditor is busy with a time-consuming operation such as checkAllLedger and cannot audit the bookie immediately, the command will check areEntriesOfLedgerStoredInTheBookie through zk too frequently and hurt the zk server's performance.
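To make the trade-off concrete, here is a rough sketch of how a capped sleep between checks could be computed. It is not the actual BookKeeperAdmin code: the class and method names are hypothetical, the 10-second per-ledger figure is an illustrative assumption, and only the 10-minute cap matches the wait reported in this issue.

```java
// Hypothetical sketch; names and the per-ledger value are assumptions,
// not taken from BookKeeperAdmin.
final class DecommissionSleepSketch {
    // Cap on the sleep between checks (matches the ~10 min wait seen in this issue).
    static final long MAX_SLEEP_TIME_IN_BETWEEN_CHECKS_MS = 10 * 60 * 1000L;
    // Assumed sleep budget per pending ledger (illustrative only).
    static final long SLEEP_TIME_PER_LEDGER_MS = 10 * 1000L;

    // Sleep grows with the number of pending ledgers but never exceeds the cap.
    static long sleepTimeForThisCheck(int pendingLedgers, long perLedgerMs, long maxMs) {
        return Math.min((long) pendingLedgers * perLedgerMs, maxMs);
    }

    // How many zk-backed checks per hour a given sleep interval would produce.
    static long checksPerHour(long sleepMs) {
        return 3_600_000L / sleepMs;
    }

    public static void main(String[] args) {
        for (int ledgers : new int[] {5, 60, 100, 300}) {
            long ms = sleepTimeForThisCheck(
                    ledgers, SLEEP_TIME_PER_LEDGER_MS, MAX_SLEEP_TIME_IN_BETWEEN_CHECKS_MS);
            System.out.println(ledgers + " ledgers -> sleep " + (ms / 1000) + " s, "
                    + checksPerHour(ms) + " checks/hour");
        }
    }
}
```

Under these assumed values, any backlog of 60 or more ledgers hits the cap, so 100 to 300 leftover ledgers would always sleep the full 10 minutes. Lowering the cap shortens the wait but raises the number of checks per hour against zk, which is the risk described above.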

PR #3339 checks whether /ledgers/underreplication/ledgers and /ledgers/underreplication/locks are empty to determine whether re-replication has completed, and backs off while the auditor is running CheckAllLedgers or another time-consuming operation. Could you help review the PR? Thanks.
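The idea can be sketched roughly as follows. This is not the code from the PR: ClusterState and waitUntilReplicated are hypothetical names, the three probes stand in for whatever zk reads the real change performs, and the doubling backoff is just one possible policy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

// Hypothetical sketch of "finish early when the underreplication znodes are
// empty, back off while the auditor is busy". Not the actual PR code.
final class ReplicationWaitSketch {
    /** Assumed view of cluster state; each method stands in for a zk read. */
    interface ClusterState {
        boolean auditorBusy();                // e.g. running CheckAllLedgers
        boolean underreplicationQueueEmpty(); // /ledgers/underreplication/ledgers has no entries
        boolean noReplicationLocksHeld();     // /ledgers/underreplication/locks has no entries
    }

    /** Returns the number of checks performed before replication was seen as complete. */
    static int waitUntilReplicated(ClusterState state, long baseDelayMs, long maxDelayMs,
                                   LongConsumer sleeper) {
        long delayMs = baseDelayMs;
        int checks = 0;
        while (true) {
            checks++;
            if (state.auditorBusy()) {
                // An empty queue is not conclusive yet: the auditor may not have
                // published this bookie's ledgers. Back off to spare zk.
                sleeper.accept(delayMs);
                delayMs = Math.min(delayMs * 2, maxDelayMs);
                continue;
            }
            if (state.underreplicationQueueEmpty() && state.noReplicationLocksHeld()) {
                return checks; // nothing queued and nothing in flight
            }
            delayMs = baseDelayMs; // auditor is idle again, poll at the base rate
            sleeper.accept(delayMs);
        }
    }

    public static void main(String[] args) {
        // Scripted state: already idle and fully replicated, so no sleep is needed.
        ClusterState done = new ClusterState() {
            public boolean auditorBusy() { return false; }
            public boolean underreplicationQueueEmpty() { return true; }
            public boolean noReplicationLocksHeld() { return true; }
        };
        List<Long> sleeps = new ArrayList<>();
        int checks = waitUntilReplicated(done, 1000L, 60_000L, ms -> sleeps.add(ms));
        System.out.println("checks=" + checks + ", sleeps=" + sleeps);
    }
}
```

Passing the sleeper in as a parameter keeps the sketch testable without real delays; a real command would presumably call Thread.sleep there and log each wait, which would also address the missing log output reported above.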

Nicklee007 avatar Jun 22 '22 02:06 Nicklee007