bookkeeper
[improve] decommissionBookie always waits too long after ledger re-replication has completed
improve
When decommissioning a bookie, we always change the bookie status to read-only first and then wait for data to expire. However, some ledgers (roughly 100 to 300) are always left over without being cleaned up, and these leftover ledgers hold only a little data. When we run bin/bookkeeper shell decommissionbookie -bookieid
to decommission the bookie, the command always hangs for about 10 minutes without printing any log, even though we can see that the znode /ledgers/underreplication/ledgers is cleaned up within a few seconds, at which point the ledgers have been re-replicated.
To Reproduce
Steps to reproduce the behavior:
- Change the bookie status to read-only;
- Wait for most ledgers to expire;
- Stop the bookie and run
bin/bookkeeper shell decommissionbookie -bookieid
- Observe that the command waits about 10 minutes, even though the remaining ledgers, which hold little data, have already been replicated. After the log line
Count of Ledgers which need to be rereplicated:
is printed, nothing else is logged during those 10 minutes.
Expected behavior
The wait should not be this long, and the command should log what it is doing while it waits.
I think the wait is related to https://github.com/apache/bookkeeper/blob/eadbdd4b6bfeef9924a3ff2c59fc3718cf3dc06b/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/BookKeeperAdmin.java#L1623
You can make this time configurable via the decommission command flag / add logging.
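To illustrate why 100 to 300 small ledgers could add up to a 10 minute wait, here is a minimal sketch of a sleep computation that scales with the ledger count and is capped by a maximum. The 10 second per-ledger value, the class name, and the `maxSleepMs` parameter are assumptions made for illustration; they are not taken verbatim from BookKeeperAdmin, and no such command flag exists yet.

```java
public class DecommissionWait {
    // Assumption for illustration: 10 seconds of sleep per pending ledger.
    static final long SLEEP_PER_LEDGER_MS = 10_000L;

    /** Sleep before the next check: proportional to the ledger count, capped at maxSleepMs. */
    static long sleepTimeMs(int pendingLedgers, long maxSleepMs) {
        return Math.min((long) pendingLedgers * SLEEP_PER_LEDGER_MS, maxSleepMs);
    }

    public static void main(String[] args) {
        long tenMinutesMs = 10 * 60 * 1000L; // cap matching the ~10 min wait in the report
        for (int ledgers : new int[] {5, 100, 300}) {
            // Logging the planned wait tells the operator why the command is idle.
            System.out.printf("Ledgers pending: %d, next check in %d s%n",
                    ledgers, sleepTimeMs(ledgers, tenMinutesMs) / 1000);
        }
    }
}
```

Under these assumptions, any count of 60 or more ledgers hits the cap, so a lower (hypothetical) `maxSleepMs` would shorten the wait, and logging the planned sleep would explain the silence.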
@dlg99
Yep, the wait time is governed by maxSleepTimeInBetweenChecks. If we add a command parameter to change maxSleepTimeInBetweenChecks, users may not be able to predict how long re-replication will take. There is a risk if a user sets a small maxSleepTimeInBetweenChecks while the auditor is busy with a time-consuming operation such as checkAllLedger and cannot audit the bookie immediately: the command would then run the areEntriesOfLedgerStoredInTheBookie check against zk too frequently and hurt zk server performance.
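A rough back-of-the-envelope sketch of that risk, assuming each check performs one metadata read against zk per pending ledger (an assumption for illustration, not a measured figure; the class and method names are hypothetical):

```java
public class ZkReadEstimate {
    /** Metadata reads per hour if every pending ledger is checked once per interval. */
    static long readsPerHour(int pendingLedgers, long intervalSeconds) {
        if (intervalSeconds <= 0) {
            throw new IllegalArgumentException("intervalSeconds must be positive");
        }
        long checksPerHour = 3600 / intervalSeconds;
        return checksPerHour * pendingLedgers;
    }

    public static void main(String[] args) {
        int ledgers = 300; // upper end of the leftover ledger count reported above
        System.out.printf("interval 600 s: %d reads/hour%n", readsPerHour(ledgers, 600));
        System.out.printf("interval 5 s:   %d reads/hour%n", readsPerHour(ledgers, 5));
    }
}
```

With 300 pending ledgers, moving from a 600 second interval to a 5 second interval raises the estimate from 1800 to 216000 reads per hour, which is why a very small interval is risky while the auditor is busy.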
PR #3339 checks whether /ledgers/underreplication/ledgers and /ledgers/underreplication/locks are empty to determine whether re-replication has completed, and backs off when the auditor is running CheckAllLedgers or another time-consuming operation.
Could you help review the PR? Thanks.
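The idea can be sketched roughly as follows. This is not the code from the PR: the `listChildren` function is a hypothetical stand-in for a ZooKeeper child listing, the class and method names are made up for illustration, and the doubling backoff is a plain exponential backoff rather than the auditor-aware backoff the PR describes.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

public class UnderreplicationCheck {
    static final String URL_LEDGERS = "/ledgers/underreplication/ledgers";
    static final String URL_LOCKS = "/ledgers/underreplication/locks";

    /** True when neither path has children, i.e. nothing is queued or being replicated. */
    static boolean replicationIdle(Function<String, List<String>> listChildren) {
        return listChildren.apply(URL_LEDGERS).isEmpty()
                && listChildren.apply(URL_LOCKS).isEmpty();
    }

    /** Next sleep: double the previous one, capped at maxMs. */
    static long nextBackoffMs(long currentMs, long maxMs) {
        return Math.min(currentMs * 2, maxMs);
    }

    /** Polls until idle; returns the number of checks made, or -1 on give-up or interrupt. */
    static int waitUntilIdle(Function<String, List<String>> listChildren,
                             long initialMs, long maxMs, int maxChecks) {
        long sleepMs = initialMs;
        for (int check = 1; check <= maxChecks; check++) {
            if (replicationIdle(listChildren)) {
                return check;
            }
            System.out.printf("check %d: replication in progress, sleeping %d ms%n", check, sleepMs);
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return -1;
            }
            sleepMs = nextBackoffMs(sleepMs, maxMs);
        }
        return -1;
    }

    public static void main(String[] args) {
        // Fake listing for the demo: both paths are already empty.
        Function<String, List<String>> empty = path -> Collections.<String>emptyList();
        System.out.println("checks needed: " + waitUntilIdle(empty, 100, 1000, 5));
    }
}
```

Checking the two znodes first lets the command return as soon as replication is done, while the growing sleep keeps the load on zk low when replication is still in progress.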