Do not skip opened ledger in repair not adhering placement ledger
Motivation
In order to completely fix not adhering placement ledgers, also repair the opened ledger in feature "auto recover support repaired not adhering placement ledger"
Changes
remove "if (!metadata.isClosed())" in two function.
It is ok to recover opened ledger because the ledger would become fence when do recovery.
Master Issue: #3971
@horizonzy can you take a look of this ?
@horizonzy Please help take a look, thanks.
It is ok to recover opened ledger because the ledger would become fence when do recovery.
I didn't notice this logic, can you help describe it?
I didn't notice this logic, can you help describe it?
When do rereplicate, if foundOpenFragments = true, it goto deferLedgerLockRelease(). The ledger would be fenced. And then wait openLedgerRereplicationGracePeriod(default 30s) to trigger rereplicate again. I guess we are able to recover opened ledger with this logic?
https://github.com/apache/bookkeeper/blob/c8c593ee6d3e1f8970bc90b4b6785ff405c81cc4/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationWorker.java#L461-L463
I didn't notice this logic, can you help describe it?
When do rereplicate, if foundOpenFragments = true, it goto deferLedgerLockRelease(). The ledger would be fenced. And then wait openLedgerRereplicationGracePeriod(default 30s) to trigger rereplicate again. I guess we are able to recover opened ledger with this logic?
https://github.com/apache/bookkeeper/blob/c8c593ee6d3e1f8970bc90b4b6785ff405c81cc4/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationWorker.java#L461-L463
We'd better skip the last ensemble of the current ledger, as it may be writing data. Let's only copy the ensembles that have already been written before it.
I didn't notice this logic, can you help describe it?
When do rereplicate, if foundOpenFragments = true, it goto deferLedgerLockRelease(). The ledger would be fenced. And then wait openLedgerRereplicationGracePeriod(default 30s) to trigger rereplicate again. I guess we are able to recover opened ledger with this logic? https://github.com/apache/bookkeeper/blob/c8c593ee6d3e1f8970bc90b4b6785ff405c81cc4/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationWorker.java#L461-L463
We'd better skip the last ensemble of the current ledger, as it may be writing data. Let's only copy the ensembles that have already been written before it.
+1
@horizonzy @hangc0276 @wenbingshen The code has been modified to skip the last ensemble of the current ledger. Can you help review again? To avoid fence a lot of ledgers, actually some opened ledgers still can not be recover. Maybe we can add a config to enableForceRecoverOpenLedger, I think in some scene we just fence a little of ledgers, such as only recover "__change_event" topic. what do you think?
@horizonzy @hangc0276 @wenbingshen The code has been modified to skip the last ensemble of the current ledger. Can you help review again? To avoid fence a lot of ledgers, actually some opened ledgers still can not be recover. Maybe we can add a config to enableForceRecoverOpenLedger, I think in some scene we just fence a little of ledgers, such as only recover "__change_event" topic. what do you think?
All ledgers are equal in the bookie. For the __change_event topic, the rack allocation cannot be satisfied. I think we need to do some investigations in the broker.
@horizonzy @hangc0276 @wenbingshen The code has been modified to skip the last ensemble of the current ledger. Can you help review again? To avoid fence a lot of ledgers, actually some opened ledgers still can not be recover. Maybe we can add a config to enableForceRecoverOpenLedger, I think in some scene we just fence a little of ledgers, such as only recover "__change_event" topic. what do you think?
All ledgers are equal in the bookie. For the
__change_eventtopic, the rack allocation cannot be satisfied.
+1.
I feel that this change is quite complicated. Can I elaborate on it, do I need to write a BP?
I feel that this change is quite complicated. Can I elaborate on it, do I need to write a BP?
you can.
Fencing is not exactly a cheap operation for application using BK. E.g. in case of BK as a WAL it may cause app crash/restart/recovery from WAL etc, in other cases rolling restart/upgrade of bookies may result in a lot of smaller ledgers than anticipated (growth of metadata size).
There is a config parameter https://github.com/apache/bookkeeper/blob/b8cc1fb10a45073ad38ca26b566f01afb3a31f99/conf/bk_server.conf#L1063-L1065 that you can tune down for your usecase. I think it should work there though I don't remember all intricacies of replication worker.