bookkeeper Do not skip opened ledger in repair not adhering placement ledger

Motivation

In order to completely fix not adhering placement ledgers, also repair the opened ledger in feature "auto recover support repaired not adhering placement ledger"

Changes

remove "if (!metadata.isClosed())" in two function.

It is ok to recover opened ledger because the ledger would become fence when do recovery.

Master Issue: #3971

Jun 05 '23 07:06 TakaHiR07

@horizonzy can you take a look of this ?

Jun 06 '23 07:06 TakaHiR07

@horizonzy Please help take a look, thanks.

Jun 13 '23 02:06 hangc0276

It is ok to recover opened ledger because the ledger would become fence when do recovery.

I didn't notice this logic, can you help describe it?

Jun 14 '23 02:06 horizonzy

I didn't notice this logic, can you help describe it?

When do rereplicate, if foundOpenFragments = true, it goto deferLedgerLockRelease(). The ledger would be fenced. And then wait openLedgerRereplicationGracePeriod(default 30s) to trigger rereplicate again. I guess we are able to recover opened ledger with this logic?

https://github.com/apache/bookkeeper/blob/c8c593ee6d3e1f8970bc90b4b6785ff405c81cc4/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationWorker.java#L461-L463

Jun 14 '23 03:06 TakaHiR07

I didn't notice this logic, can you help describe it?

When do rereplicate, if foundOpenFragments = true, it goto deferLedgerLockRelease(). The ledger would be fenced. And then wait openLedgerRereplicationGracePeriod(default 30s) to trigger rereplicate again. I guess we are able to recover opened ledger with this logic?

https://github.com/apache/bookkeeper/blob/c8c593ee6d3e1f8970bc90b4b6785ff405c81cc4/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationWorker.java#L461-L463

We'd better skip the last ensemble of the current ledger, as it may be writing data. Let's only copy the ensembles that have already been written before it.

Jun 14 '23 07:06 horizonzy

I didn't notice this logic, can you help describe it?

When do rereplicate, if foundOpenFragments = true, it goto deferLedgerLockRelease(). The ledger would be fenced. And then wait openLedgerRereplicationGracePeriod(default 30s) to trigger rereplicate again. I guess we are able to recover opened ledger with this logic? https://github.com/apache/bookkeeper/blob/c8c593ee6d3e1f8970bc90b4b6785ff405c81cc4/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationWorker.java#L461-L463

We'd better skip the last ensemble of the current ledger, as it may be writing data. Let's only copy the ensembles that have already been written before it.

+1

Jun 15 '23 07:06 wenbingshen

@horizonzy @hangc0276 @wenbingshen The code has been modified to skip the last ensemble of the current ledger. Can you help review again? To avoid fence a lot of ledgers, actually some opened ledgers still can not be recover. Maybe we can add a config to enableForceRecoverOpenLedger, I think in some scene we just fence a little of ledgers, such as only recover "__change_event" topic. what do you think?

Jun 16 '23 06:06 TakaHiR07

@horizonzy @hangc0276 @wenbingshen The code has been modified to skip the last ensemble of the current ledger. Can you help review again? To avoid fence a lot of ledgers, actually some opened ledgers still can not be recover. Maybe we can add a config to enableForceRecoverOpenLedger, I think in some scene we just fence a little of ledgers, such as only recover "__change_event" topic. what do you think?

All ledgers are equal in the bookie. For the __change_event topic, the rack allocation cannot be satisfied. I think we need to do some investigations in the broker.

Jun 19 '23 02:06 wenbingshen

@horizonzy @hangc0276 @wenbingshen The code has been modified to skip the last ensemble of the current ledger. Can you help review again? To avoid fence a lot of ledgers, actually some opened ledgers still can not be recover. Maybe we can add a config to enableForceRecoverOpenLedger, I think in some scene we just fence a little of ledgers, such as only recover "__change_event" topic. what do you think?

All ledgers are equal in the bookie. For the __change_event topic, the rack allocation cannot be satisfied.

+1.

Jun 19 '23 08:06 horizonzy

I feel that this change is quite complicated. Can I elaborate on it, do I need to write a BP?

Jul 03 '23 08:07 StevenLuMT

I feel that this change is quite complicated. Can I elaborate on it, do I need to write a BP?

you can.

Jul 05 '23 11:07 TakaHiR07

Fencing is not exactly a cheap operation for application using BK. E.g. in case of BK as a WAL it may cause app crash/restart/recovery from WAL etc, in other cases rolling restart/upgrade of bookies may result in a lot of smaller ledgers than anticipated (growth of metadata size).

There is a config parameter https://github.com/apache/bookkeeper/blob/b8cc1fb10a45073ad38ca26b566f01afb3a31f99/conf/bk_server.conf#L1063-L1065 that you can tune down for your usecase. I think it should work there though I don't remember all intricacies of replication worker.

May 30 '24 19:05 dlg99