bookkeeper icon indicating copy to clipboard operation
bookkeeper copied to clipboard

Do not skip opened ledger in repair not adhering placement ledger

Open TakaHiR07 opened this issue 2 years ago • 12 comments

Motivation

In order to completely fix not adhering placement ledgers, also repair the opened ledger in feature "auto recover support repaired not adhering placement ledger"

Changes

remove "if (!metadata.isClosed())" in two function.

It is ok to recover opened ledger because the ledger would become fence when do recovery.

Master Issue: #3971

TakaHiR07 avatar Jun 05 '23 07:06 TakaHiR07

@horizonzy can you take a look of this ?

TakaHiR07 avatar Jun 06 '23 07:06 TakaHiR07

@horizonzy Please help take a look, thanks.

hangc0276 avatar Jun 13 '23 02:06 hangc0276

It is ok to recover opened ledger because the ledger would become fence when do recovery.

I didn't notice this logic, can you help describe it?

horizonzy avatar Jun 14 '23 02:06 horizonzy

I didn't notice this logic, can you help describe it?

When do rereplicate, if foundOpenFragments = true, it goto deferLedgerLockRelease(). The ledger would be fenced. And then wait openLedgerRereplicationGracePeriod(default 30s) to trigger rereplicate again. I guess we are able to recover opened ledger with this logic?

https://github.com/apache/bookkeeper/blob/c8c593ee6d3e1f8970bc90b4b6785ff405c81cc4/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationWorker.java#L461-L463

TakaHiR07 avatar Jun 14 '23 03:06 TakaHiR07

I didn't notice this logic, can you help describe it?

When do rereplicate, if foundOpenFragments = true, it goto deferLedgerLockRelease(). The ledger would be fenced. And then wait openLedgerRereplicationGracePeriod(default 30s) to trigger rereplicate again. I guess we are able to recover opened ledger with this logic?

https://github.com/apache/bookkeeper/blob/c8c593ee6d3e1f8970bc90b4b6785ff405c81cc4/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationWorker.java#L461-L463

We'd better skip the last ensemble of the current ledger, as it may be writing data. Let's only copy the ensembles that have already been written before it.

horizonzy avatar Jun 14 '23 07:06 horizonzy

I didn't notice this logic, can you help describe it?

When do rereplicate, if foundOpenFragments = true, it goto deferLedgerLockRelease(). The ledger would be fenced. And then wait openLedgerRereplicationGracePeriod(default 30s) to trigger rereplicate again. I guess we are able to recover opened ledger with this logic? https://github.com/apache/bookkeeper/blob/c8c593ee6d3e1f8970bc90b4b6785ff405c81cc4/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationWorker.java#L461-L463

We'd better skip the last ensemble of the current ledger, as it may be writing data. Let's only copy the ensembles that have already been written before it.

+1

wenbingshen avatar Jun 15 '23 07:06 wenbingshen

@horizonzy @hangc0276 @wenbingshen The code has been modified to skip the last ensemble of the current ledger. Can you help review again? To avoid fence a lot of ledgers, actually some opened ledgers still can not be recover. Maybe we can add a config to enableForceRecoverOpenLedger, I think in some scene we just fence a little of ledgers, such as only recover "__change_event" topic. what do you think?

TakaHiR07 avatar Jun 16 '23 06:06 TakaHiR07

@horizonzy @hangc0276 @wenbingshen The code has been modified to skip the last ensemble of the current ledger. Can you help review again? To avoid fence a lot of ledgers, actually some opened ledgers still can not be recover. Maybe we can add a config to enableForceRecoverOpenLedger, I think in some scene we just fence a little of ledgers, such as only recover "__change_event" topic. what do you think?

All ledgers are equal in the bookie. For the __change_event topic, the rack allocation cannot be satisfied. I think we need to do some investigations in the broker.

wenbingshen avatar Jun 19 '23 02:06 wenbingshen

@horizonzy @hangc0276 @wenbingshen The code has been modified to skip the last ensemble of the current ledger. Can you help review again? To avoid fence a lot of ledgers, actually some opened ledgers still can not be recover. Maybe we can add a config to enableForceRecoverOpenLedger, I think in some scene we just fence a little of ledgers, such as only recover "__change_event" topic. what do you think?

All ledgers are equal in the bookie. For the __change_event topic, the rack allocation cannot be satisfied.

+1.

horizonzy avatar Jun 19 '23 08:06 horizonzy

I feel that this change is quite complicated. Can I elaborate on it, do I need to write a BP?

StevenLuMT avatar Jul 03 '23 08:07 StevenLuMT

I feel that this change is quite complicated. Can I elaborate on it, do I need to write a BP?

you can.

TakaHiR07 avatar Jul 05 '23 11:07 TakaHiR07

Fencing is not exactly a cheap operation for application using BK. E.g. in case of BK as a WAL it may cause app crash/restart/recovery from WAL etc, in other cases rolling restart/upgrade of bookies may result in a lot of smaller ledgers than anticipated (growth of metadata size).

There is a config parameter https://github.com/apache/bookkeeper/blob/b8cc1fb10a45073ad38ca26b566f01afb3a31f99/conf/bk_server.conf#L1063-L1065 that you can tune down for your usecase. I think it should work there though I don't remember all intricacies of replication worker.

dlg99 avatar May 30 '24 19:05 dlg99