bookkeeper icon indicating copy to clipboard operation
bookkeeper copied to clipboard

[AutoRecovery] keep rereplicate a ledger which maybe deleted

Open TakaHiR07 opened this issue 2 years ago • 4 comments

BUG REPORT

Describe the bug

Our Production pulsar cluster is multiple nodes with E-Qw-Qa(3-3-2), enabling auto-recovery by "./bin/bookkeeper shell autorecovery -enable", bookkeeper version is 4.14.1 . Now one bookie server is down, and cluster do autoRecovery. However, there is a ledger can not read by the other 2 ensemble, the error is both : Ledger 1294 not found (It seems the ledger has been deleted)

[BookieReadThreadPool-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.proto.ReadLacProcessorV3 - No ledger found while performing readLac from ledger: 1294
org.apache.bookkeeper.bookie.Bookie$NoLedgerException: Ledger 1294 not found
        at org.apache.bookkeeper.bookie.LedgerDescriptor.createReadOnly(LedgerDescriptor.java:52) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
        at org.apache.bookkeeper.bookie.HandleFactoryImpl.getReadOnlyHandle(HandleFactoryImpl.java:61) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]

But the ReplicationWorker still continue to try to rereplicate this ledger, and keep failed. According to the following log, it throw BKNotEnoughBookiesException, therefore ReplicationWorker#run would keep running, keep replicate a can-not-replicated ledger. The result is generating too much recovery read request to the other 2 ensemble bookie, affect the normal read request.

[BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.client.PendingReadLacOp - While readLac ledger: 1294 did not hear success 
responses from all of ensemble
[ReplicationWorker] INFO  org.apache.bookkeeper.replication.ReplicationWorker - BKReadException while rereplicating ledger 1294. Enough Bookies might not have available So, no harm to continue
[BookieReadThreadPool-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.proto.ReadLacProcessorV3 - IOException while trying to read last entry: 1294
org.apache.bookkeeper.bookie.Bookie$NoEntryException: Entry -1 not found in 1294

The zkmetadata has ledger 1294 under /ledgers/underreplication/ledgers

企业微信截图_40feaa39-c1d5-4ce1-ae37-6a7f1eadd339

To Reproduce

Expected behavior

Should it skip those deleted ledger when doing recovery ?

TakaHiR07 avatar Jun 06 '22 12:06 TakaHiR07

We encountered the same problem. When the AutoRecovery service is started with Bookie, it will find that the current Ledger ID has been deleted, but does Auto Recovery have no ability to automatically skip these Ledgers, or has it been trying to recover the deleted Ledgers?

wolfstudy avatar Aug 03 '22 05:08 wolfstudy

12:42:41.841 [main-EventThread] INFO  org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Latch countdown due to ZK event: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/ledgers/underreplication/locks
12:42:41.846 [BookieReadThreadPool-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.proto.ReadLacProcessorV3 - No ledger found while performing readLac from ledger: 8049631
org.apache.bookkeeper.bookie.Bookie$NoLedgerException: Ledger 8049631 not found
        at org.apache.bookkeeper.bookie.LedgerDescriptor.createReadOnly(LedgerDescriptor.java:52) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.bookie.HandleFactoryImpl.getReadOnlyHandle(HandleFactoryImpl.java:61) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.bookie.Bookie.getExplicitLac(Bookie.java:1364) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.proto.ReadLacProcessorV3.getReadLacResponse(ReadLacProcessorV3.java:71) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.proto.ReadLacProcessorV3.safeRun(ReadLacProcessorV3.java:118) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_232]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_232]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.67.Final.jar:4.1.67.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
12:42:41.846 [BookieReadThreadPool-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.proto.ReadLacProcessorV3 - No ledger found while trying to read last entry: 8049631
org.apache.bookkeeper.bookie.Bookie$NoLedgerException: Ledger 8049631 not found
        at org.apache.bookkeeper.bookie.LedgerDescriptor.createReadOnly(LedgerDescriptor.java:52) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.bookie.HandleFactoryImpl.getReadOnlyHandle(HandleFactoryImpl.java:61) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.bookie.Bookie.readEntry(Bookie.java:1441) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.proto.ReadLacProcessorV3.getReadLacResponse(ReadLacProcessorV3.java:86) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.proto.ReadLacProcessorV3.safeRun(ReadLacProcessorV3.java:118) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_232]

We are using version 4.12 of Bookie

wolfstudy avatar Aug 03 '22 05:08 wolfstudy

16:02:17.008 [main] INFO  org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - ledgerID: 8049631
16:02:17.015 [main] INFO  org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=4, writeQuorumSize=3, ackQuorumSize=2, state=OPEN, digestType=CRC32C, password=base64:, ensembles={0=[xxxx]}, customMetadata={component=base64:bWFuYWdlZC1sZWRnZXI=, pulsar/managed-ledger=base64:cHVsc2FxxxxGl0bG9nX2ZpbmRlci1wYXJ0aXRpb24tMjk=, application=base64:cHVsc2Fy}}

wolfstudy avatar Aug 03 '22 08:08 wolfstudy

16:02:17.008 [main] INFO  org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - ledgerID: 8049631
16:02:17.015 [main] INFO  org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=4, writeQuorumSize=3, ackQuorumSize=2, state=OPEN, digestType=CRC32C, password=base64:, ensembles={0=[xxxx]}, customMetadata={component=base64:bWFuYWdlZC1sZWRnZXI=, pulsar/managed-ledger=base64:cHVsc2FxxxxGl0bG9nX2ZpbmRlci1wYXJ0aXRpb24tMjk=, application=base64:cHVsc2Fy}}

My bookie version is 4.14.1 , and I finally solve it by

  1. disable autoRecovery
  2. remove this ledger from zk:/ledgers/undereplication/ledgers and restart bookie
  3. enable autoRecovery

It seems bookie can skip unrecoverable ledger after pr-2870, proposed in bookie 4.14.4

TakaHiR07 avatar Aug 03 '22 08:08 TakaHiR07