bookkeeper
bookkeeper copied to clipboard
[AutoRecovery] keep rereplicate a ledger which maybe deleted
BUG REPORT
Describe the bug
Our Production pulsar cluster is multiple nodes with E-Qw-Qa(3-3-2), enabling auto-recovery by "./bin/bookkeeper shell autorecovery -enable", bookkeeper version is 4.14.1 . Now one bookie server is down, and cluster do autoRecovery. However, there is a ledger can not read by the other 2 ensemble, the error is both : Ledger 1294 not found (It seems the ledger has been deleted)
[BookieReadThreadPool-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.proto.ReadLacProcessorV3 - No ledger found while performing readLac from ledger: 1294
org.apache.bookkeeper.bookie.Bookie$NoLedgerException: Ledger 1294 not found
at org.apache.bookkeeper.bookie.LedgerDescriptor.createReadOnly(LedgerDescriptor.java:52) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.bookie.HandleFactoryImpl.getReadOnlyHandle(HandleFactoryImpl.java:61) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
But the ReplicationWorker still continue to try to rereplicate this ledger, and keep failed. According to the following log, it throw BKNotEnoughBookiesException, therefore ReplicationWorker#run would keep running, keep replicate a can-not-replicated ledger. The result is generating too much recovery read request to the other 2 ensemble bookie, affect the normal read request.
[BookKeeperClientWorker-OrderedExecutor-0-0] INFO org.apache.bookkeeper.client.PendingReadLacOp - While readLac ledger: 1294 did not hear success
responses from all of ensemble
[ReplicationWorker] INFO org.apache.bookkeeper.replication.ReplicationWorker - BKReadException while rereplicating ledger 1294. Enough Bookies might not have available So, no harm to continue
[BookieReadThreadPool-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.proto.ReadLacProcessorV3 - IOException while trying to read last entry: 1294
org.apache.bookkeeper.bookie.Bookie$NoEntryException: Entry -1 not found in 1294
The zkmetadata has ledger 1294 under /ledgers/underreplication/ledgers
To Reproduce
Expected behavior
Should it skip those deleted ledger when doing recovery ?
We encountered the same problem. When the AutoRecovery service is started with Bookie, it will find that the current Ledger ID has been deleted, but does Auto Recovery have no ability to automatically skip these Ledgers, or has it been trying to recover the deleted Ledgers?
12:42:41.841 [main-EventThread] INFO org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Latch countdown due to ZK event: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/ledgers/underreplication/locks
12:42:41.846 [BookieReadThreadPool-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.proto.ReadLacProcessorV3 - No ledger found while performing readLac from ledger: 8049631
org.apache.bookkeeper.bookie.Bookie$NoLedgerException: Ledger 8049631 not found
at org.apache.bookkeeper.bookie.LedgerDescriptor.createReadOnly(LedgerDescriptor.java:52) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.bookie.HandleFactoryImpl.getReadOnlyHandle(HandleFactoryImpl.java:61) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.bookie.Bookie.getExplicitLac(Bookie.java:1364) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.proto.ReadLacProcessorV3.getReadLacResponse(ReadLacProcessorV3.java:71) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.proto.ReadLacProcessorV3.safeRun(ReadLacProcessorV3.java:118) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_232]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.67.Final.jar:4.1.67.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
12:42:41.846 [BookieReadThreadPool-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.proto.ReadLacProcessorV3 - No ledger found while trying to read last entry: 8049631
org.apache.bookkeeper.bookie.Bookie$NoLedgerException: Ledger 8049631 not found
at org.apache.bookkeeper.bookie.LedgerDescriptor.createReadOnly(LedgerDescriptor.java:52) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.bookie.HandleFactoryImpl.getReadOnlyHandle(HandleFactoryImpl.java:61) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.bookie.Bookie.readEntry(Bookie.java:1441) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.proto.ReadLacProcessorV3.getReadLacResponse(ReadLacProcessorV3.java:86) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.proto.ReadLacProcessorV3.safeRun(ReadLacProcessorV3.java:118) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_232]
We are using version 4.12 of Bookie
16:02:17.008 [main] INFO org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - ledgerID: 8049631
16:02:17.015 [main] INFO org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=4, writeQuorumSize=3, ackQuorumSize=2, state=OPEN, digestType=CRC32C, password=base64:, ensembles={0=[xxxx]}, customMetadata={component=base64:bWFuYWdlZC1sZWRnZXI=, pulsar/managed-ledger=base64:cHVsc2FxxxxGl0bG9nX2ZpbmRlci1wYXJ0aXRpb24tMjk=, application=base64:cHVsc2Fy}}
16:02:17.008 [main] INFO org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - ledgerID: 8049631 16:02:17.015 [main] INFO org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=4, writeQuorumSize=3, ackQuorumSize=2, state=OPEN, digestType=CRC32C, password=base64:, ensembles={0=[xxxx]}, customMetadata={component=base64:bWFuYWdlZC1sZWRnZXI=, pulsar/managed-ledger=base64:cHVsc2FxxxxGl0bG9nX2ZpbmRlci1wYXJ0aXRpb24tMjk=, application=base64:cHVsc2Fy}}
My bookie version is 4.14.1 , and I finally solve it by
- disable autoRecovery
- remove this ledger from zk:/ledgers/undereplication/ledgers and restart bookie
- enable autoRecovery
It seems bookie can skip unrecoverable ledger after pr-2870, proposed in bookie 4.14.4