Brokers are not able to load topics in non-recoverable situations

Open rdhabalia opened this issue 2 years ago • 1 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Motivation

We introduced a configuration called “autoSkipNonRecoverableData” before open-sourcing Pulsar because we had come across various situations in which ledgers belonging to a managed ledger or managed cursor could not be recovered and the broker was unable to load the topics. In such situations, the “autoSkipNonRecoverableData” flag skips non-recoverable ledger-recovery errors such as ledger_not_found and allows the broker to load topics by skipping those ledgers during disaster recovery.

Brokers can recognize such non-recoverable errors from BookKeeper error codes, but in some cases it is tricky or even impossible to conclude that an error is non-recoverable. For example, the broker cannot tell whether all the ensemble bookies of a ledger are temporarily unavailable or have been permanently removed from the cluster without graceful recovery. As a result, the broker does not treat the loss of all bookies as a non-recoverable error, even though the ledgers cannot be recovered once all of their bookies are gone, for example after a dev cluster cleanup or a data disaster with multiple bookie losses.

In such situations, the system admin has to manually identify the non-recoverable topics, update the managed-ledger and managed-cursor metadata of those topics, and reload the topics. This requires a lot of manual effort and may not be feasible when a large number of topics need the procedure.

Error:

2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [prop/us-west2/ns1/persistent/t1-partition-220] Opened ledger 94267 for consumer pulsar.repl.dev-eastb. rc=0
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO  org.apache.bookkeeper.client.DefaultBookieAddressResolver - Cannot resolve euw1bk--prod-bookie-0.euw1bk--prod-bookie.bk-.svc.cluster.local:3181, bookie is unknown org.apache.bookkeeper.client.BKException$BKBookieHandleNotAvailableException: Bookie handle is not available
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to euw1bk--prod-bookie-0.euw1bk--prod-bookie.bk-.svc.cluster.local:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId euw1bk--prod-bookie-0.euw1bk--prod-bookie.bk-.svc.cluster.local:3181, bookie does not exist or it is not running
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L94267 E46 from bookie: euw1bk--prod-bookie-0.euw1bk--prod-bookie.bk-.svc.cluster.local:3181
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO  org.apache.bookkeeper.client.DefaultBookieAddressResolver - Cannot resolve euw1bk--prod-bookie-2.euw1bk--prod-bookie.bk-.svc.cluster.local:3181, bookie is unknown org.apache.bookkeeper.client.BKException$BKBookieHandleNotAvailableException: Bookie handle is not available
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to euw1bk--prod-bookie-2.euw1bk--prod-bookie.bk-.svc.cluster.local:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId euw1bk--prod-bookie-2.euw1bk--prod-bookie.bk-.svc.cluster.local:3181, bookie does not exist or it is not running
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L94267 E46 from bookie: euw1bk--prod-bookie-2.euw1bk--prod-bookie.bk-.svc.cluster.local:3181
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L94267 E46-E46, Sent to [euw1bk--prod-bookie-2.euw1bk--prod-bookie.bk-.svc.cluster.local:3181, euw1bk--prod-bookie-0.euw1bk--prod-bookie.bk-.svc.cluster.local:3181], Heard from [] : bitset = {}, Error = 'Bookie handle is not available'. First unread entry is (-1, rc = null)
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] WARN  org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [prop/us-west2/ns1/persistent/t1-partition-220] Error reading from metadata ledger 94267 for consumer pulsar.repl.dev-eastb: Bookie handle is not available
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] WARN  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [prop/us-west2/ns1/persistent/t1-partition-220] Recovery for cursor pulsar.repl.dev-eastb failed
org.apache.bookkeeper.mledger.ManagedLedgerException: Bookie handle is not available
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [prop/us-west2/ns1/persistent/t1-partition-220] Closing managed ledger
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] WARN  org.apache.pulsar.broker.service.BrokerService - Failed to create topic persistent://prop/us-west2/ns1/persistent/t1-partition-220
org.apache.bookkeeper.mledger.ManagedLedgerException: Bookie handle is not available
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] WARN  org.apache.pulsar.broker.service.ServerCnx - [/1.1.1.1:59962][persistent://prop/us-west2/ns1/t1-partition-220][my-local] Failed to create consumer: consumerId=5369828, org.apache.bookkeeper.mledger.ManagedLedgerException: Bookie handle is not available
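
To give a sense of the manual identification step mentioned in the motivation, here is a rough sketch that collects the affected topic names from `Failed to create topic` warnings like the one in the log above. The class name `FailedTopicScanner` and the regular expression are assumptions made for this illustration; they are not part of Pulsar, and a real cluster would likely need a more careful approach than log scraping.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pulsar: lists topics the broker failed to load.
final class FailedTopicScanner {
    // Matches the BrokerService warning text seen in the log excerpt.
    private static final Pattern FAILED_TOPIC =
            Pattern.compile("Failed to create topic (persistent://\\S+)");

    // Returns the distinct topic names in the order they first appear.
    static Set<String> scan(List<String> logLines) {
        Set<String> topics = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = FAILED_TOPIC.matcher(line);
            if (m.find()) {
                topics.add(m.group(1));
            }
        }
        return topics;
    }
}
```

Feeding the broker log lines to `FailedTopicScanner.scan` yields each failing topic once, even if the broker retried loading it several times.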

Solution

Therefore, the system admin should have a dynamic configuration, managedLedgerForceRecovery, to use in such situations. When enabled, it allows brokers to forcefully load topics by skipping ledger failures, which avoids topic unavailability and lets the topics be repaired automatically. This allows the admin to handle disaster recovery in a controlled and automated manner and to keep topics available despite such failures.
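
The intended interaction between the two flags can be sketched as a small decision function. This is only a simplified model of the behaviour described in this issue, not the actual broker code: the class `LedgerRecoveryPolicy`, the `ReadError` enum, and the method name are hypothetical, and the real change may classify errors differently.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative model only; these names are hypothetical, not Pulsar or BookKeeper APIs.
final class LedgerRecoveryPolicy {
    // Two failure kinds mentioned in this issue.
    enum ReadError { LEDGER_NOT_FOUND, BOOKIE_HANDLE_NOT_AVAILABLE }

    // Errors the broker can classify as non-recoverable from the error code alone.
    private static final Set<ReadError> KNOWN_NON_RECOVERABLE =
            EnumSet.of(ReadError.LEDGER_NOT_FOUND);

    // Decides whether a failed ledger is skipped so that the topic can still load.
    static boolean shouldSkipLedger(ReadError error,
                                    boolean autoSkipNonRecoverableData,
                                    boolean managedLedgerForceRecovery) {
        if (managedLedgerForceRecovery) {
            // Admin override: skip any ledger failure, including ambiguous ones.
            return true;
        }
        // Without the override, only clearly non-recoverable errors are skipped.
        return autoSkipNonRecoverableData && KNOWN_NON_RECOVERABLE.contains(error);
    }
}
```

In this model, a "Bookie handle is not available" failure blocks topic loading when only autoSkipNonRecoverableData is set, and is skipped once the admin turns on managedLedgerForceRecovery.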

Alternatives

No response

Anything else?

No response

Are you willing to submit a PR?

  • [X] I'm willing to submit a PR!

rdhabalia avatar Dec 19 '23 04:12 rdhabalia

Reopening the issue since the PR is still in progress: https://github.com/apache/pulsar/pull/21759

dao-jun avatar May 14 '24 06:05 dao-jun