[improve][broker] PIP-327: Support force topic loading for unrecoverable errors
It fixes https://github.com/apache/pulsar/issues/21751
PIP: #21752
Motivation
We have introduced a configuration called autoSkipNonRecoverableData before open-sourcing Pulsar as we have come across with various situations when it was not possible to recover ledgers belonging to managed-ledger or managed-cursors and the broker was not able to load the topics. In such situations,autoSkipNonRecoverableData flag helps to skip non-recoverable leger-recovery errors such as ledger_not_found and allows the broker to load topics by skipping such ledgers in disaster recovery.
Brokers can recognize such non-recoverable errors using bookkeeper error codes but in some cases, it’s very tricky and not possible to conclude non-recoverable errors. For example, the broker can not differentiate between all the ensemble bookies of the ledgers that are temporarily unavailable or are permanently removed from the cluster without graceful recovery, and because of that broker doesn’t consider all the bookies deleted as a non-recoverable error though we can not recover ledgers in such situations where all the bookies are removed due to various reasons such as Dev cluster clean up or system faced data disaster with multiple bookie loss. In such situations, the system admin has to manually identify such non-recoverable topics and update those topics’ managed-ledger and managed-cursor’s metadata and reload topics again which requires a lot of manual effort and sometimes it might not be feasible to handle such situations with a large number of topics that require this manual procedure to fix those topics.
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [prop/us-west2/ns1/per
sistent/t1-partition-220] Opened ledger 94267 for consumer pulsar.repl.dev-eastb. rc=0
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO org.apache.bookkeeper.client.DefaultBookieAddressResolver - Cannot resolve euw1bk--prod-booki
e-0.euw1bk--prod-bookie.bk-.svc.cluster.local:3181, bookie is unknown org.apache.bookkeeper.client.BKException$BKBookieHandleNotAvailableException: Bookie handle i
s not available
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to euw1bk--prod-bookie-0.
euw1bk--prod-bookie.bk-.svc.cluster.local:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$Bookie
IdNotResolvedException: Cannot resolve bookieId euw1bk--prod-bookie-0.euw1bk--prod-bookie.bk-.svc.cluster.local:3181, bookie does not exist or it is not
running
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L94267
E46 from bookie: euw1bk--prod-bookie-0.euw1bk--prod-bookie.bk-.svc.cluster.local:3181
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO org.apache.bookkeeper.client.DefaultBookieAddressResolver - Cannot resolve euw1bk--prod-booki
e-2.euw1bk--prod-bookie.bk-.svc.cluster.local:3181, bookie is unknown org.apache.bookkeeper.client.BKException$BKBookieHandleNotAvailableException: Bookie handle i
s not available
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to euw1bk--prod-bookie-2.
euw1bk--prod-bookie.bk-.svc.cluster.local:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$Bookie
IdNotResolvedException: Cannot resolve bookieId euw1bk--prod-bookie-2.euw1bk--prod-bookie.bk-.svc.cluster.local:3181, bookie does not exist or it is not
running
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L94267
E46 from bookie: euw1bk--prod-bookie-2.euw1bk--prod-bookie.bk-.svc.cluster.local:3181
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L94267 E46-E46, Sent to [euw1
bk--prod-bookie-2.euw1bk--prod-bookie.bk-.svc.cluster.local:3181, euw1bk--prod-bookie-0.euw1bk--prod-bookie.bk-.svc.cluster
.local:3181], Heard from [] : bitset = {}, Error = 'Bookie handle is not available'. First unread entry is (-1, rc = null)
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] WARN org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [prop/us-west2/ns1/per
sistent/t1-partition-220] Error reading from metadata ledger 94267 for consumer pulsar.repl.dev-eastb: Bookie handle is not available
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] WARN org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [prop/us-west2/ns1/per
sistent/t1-partition-220] Recovery for cursor pulsar.repl.dev-eastb failed
org.apache.bookkeeper.mledger.ManagedLedgerException: Bookie handle is not available
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [prop/us-west2/ns1/per
sistent/t1-partition-220] Closing managed ledger
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] WARN org.apache.pulsar.broker.service.BrokerService - Failed to create topic persistent://prop/us-west2/ns1/per
sistent/t1-partition-220
org.apache.bookkeeper.mledger.ManagedLedgerException: Bookie handle is not available
2023-11-02T00:35:35,582+0000 [BookKeeperClientWorker-OrderedExecutor-5-0] WARN org.apache.pulsar.broker.service.ServerCnx - [/1.1.1.1:59962][persistent://prop/us-west2/ns1/t1-partition-220][my-local] Failed to create consumer: consumerId=5369828, org.apache.bookkeeper.mledger.ManagedLedgerException: Bookie handle is not available
Modifications
Therefore, the system admin should have a dynamic configuration called managedLedgerForceRecovery to use in such situations to allow brokers to forcefully load topics by skipping ledger failures to avoid topic unavailability and perform auto repairs of the topics. This will allow the admin to handle disaster recovery situations in a controlled and automated manner and maintain the topic availability by mitigating such failures.
Verifying this change
- [ ] Make sure that the change passes the CI checks.
(Please pick either of the following options)
This change is a trivial rework / code cleanup without any test coverage.
(or)
This change is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
- Added integration tests for end-to-end deployment with large payloads (10MB)
- Extended integration test for recovery after broker failure
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] The metrics
- [ ] Anything that affects deployment
Documentation
- [ ]
doc - [ ]
doc-required - [x]
doc-not-needed - [ ]
doc-complete
Matching PR in forked repository
PR in forked repository:
Should we consider the exceptions when initializing the ManagedLedgerImpl?
@rdhabalia This PIP are approved, are we still working on the PR?
Codecov Report
Attention: Patch coverage is 76.19048% with 5 lines in your changes missing coverage. Please review.
Project coverage is 74.57%. Comparing base (
bbc6224) to head (97ccefa). Report is 631 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #21759 +/- ##
============================================
+ Coverage 73.57% 74.57% +0.99%
- Complexity 32624 34561 +1937
============================================
Files 1877 1936 +59
Lines 139502 145378 +5876
Branches 15299 15893 +594
============================================
+ Hits 102638 108410 +5772
+ Misses 28908 28668 -240
- Partials 7956 8300 +344
| Flag | Coverage Δ | |
|---|---|---|
| inttests | 27.92% <57.14%> (+3.34%) |
:arrow_up: |
| systests | 24.52% <47.61%> (+0.20%) |
:arrow_up: |
| unittests | 73.91% <66.66%> (+1.06%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Files with missing lines | Coverage Δ | |
|---|---|---|
| ...apache/bookkeeper/mledger/ManagedLedgerConfig.java | 96.44% <100.00%> (+0.15%) |
:arrow_up: |
| ...org/apache/pulsar/broker/ServiceConfiguration.java | 98.97% <100.00%> (-0.43%) |
:arrow_down: |
| ...rg/apache/pulsar/broker/service/BrokerService.java | 83.54% <100.00%> (+2.76%) |
:arrow_up: |
| ...che/bookkeeper/mledger/impl/ManagedCursorImpl.java | 80.18% <66.66%> (+0.88%) |
:arrow_up: |
| ...broker/service/schema/BookkeeperSchemaStorage.java | 77.97% <60.00%> (+0.40%) |
:arrow_up: |
This PR introduced a new flaky test, #23417.