[fix] [broker] Fix failed updates of the PulsarRegistrationClient writableBookieInfo and readOnlyBookieInfo caches, which cause the broker to misjudge a bookie as unavailable
Fixes #23020
Motivation
The bookie process is running normally and its information is present in the metadata, so the broker's cache should contain the corresponding bookie information as well. However, the broker's cache is missing that entry, which causes the broker to misjudge the bookie as unavailable.
The broker prints the following exception information:
2024-06-11T12:26:24,569+0800 [pulsar-io-18-1] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.172.240.12:54152][persistent://public/default/writeCKDeadLetter-partition-0] Creating producer. producerId=1
2024-06-11T12:26:24,569+0800 [pulsar-io-18-1] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Opening managed ledger public/default/persistent/writeCKDeadLetter-partition-0
2024-06-11T12:26:24,570+0800 [main-EventThread] INFO org.apache.bookkeeper.client.DefaultBookieAddressResolver - Cannot resolve 127.0.0.1:3181, bookie is unknown org.apache.bookkeeper.client.BKException$BKBookieHandleNotAvailableException: Bookie handle is not available
2024-06-11T12:26:24,570+0800 [main-EventThread] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to 127.0.0.1:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId 127.0.0.1:3181, bookie does not exist or it is not running
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.client.ReadLastConfirmedOp - While readLastConfirmed ledger: 31003 did not hear success responses from all quorums, QuorumCoverage(e:1,w:1,a:1) = [-8]
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [public/default/persistent/writeCKDeadLetter-partition-0] Failed to open ledger 31003: Error while recovering ledger
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - [public/default/persistent/writeCKDeadLetter-partition-0] Failed to initialize managed ledger: Error while recovering ledger
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [public/default/persistent/writeCKDeadLetter-partition-0] Closing managed ledger
Modifications
pulsar-metadata/src/main/java/org/apache/pulsar/metadata/bookkeeper/PulsarRegistrationClient.java#getBookieServiceInfo
https://github.com/yyj8/pulsar/commit/c52069f9ccd665e86802a103652e08264eecb63d#diff-7a5305b98183695c3d8246b9e9ccafa68180d7f9e8df5534019d6bbcc59a90f6
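To illustrate the kind of lookup involved, here is a minimal sketch, not the code from the linked commit. It assumes two in-memory caches named after `writableBookieInfo` and `readOnlyBookieInfo`, and a hypothetical `metadataReader` function standing in for a read against the metadata store. The class name, the `String` value type, and the fallback itself are assumptions made for illustration only.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch; this is NOT the actual PulsarRegistrationClient code.
final class BookieInfoLookupSketch {
    // Stand-ins for the writableBookieInfo and readOnlyBookieInfo caches.
    private final Map<String, String> writableBookieInfo = new ConcurrentHashMap<>();
    private final Map<String, String> readOnlyBookieInfo = new ConcurrentHashMap<>();
    // Assumed stand-in for reading a bookie's registration from the metadata store.
    private final Function<String, Optional<String>> metadataReader;

    BookieInfoLookupSketch(Function<String, Optional<String>> metadataReader) {
        this.metadataReader = metadataReader;
    }

    Optional<String> getBookieServiceInfo(String bookieId) {
        // Check the writable cache first, then the read-only cache.
        String cached = writableBookieInfo.get(bookieId);
        if (cached == null) {
            cached = readOnlyBookieInfo.get(bookieId);
        }
        if (cached != null) {
            return Optional.of(cached);
        }
        // Cache miss: consult the metadata before treating the bookie as unknown,
        // so that a stale cache alone does not make a live bookie look unavailable.
        return metadataReader.apply(bookieId);
    }
}
```

With a reader that knows about `127.0.0.1:3181`, calling `getBookieServiceInfo("127.0.0.1:3181")` on an empty cache returns the stored value instead of an empty result.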
Verifying this change
- [x] Make sure that the change passes the CI checks.
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] The metrics
- [ ] Anything that affects deployment
Documentation
- [ ] doc
- [ ] doc-required
- [x] doc-not-needed
- [ ] doc-complete
Matching PR in forked repository
PR in forked repository:
@yyj8 Please add the following content to your PR description and select a checkbox:
- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->
It seems that #20642 attempted to fix some issues in this area. Useful for more context.