[Bug][broker]PulsarRegistrationClient writableBookieInfo cache and readOnlyBookieInfo cache update fail causing broker to misjudge that the bookie is unavailable.
Search before asking
- [X] I searched in the issues and found nothing similar.
Read release policy
- [X] I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.
Version
OS: Windows and Mac and Linux JDK: 17 Pulsar version: 3.0.5
Minimal reproduce step
Our production environment has two scenarios, including standalone deployment and cluster deployment, and we have not yet found the steps to reproduce them. The current speculation is that broker nodes may experience cache update failures due to failed execution of bookie list metadata change listening events in high load situations or when the common thread pool of ForkJoinPool is blocked.
What did you expect to see?
The bookie process is running normally and there is corresponding bookie information in the metadata. Therefore, the broker's cache should also have corresponding bookie information.
What did you see instead?
The bookie process is running normally and there is corresponding bookie information in the metadata. But the broker's cache does not have corresponding bookie information. And the broker prints the following exception information:
2024-06-11T12:26:24,569+0800 [pulsar-io-18-1] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.172.240.12:54152][persistent://public/default/writeCKDeadLetter-partition-0] Creating producer. producerId=1
2024-06-11T12:26:24,569+0800 [pulsar-io-18-1] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Opening managed ledger public/default/persistent/writeCKDeadLetter-partition-0
2024-06-11T12:26:24,570+0800 [main-EventThread] INFO org.apache.bookkeeper.client.DefaultBookieAddressResolver - Cannot resolve 127.0.0.1:3181, bookie is unknown org.apache.bookkeeper.client.BKException$BKBookieHandleNotAvailableException: Bookie handle is not available
2024-06-11T12:26:24,570+0800 [main-EventThread] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to 127.0.0.1:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId 127.0.0.1:3181, bookie does not exist or it is not running
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.client.ReadLastConfirmedOp - While readLastConfirmed ledger: 31003 did not hear success responses from all quorums, QuorumCoverage(e:1,w:1,a:1) = [-8]
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [public/default/persistent/writeCKDeadLetter-partition-0] Failed to open ledger 31003: Error while recovering ledger
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - [public/default/persistent/writeCKDeadLetter-partition-0] Failed to initialize managed ledger: Error while recovering ledger
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [public/default/persistent/writeCKDeadLetter-partition-0] Closing managed ledger
Anything else?
No response
Are you willing to submit a PR?
- [X] I'm willing to submit a PR!