[Bug][broker] PulsarRegistrationClient fails to update its writableBookieInfo and readOnlyBookieInfo caches, causing the broker to misjudge a bookie as unavailable

Open yyj8 opened this issue 1 year ago • 0 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Read release policy

  • [X] I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

OS: Windows, Mac, and Linux
JDK: 17
Pulsar version: 3.0.5

Minimal reproduce step

Our production environment includes both standalone and cluster deployments, and we have not yet found steps to reproduce the issue. Our current speculation is that, under high load or when the ForkJoinPool common pool is blocked, the listener for bookie list metadata changes fails to execute on the broker node, so the cache is never updated.
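To make the suspected mechanism concrete, here is a minimal sketch of how a cache that is refreshed only by an asynchronous callback can lag behind the metadata while the shared pool is blocked. It does not use PulsarRegistrationClient or any Pulsar/BookKeeper API. The class name, the `metadata` and `cache` sets, and the single-thread executor standing in for the ForkJoinPool common pool are all assumptions made for illustration.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StaleBookieCacheSketch {

    /**
     * Simulates a cache refresh queued behind a blocked task.
     * Returns {bookie in cache while pool is blocked, bookie in cache after pool is freed}.
     */
    public static boolean[] simulate() {
        String bookieId = "127.0.0.1:3181";
        // Hypothetical stand-ins: "metadata" is the source of truth, "cache" is the broker-side copy.
        Set<String> metadata = ConcurrentHashMap.newKeySet();
        Set<String> cache = ConcurrentHashMap.newKeySet();
        // Assumption: a single-thread executor models a fully occupied shared pool deterministically.
        ExecutorService sharedPool = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        try {
            // A long-running task occupies the only worker thread.
            sharedPool.submit(() -> { release.await(); return null; });

            // The bookie registers: the metadata changes immediately...
            metadata.add(bookieId);
            // ...but the cache refresh is only queued behind the blocked task.
            CompletableFuture<Void> update =
                    CompletableFuture.runAsync(() -> cache.addAll(metadata), sharedPool);

            try {
                update.get(200, TimeUnit.MILLISECONDS);
            } catch (TimeoutException expected) {
                // The refresh has not run yet, so the cache is stale.
            }
            boolean whileBlocked = cache.contains(bookieId);

            // Free the pool; the queued refresh now runs.
            release.countDown();
            update.get(5, TimeUnit.SECONDS);
            boolean afterFreed = cache.contains(bookieId);

            return new boolean[] {whileBlocked, afterFreed};
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
            sharedPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean[] result = simulate();
        System.out.println("cached while pool blocked: " + result[0]);
        System.out.println("cached after pool freed:   " + result[1]);
    }
}
```

In this simulation the bookie is present in the metadata the whole time, yet a lookup against the cache misses it until the blocked task finishes. Whether the broker actually hits this path is still unconfirmed; the sketch only shows that the symptom is consistent with the speculation above.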

What did you expect to see?

The bookie process is running normally and its bookie information is present in the metadata, so the broker's cache should contain the corresponding bookie information as well.

What did you see instead?

The bookie process is running normally and its bookie information is present in the metadata, but the broker's cache does not contain the corresponding bookie information. The broker logs the following errors:

2024-06-11T12:26:24,569+0800 [pulsar-io-18-1] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.172.240.12:54152][persistent://public/default/writeCKDeadLetter-partition-0] Creating producer. producerId=1
2024-06-11T12:26:24,569+0800 [pulsar-io-18-1] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Opening managed ledger public/default/persistent/writeCKDeadLetter-partition-0
2024-06-11T12:26:24,570+0800 [main-EventThread] INFO  org.apache.bookkeeper.client.DefaultBookieAddressResolver - Cannot resolve 127.0.0.1:3181, bookie is unknown org.apache.bookkeeper.client.BKException$BKBookieHandleNotAvailableException: Bookie handle is not available
2024-06-11T12:26:24,570+0800 [main-EventThread] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to 127.0.0.1:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId 127.0.0.1:3181, bookie does not exist or it is not running
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.client.ReadLastConfirmedOp - While readLastConfirmed ledger: 31003 did not hear success responses from all quorums, QuorumCoverage(e:1,w:1,a:1) = [-8]
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [public/default/persistent/writeCKDeadLetter-partition-0] Failed to open ledger 31003: Error while recovering ledger
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - [public/default/persistent/writeCKDeadLetter-partition-0] Failed to initialize managed ledger: Error while recovering ledger
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [public/default/persistent/writeCKDeadLetter-partition-0] Closing managed ledger

Anything else?

No response

Are you willing to submit a PR?

  • [X] I'm willing to submit a PR!

yyj8 • Jul 10 '24 14:07