
[fix] [broker] PulsarRegistrationClient writableBookieInfo and readOnlyBookieInfo cache updates fail, causing the broker to misjudge that the bookie is unavailable

Open yyj8 opened this issue 1 year ago • 1 comments

Fixes #23020

Main Issue: #xyz

PIP: #xyz

Motivation

The bookie process is running normally and its information is present in the metadata, so the broker's cache should also contain the corresponding bookie information. However, the broker's cache is missing that information, which causes the broker to misjudge the bookie as unavailable.

The broker logs the following exceptions:

2024-06-11T12:26:24,569+0800 [pulsar-io-18-1] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.172.240.12:54152][persistent://public/default/writeCKDeadLetter-partition-0] Creating producer. producerId=1
2024-06-11T12:26:24,569+0800 [pulsar-io-18-1] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Opening managed ledger public/default/persistent/writeCKDeadLetter-partition-0
2024-06-11T12:26:24,570+0800 [main-EventThread] INFO  org.apache.bookkeeper.client.DefaultBookieAddressResolver - Cannot resolve 127.0.0.1:3181, bookie is unknown org.apache.bookkeeper.client.BKException$BKBookieHandleNotAvailableException: Bookie handle is not available
2024-06-11T12:26:24,570+0800 [main-EventThread] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to 127.0.0.1:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId 127.0.0.1:3181, bookie does not exist or it is not running
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.client.ReadLastConfirmedOp - While readLastConfirmed ledger: 31003 did not hear success responses from all quorums, QuorumCoverage(e:1,w:1,a:1) = [-8]
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [public/default/persistent/writeCKDeadLetter-partition-0] Failed to open ledger 31003: Error while recovering ledger
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - [public/default/persistent/writeCKDeadLetter-partition-0] Failed to initialize managed ledger: Error while recovering ledger
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [public/default/persistent/writeCKDeadLetter-partition-0] Closing managed ledger
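
To make the failure mode concrete, here is a minimal, self-contained sketch of a cache that has missed an update while the metadata still holds the bookie. It is an illustration only and is not the code from this PR: the class `BookieInfoCacheSketch`, its methods, and the `metadataLookup` function are hypothetical stand-ins, and bookie information is reduced to a plain string. It contrasts a cache-only lookup, which reports a live bookie as unknown, with one possible mitigation that falls back to the metadata on a cache miss.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch; not the actual PulsarRegistrationClient implementation. */
class BookieInfoCacheSketch {
    // Local cache, normally filled by metadata notifications; it can miss an update.
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Stand-in for a direct read against the metadata store (assumption for this sketch).
    private final Function<String, Optional<String>> metadataLookup;

    BookieInfoCacheSketch(Function<String, Optional<String>> metadataLookup) {
        this.metadataLookup = metadataLookup;
    }

    /** Cache-only lookup: a missed cache update makes a running bookie look unknown. */
    Optional<String> resolveFromCacheOnly(String bookieId) {
        return Optional.ofNullable(cache.get(bookieId));
    }

    /** On a cache miss, consult the metadata and repopulate the cache if the bookie exists. */
    Optional<String> resolveWithFallback(String bookieId) {
        String cached = cache.get(bookieId);
        if (cached != null) {
            return Optional.of(cached);
        }
        Optional<String> fromStore = metadataLookup.apply(bookieId);
        fromStore.ifPresent(info -> cache.put(bookieId, info));
        return fromStore;
    }
}
```

With an empty cache and a metadata lookup that knows `127.0.0.1:3181`, `resolveFromCacheOnly` returns an empty result, which corresponds to the "bookie is unknown" symptom in the log above, while `resolveWithFallback` finds the bookie and leaves the cache populated for later lookups.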

Modifications

pulsar-metadata/src/main/java/org/apache/pulsar/metadata/bookkeeper/PulsarRegistrationClient.java#getBookieServiceInfo

https://github.com/yyj8/pulsar/commit/c52069f9ccd665e86802a103652e08264eecb63d#diff-7a5305b98183695c3d8246b9e9ccafa68180d7f9e8df5534019d6bbcc59a90f6

Verifying this change

  • [x] Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • [ ] Dependencies (add or upgrade a dependency)
  • [ ] The public API
  • [ ] The schema
  • [ ] The default values of configurations
  • [ ] The threading model
  • [ ] The binary protocol
  • [ ] The REST endpoints
  • [ ] The admin CLI options
  • [ ] The metrics
  • [ ] Anything that affects deployment

Documentation

  • [ ] doc
  • [ ] doc-required
  • [x] doc-not-needed
  • [ ] doc-complete

Matching PR in forked repository

PR in forked repository:

yyj8, Jul 10 '24 14:07

@yyj8 Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

github-actions[bot], Jul 10 '24 14:07

It seems that #20642 attempted to fix some issues in this area. Useful for more context.

lhotari, May 14 '25 08:05