pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[bugfix] ManagedLedger: move to FENCED state in case of BadVersionException

Open eolivelli opened this issue 3 years ago • 4 comments

Motivation

It may happen that a ManagedLedger continues to work even in case of seeing a BadVersionException. For instance background activities like trimming ledgers, or offloading will continue to work. This is very dangerous as it leads to some kind of split brain (it is not actually a split brain): two brokers continue to process the metadata (and data) of the same ledger.

Modifications

  • Force moving to Fenced state the ManagedLedger in case of BadVersionException
  • Fail the "offload loop" in case of Fenced state
  • Better cleanup of the PersistentTopic

Verifying this change

I added tests that cover the changes

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • [ ] Dependencies (add or upgrade a dependency)
  • [ ] The public API
  • [ ] The schema
  • [ ] The default values of configurations
  • [ ] The binary protocol
  • [ ] The REST endpoints
  • [ ] The admin CLI options
  • [ ] Anything that affects deployment

Documentation

  • [ ] doc-required (Your PR needs to update docs and you will update later)

  • [x] doc-not-needed (Please explain why)

  • [ ] doc (Your PR contains doc changes)

  • [ ] doc-complete (Docs have been already added)

Matching PR in forked repository

PR in forked repository: https://github.com/eolivelli/pulsar/pull/16

After opening this PR, the build in apache/pulsar will fail and instructions will be provided for opening a PR in the PR author's forked repository.

apache/pulsar pull requests should be first tested in your own fork since the apache/pulsar CI based on GitHub Actions has constrained resources and quota. GitHub Actions provides separate quota for pull requests that are executed in a forked repository.

The tests will be run in the forked repository until all PR review comments have been handled, the tests pass and the PR is approved by a reviewer.

-->

eolivelli avatar Sep 20 '22 08:09 eolivelli

I am adding test cases in order to have examples of the problems and to prevent regressions in the future

eolivelli avatar Sep 20 '22 09:09 eolivelli

LGTM but it broke testTruncateCorruptDataLedger

  Error:  Tests run: 9, Failures: 1, Errors: 0, Skipped: 3, Time elapsed: 45.675 s <<< FAILURE! - in org.apache.pulsar.broker.service.BrokerBkEnsemblesTests
  Error:  testTruncateCorruptDataLedger(org.apache.pulsar.broker.service.BrokerBkEnsemblesTests)  Time elapsed: 3.13 s  <<< FAILURE!
  org.apache.pulsar.client.admin.PulsarAdminException$ServerSideErrorException: 
  
   --- An unexpected error occurred in the server ---
  
  Message: Can't trim fenced ledger
  
  Stacktrace:
  
  org.apache.bookkeeper.mledger.ManagedLedgerException$ManagedLedgerAlreadyClosedException: Can't trim fenced ledger

dlg99 avatar Sep 20 '22 17:09 dlg99

@dlg99 I have fixed the test. The test was trying do something that cannot happen. it was calling "ledgerManager.delete()" and then "truncate()".

I removed "delete", the test is still meaningful, because it tests that the truncation works in case of missing ledgers (due to concurrent execution or previous deletion)

eolivelli avatar Sep 21 '22 08:09 eolivelli

We already handled the BadVersion in this way before https://github.com/apache/pulsar/pull/17736/files#diff-f6a849bd8fdb782ef6c17a2e07a2c54c3dc7d1655c00ec3546d5f3b3fc61e970L1537

Looks good to me. Approved the PR

codelipenghui avatar Sep 21 '22 13:09 codelipenghui