[bugfix] ManagedLedger: move to FENCED state in case of BadVersionException
Motivation
It may happen that a ManagedLedger continues to work even in case of seeing a BadVersionException. For instance background activities like trimming ledgers, or offloading will continue to work. This is very dangerous as it leads to some kind of split brain (it is not actually a split brain): two brokers continue to process the metadata (and data) of the same ledger.
Modifications
- Force moving to Fenced state the ManagedLedger in case of BadVersionException
- Fail the "offload loop" in case of Fenced state
- Better cleanup of the PersistentTopic
Verifying this change
I added tests that cover the changes
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] Anything that affects deployment
Documentation
-
[ ]
doc-required(Your PR needs to update docs and you will update later) -
[x]
doc-not-needed(Please explain why) -
[ ]
doc(Your PR contains doc changes) -
[ ]
doc-complete(Docs have been already added)
Matching PR in forked repository
PR in forked repository: https://github.com/eolivelli/pulsar/pull/16
After opening this PR, the build in apache/pulsar will fail and instructions will be provided for opening a PR in the PR author's forked repository.
apache/pulsar pull requests should be first tested in your own fork since the apache/pulsar CI based on GitHub Actions has constrained resources and quota. GitHub Actions provides separate quota for pull requests that are executed in a forked repository.
The tests will be run in the forked repository until all PR review comments have been handled, the tests pass and the PR is approved by a reviewer.
-->
I am adding test cases in order to have examples of the problems and to prevent regressions in the future
LGTM but it broke testTruncateCorruptDataLedger
Error: Tests run: 9, Failures: 1, Errors: 0, Skipped: 3, Time elapsed: 45.675 s <<< FAILURE! - in org.apache.pulsar.broker.service.BrokerBkEnsemblesTests
Error: testTruncateCorruptDataLedger(org.apache.pulsar.broker.service.BrokerBkEnsemblesTests) Time elapsed: 3.13 s <<< FAILURE!
org.apache.pulsar.client.admin.PulsarAdminException$ServerSideErrorException:
--- An unexpected error occurred in the server ---
Message: Can't trim fenced ledger
Stacktrace:
org.apache.bookkeeper.mledger.ManagedLedgerException$ManagedLedgerAlreadyClosedException: Can't trim fenced ledger
@dlg99 I have fixed the test. The test was trying do something that cannot happen. it was calling "ledgerManager.delete()" and then "truncate()".
I removed "delete", the test is still meaningful, because it tests that the truncation works in case of missing ledgers (due to concurrent execution or previous deletion)
We already handled the BadVersion in this way before https://github.com/apache/pulsar/pull/17736/files#diff-f6a849bd8fdb782ef6c17a2e07a2c54c3dc7d1655c00ec3546d5f3b3fc61e970L1537
Looks good to me. Approved the PR