[fix] [ml] Add entry fail due to race condition about add entry failed/timeout and switch ledger
Issue 1 without enable managedLedgerAddEntryTimeoutSeconds
Background: Flow when adding entry failed
- Switch to a new ledger
- Close and duplicate all the pending add requests[3]
- The callback of the add entry request will be discarded because it is closed.
- Redo all the pending add requests.
Issue
| time | add entries failed | switch ledger |
|---|---|---|
| 1 | add the first entry failed | |
| 2 | start to switch ledger, new ledger was created | |
| 3 | close all pending OpAddEntry[3] | |
| 4 | recycle OpAddEntry due to change state to COMPLETED failed[4] |
|
| 5 | get an NPE when duplicating pending OpAddEntry |
Reproduce: no test to reproduce this case yet.
Issue 2 that enabled managedLedgerAddEntryTimeoutSeconds
Background: Flow when adding entry timeout
- (Highlight) Close the first add entry request[1].
- Switch to a new ledger
- Close and duplicate all the pending add requests
- The callback of the add entry request will be discarded because it is closed.
- Redo all the pending add requests.
Issue The flow when adding entry timeout closing the first add entry request at first, it may cause an issue like this:
-
timeout task - step 1: Close the first add entry request. -
callback of OpAddEntry 1: discard the callback of the first OpAddEntry because it is closed, recycle the object OpAddEntry. -
callback of OpAddEntry 2: tries do a completed callback- Confirm the first request equals the current one:
false, and the first OpAddEntry has been recycled, Pulsar will get an NPE[2].
- Confirm the first request equals the current one:
Reproduce: the new test testAddEntryResponseTimeout
Modifications
- Recycle OpAddEntry after it is closed when change state to
COMPLETEDfailed. - Make the timeout process exactly the same as the process that failed to add
Footnotes
[1]: OpAddEntry.handleAddTimeoutFailure
void handleAddTimeoutFailure(final LedgerHandle ledger, Object ctx) {
if (checkAndCompleteOp(ctx)) {
this.close();
this.handleAddFailure(ledger);
}
}
[2]: OpAddEntry.run
OpAddEntry firstInQueue = ml.pendingAddEntries.poll();
if (firstInQueue == null) {
return;
}
if (this != firstInQueue) {
firstInQueue.failed(new ManagedLedgerException("Unexpected add entry op when complete the add entry op."));
return;
}
[3]: ManagedLedgerImpl.createNewOpAddEntryForNewLedger
do {
existsOp = pendingAddEntries.poll();
existsOp.close();
pendingAddEntries.add(OpAddEntry.createNoRetainBuffer(....));
} while (...)
if (!STATE_UPDATER.compareAndSet(OpAddEntry.this, State.INITIATED, State.COMPLETED)) {
OpAddEntry.this.recycle();
return;
}
Documentation
- [ ]
doc - [ ]
doc-required - [x]
doc-not-needed - [ ]
doc-complete
Matching PR in forked repository
PR in forked repository: x
/pulsarbot rerun-failure-checks
Rebase master.
@poorbarcode Could you please rebase to the master branch since the last commit of this PR was happened a month ago.
@poorbarcode Could you please rebase to the master branch since the last commit of this PR was happened a month ago.
Rebased from branch-master.
Codecov Report
Attention: Patch coverage is 95.65217% with 1 lines in your changes are missing coverage. Please review.
Project coverage is 73.20%. Comparing base (
bbc6224) to head (cda2b7d). Report is 291 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #22221 +/- ##
============================================
- Coverage 73.57% 73.20% -0.37%
+ Complexity 32624 2511 -30113
============================================
Files 1877 1889 +12
Lines 139502 141424 +1922
Branches 15299 15518 +219
============================================
+ Hits 102638 103536 +898
- Misses 28908 29883 +975
- Partials 7956 8005 +49
| Flag | Coverage Δ | |
|---|---|---|
| inttests | 27.44% <30.43%> (+2.86%) |
:arrow_up: |
| systests | 24.66% <43.47%> (+0.33%) |
:arrow_up: |
| unittests | 72.22% <95.65%> (-0.63%) |
:arrow_down: |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Files | Coverage Δ | |
|---|---|---|
| ...org/apache/bookkeeper/mledger/impl/OpAddEntry.java | 73.84% <100.00%> (+1.02%) |
:arrow_up: |
| ...okkeeper/mledger/impl/ShadowManagedLedgerImpl.java | 57.35% <100.00%> (+1.68%) |
:arrow_up: |
| ...che/bookkeeper/mledger/impl/ManagedLedgerImpl.java | 81.12% <92.30%> (+0.45%) |
:arrow_up: |
@poorbarcode I have created #23208 about the excessive log warnings that this change causes. Please take a look.
PR that fixes the excessive warnings: #23209