pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[fix] [ml] Add entry fail due to race condition about add entry failed/timeout and switch ledger

Open poorbarcode opened this issue 1 year ago • 2 comments

Issue 1 without enable managedLedgerAddEntryTimeoutSeconds

Background: Flow when adding entry failed

  • Switch to a new ledger
  • Close and duplicate all the pending add requests[3]
    • The callback of the add entry request will be discarded because it is closed.
  • Redo all the pending add requests.

Issue

time add entries failed switch ledger
1 add the first entry failed
2 start to switch ledger, new ledger was created
3 close all pending OpAddEntry[3]
4 recycle OpAddEntry due to change state to COMPLETED failed[4]
5 get an NPE when duplicating pending OpAddEntry

Reproduce: no test to reproduce this case yet.


Issue 2 that enabled managedLedgerAddEntryTimeoutSeconds

Background: Flow when adding entry timeout

  • (Highlight) Close the first add entry request[1].
  • Switch to a new ledger
  • Close and duplicate all the pending add requests
    • The callback of the add entry request will be discarded because it is closed.
  • Redo all the pending add requests.

Issue The flow when adding entry timeout closing the first add entry request at first, it may cause an issue like this:

  • timeout task - step 1: Close the first add entry request.
  • callback of OpAddEntry 1: discard the callback of the first OpAddEntry because it is closed, recycle the object OpAddEntry.
  • callback of OpAddEntry 2: tries do a completed callback
    • Confirm the first request equals the current one: false, and the first OpAddEntry has been recycled, Pulsar will get an NPE[2].

Reproduce: the new test testAddEntryResponseTimeout


Modifications

  • Recycle OpAddEntry after it is closed when change state to COMPLETED failed.
  • Make the timeout process exactly the same as the process that failed to add

Footnotes

[1]: OpAddEntry.handleAddTimeoutFailure

void handleAddTimeoutFailure(final LedgerHandle ledger, Object ctx) {
  if (checkAndCompleteOp(ctx)) {
      this.close();
      this.handleAddFailure(ledger);
  }
}

[2]: OpAddEntry.run

OpAddEntry firstInQueue = ml.pendingAddEntries.poll();
if (firstInQueue == null) {
    return;
}
if (this != firstInQueue) {
    firstInQueue.failed(new ManagedLedgerException("Unexpected add entry op when complete the add entry op."));
    return;
}

[3]: ManagedLedgerImpl.createNewOpAddEntryForNewLedger

do {
    existsOp = pendingAddEntries.poll();
    existsOp.close();
    pendingAddEntries.add(OpAddEntry.createNoRetainBuffer(....));
} while (...)

[4]: OpAddEntry.addComplete

if (!STATE_UPDATER.compareAndSet(OpAddEntry.this, State.INITIATED, State.COMPLETED)) {
    OpAddEntry.this.recycle();
    return;
}

Documentation

  • [ ] doc
  • [ ] doc-required
  • [x] doc-not-needed
  • [ ] doc-complete

Matching PR in forked repository

PR in forked repository: x

poorbarcode avatar Mar 07 '24 18:03 poorbarcode

/pulsarbot rerun-failure-checks

poorbarcode avatar Apr 15 '24 09:04 poorbarcode

Rebase master.

poorbarcode avatar Apr 15 '24 11:04 poorbarcode

@poorbarcode Could you please rebase to the master branch since the last commit of this PR was happened a month ago.

codelipenghui avatar May 20 '24 12:05 codelipenghui

@poorbarcode Could you please rebase to the master branch since the last commit of this PR was happened a month ago.

Rebased from branch-master.

poorbarcode avatar May 20 '24 12:05 poorbarcode

Codecov Report

Attention: Patch coverage is 95.65217% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 73.20%. Comparing base (bbc6224) to head (cda2b7d). Report is 291 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #22221      +/-   ##
============================================
- Coverage     73.57%   73.20%   -0.37%     
+ Complexity    32624     2511   -30113     
============================================
  Files          1877     1889      +12     
  Lines        139502   141424    +1922     
  Branches      15299    15518     +219     
============================================
+ Hits         102638   103536     +898     
- Misses        28908    29883     +975     
- Partials       7956     8005      +49     
Flag Coverage Δ
inttests 27.44% <30.43%> (+2.86%) :arrow_up:
systests 24.66% <43.47%> (+0.33%) :arrow_up:
unittests 72.22% <95.65%> (-0.63%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...org/apache/bookkeeper/mledger/impl/OpAddEntry.java 73.84% <100.00%> (+1.02%) :arrow_up:
...okkeeper/mledger/impl/ShadowManagedLedgerImpl.java 57.35% <100.00%> (+1.68%) :arrow_up:
...che/bookkeeper/mledger/impl/ManagedLedgerImpl.java 81.12% <92.30%> (+0.45%) :arrow_up:

... and 344 files with indirect coverage changes

codecov-commenter avatar May 21 '24 03:05 codecov-commenter

@poorbarcode I have created #23208 about the excessive log warnings that this change causes. Please take a look.

lhotari avatar Aug 21 '24 04:08 lhotari

PR that fixes the excessive warnings: #23209

lhotari avatar Aug 21 '24 10:08 lhotari