
[fix][txn] fix concurrent error cause txn stuck in TransactionBufferHandlerImpl#endTxn

Open TakaHiR07 opened this issue 1 year ago • 1 comments

Fixes https://github.com/apache/pulsar/issues/23550

Motivation

After diving into the code, I found a concurrency bug in TransactionBufferHandlerImpl#checkRequestCredits() and checkPendingRequests() that causes the above issue.

Currently, the config TransactionBufferClientMaxConcurrentRequests controls the number of concurrent requests. However, if requests and responses are interleaved as follows, a request gets permanently stuck in the queue. (To simplify the case, assume the permit count is 1.)

| step | request-1 | request-2 | response-1 | request-3 |
| --- | --- | --- | --- | --- |
| 1 | start checkRequestCredits() | | | |
| 2 | compareAndSet requestCredits to 0 | | | |
| 3 | execute endTxn | | | |
| 4 | | start checkRequestCredits() | | |
| 5 | | get currentPermits = 0 | | |
| 6 | | | trigger onResponse(), set requestCredits to 1 | |
| 7 | | | trigger checkPendingRequests(); permits == 1 && pendingRequests is empty, so break out of the while loop | |
| 8 | | currentPermits == 0 && pendingRequests is empty, so add op to pendingRequests | | |
| 9 | | | | start checkRequestCredits() |
| 10 | | | | currentPermits == 1 && pendingRequests is not empty, so also add op to pendingRequests |

At this point no response is left that can trigger a removal from pendingRequests, so every new request is just added to pendingRequests and never executed.
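The interleaving described above can be replayed deterministically in a single thread. The following is a minimal sketch, not the Pulsar source: `CreditRaceSketch`, its fields, and the `String` ops are hypothetical stand-ins, and the credit read in checkRequestCredits() is passed in as a parameter so that the stale read of request-2 can be reproduced on purpose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified model of the credit logic (not the real handler class).
class CreditRaceSketch {
    final AtomicInteger credits = new AtomicInteger(1);          // permit count is 1
    final Queue<String> pending = new ConcurrentLinkedQueue<>();
    final List<String> executed = new ArrayList<>();             // single-threaded replay only

    // observed = the credit value this request read earlier (possibly stale).
    boolean checkRequestCredits(String op, int observed) {
        if (observed > 0 && pending.peek() == null) {
            if (credits.compareAndSet(observed, observed - 1)) {
                executed.add(op);                                // stands in for sending endTxn
                return true;
            }
            return checkRequestCredits(op, credits.get());       // CAS lost: retry with a fresh read
        }
        pending.add(op);                                         // nothing re-checks the queue here
        return false;
    }

    void onResponse() {
        credits.incrementAndGet();
        checkPendingRequests();
    }

    void checkPendingRequests() {
        while (true) {
            int permits = credits.get();
            if (permits > 0 && pending.peek() != null) {
                if (credits.compareAndSet(permits, permits - 1)) {
                    String op = pending.poll();
                    if (op != null) {
                        executed.add(op);
                    } else {
                        credits.incrementAndGet();               // give the credit back
                    }
                }
            } else {
                break;
            }
        }
    }

    static CreditRaceSketch replay() {
        CreditRaceSketch h = new CreditRaceSketch();
        h.checkRequestCredits("request-1", h.credits.get());     // steps 1-3: credits 1 -> 0
        int staleRead = h.credits.get();                         // steps 4-5: request-2 reads 0
        h.onResponse();                                          // steps 6-7: credits -> 1, queue empty
        h.checkRequestCredits("request-2", staleRead);           // step 8: enqueued on the stale 0
        h.checkRequestCredits("request-3", h.credits.get());     // steps 9-10: queue not empty, enqueued
        return h;
    }

    public static void main(String[] args) {
        CreditRaceSketch h = replay();
        // prints: credits=1 pending=[request-2, request-3] executed=[request-1]
        System.out.println("credits=" + h.credits.get()
                + " pending=" + h.pending + " executed=" + h.executed);
    }
}
```

In this model a credit is available at the end, yet no request is in flight, so no further onResponse() will ever arrive to drain the queue.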

Modifications

The root cause is that currently only onResponse() can trigger a removal from pendingRequests. But when onResponse() runs, the request op may not have been added to pendingRequests yet.

  • So one modification is to let checkRequestCredits() also check the pendingRequests queue.
  • The while(true) loop in checkPendingRequests() is not necessary: when one response comes back, taking one request op from pendingRequests is enough.

It is hard to add a test for this concurrency case.
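One way to picture the two modifications is the sketch below. It is an assumption about the shape of the change, not the actual diff of this PR: `CreditFixSketch` and its members are hypothetical, ops are plain strings, and the replay is single-threaded. Here checkRequestCredits() re-checks the queue right after enqueueing, and checkPendingRequests() takes at most one op per call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the proposed behavior (not the real handler class).
class CreditFixSketch {
    final AtomicInteger credits = new AtomicInteger(1);          // permit count is 1
    final Queue<String> pending = new ConcurrentLinkedQueue<>();
    final List<String> executed = new ArrayList<>();             // single-threaded replay only

    boolean checkRequestCredits(String op, int observed) {
        if (observed > 0 && pending.peek() == null) {
            if (credits.compareAndSet(observed, observed - 1)) {
                executed.add(op);                                // stands in for sending endTxn
                return true;
            }
            return checkRequestCredits(op, credits.get());
        }
        pending.add(op);
        checkPendingRequests();                                  // assumed fix: re-check after enqueueing
        return false;
    }

    void onResponse() {
        credits.incrementAndGet();
        checkPendingRequests();
    }

    // No while(true): one call releases at most one queued op.
    void checkPendingRequests() {
        int permits = credits.get();
        if (permits > 0 && pending.peek() != null
                && credits.compareAndSet(permits, permits - 1)) {
            String op = pending.poll();
            if (op != null) {
                executed.add(op);
            } else {
                credits.incrementAndGet();                       // give the credit back
            }
        }
    }

    static CreditFixSketch replay() {
        CreditFixSketch h = new CreditFixSketch();
        h.checkRequestCredits("request-1", h.credits.get());     // credits 1 -> 0
        int staleRead = h.credits.get();                         // request-2 reads 0
        h.onResponse();                                          // credits -> 1, queue still empty
        h.checkRequestCredits("request-2", staleRead);           // enqueued, then released at once
        h.checkRequestCredits("request-3", h.credits.get());     // no credit left: waits in queue
        return h;
    }

    public static void main(String[] args) {
        CreditFixSketch h = replay();
        // prints: executed=[request-1, request-2] pending=[request-3]
        System.out.println("executed=" + h.executed + " pending=" + h.pending);
        h.onResponse();                                          // response for request-2
        // prints: executed=[request-1, request-2, request-3] pending=[]
        System.out.println("executed=" + h.executed + " pending=" + h.pending);
    }
}
```

With the same interleaving, request-2 is released by its own enqueue path, and request-3 is released by the response to request-2, so the queue keeps draining.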

Verifying this change

  • [ ] Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • [ ] Dependencies (add or upgrade a dependency)
  • [ ] The public API
  • [ ] The schema
  • [ ] The default values of configurations
  • [ ] The threading model
  • [ ] The binary protocol
  • [ ] The REST endpoints
  • [ ] The admin CLI options
  • [ ] The metrics
  • [ ] Anything that affects deployment

Documentation

  • [ ] doc
  • [ ] doc-required
  • [x] doc-not-needed
  • [ ] doc-complete

Matching PR in forked repository

PR in forked repository:

TakaHiR07 · Nov 04 '24 08:11

@codelipenghui @congbobo184 Could you help review this PR?

TakaHiR07 · Nov 04 '24 08:11

Codecov Report

:x: Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 74.29%. Comparing base (676ba07) to head (ba2daca). :warning: Report is 6 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...tion/buffer/impl/TransactionBufferHandlerImpl.java | 33.33% | 1 Missing and 1 partial :warning: |
Additional details and impacted files

@@              Coverage Diff              @@
##             master   #23551       +/-   ##
=============================================
+ Coverage     38.56%   74.29%   +35.73%     
- Complexity    13262    33920    +20658     
=============================================
  Files          1856     1913       +57     
  Lines        145287   149503     +4216     
  Branches      16877    17372      +495     
=============================================
+ Hits          56025   111074    +55049     
+ Misses        81696    29582    -52114     
- Partials       7566     8847     +1281     
| Flag | Coverage Δ |
| --- | --- |
| inttests | 26.24% <0.00%> (+0.06%) :arrow_up: |
| systests | 22.75% <0.00%> (-0.01%) :arrow_down: |
| unittests | 73.81% <33.33%> (+39.07%) :arrow_up: |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| ...tion/buffer/impl/TransactionBufferHandlerImpl.java | 66.25% <33.33%> (+15.32%) :arrow_up: |

... and 1410 files with indirect coverage changes

codecov-commenter · Jul 30 '25 23:07