[fix][broker] fix prepareInitPoliciesCacheAsync in SystemTopicBasedTopicPoliciesService
Fixes #24977
Motivation
As shown in the issue, this PR fixes two problems:
1. cleanCacheAndCloseReader() is executed twice, which causes a concurrency error and leaves many orphan readers in SystemTopicBasedTopicPoliciesService.
2. A double update in policyCacheInitMap causes a recursive update error.
Modifications
- Call cleanPoliciesCacheInitMap only once when an exception is thrown.
- Avoid the double update in policyCacheInitMap by using putIfAbsent instead of computeIfAbsent. It is not appropriate to put so many operations inside compute().
- Add two tests that simulate exceptions thrown in createReader, initPolicyCache, and readMorePolicy of prepareInitPoliciesCacheAsync. By the way, SystemTopicBasedTopicPoliciesService seems to lack unit tests.
- Remove some logic from newReader(), because it was confusing when readCompletableFuture threw an exception.
- Keep the cleanPoliciesCacheInitMap() call in initPolicesCache() when closed.get()==true. Since the broker is already closed, cleaning twice is harmless there.
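The putIfAbsent approach and the single cleanup can be illustrated with a minimal sketch. It is not the code in this PR: the class `InitOnceSketch`, the `cleanups` counter, and the `initWork` supplier are hypothetical names used only for illustration, and the conditional `remove(key, value)` is an extra safeguard assumed here rather than something the PR description states.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for the init bookkeeping; not the Pulsar class.
class InitOnceSketch {
    // Plays the role of policyCacheInitMap, keyed by namespace name.
    final ConcurrentHashMap<String, CompletableFuture<Void>> initMap = new ConcurrentHashMap<>();
    // Counts how often the failure cleanup runs (illustration only).
    final AtomicInteger cleanups = new AtomicInteger();

    CompletableFuture<Void> prepareInit(String namespace, Supplier<CompletableFuture<Void>> initWork) {
        CompletableFuture<Void> mine = new CompletableFuture<>();
        // Register the future first; no other work happens inside a map callback.
        CompletableFuture<Void> existing = initMap.putIfAbsent(namespace, mine);
        if (existing != null) {
            return existing; // another request already owns the initialization
        }
        CompletableFuture<Void> work;
        try {
            work = initWork.get();
        } catch (RuntimeException e) {
            // Turn a synchronous failure into a failed future so there is one failure path.
            work = new CompletableFuture<>();
            work.completeExceptionally(e);
        }
        work.whenComplete((ignored, error) -> {
            if (error != null) {
                // Single cleanup point: runs once per failed initialization.
                cleanups.incrementAndGet();
                // Conditional remove: only drops the entry if it is still our own future.
                initMap.remove(namespace, mine);
                mine.completeExceptionally(error);
            } else {
                mine.complete(null);
            }
        });
        return mine;
    }
}
```

Because both synchronous and asynchronous failures funnel into the same `whenComplete` callback, the cleanup cannot run twice for one initialization attempt in this sketch.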
One point should be considered in this PR:
- With putIfAbsent, if many getTopicPolicy() calls trigger prepareInitPoliciesCacheAsync, many empty CompletableFuture objects are created. We could add a double check in the code to avoid this garbage, at the cost of uglier code.
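The double check mentioned above could look roughly like the following sketch. The class `DoubleCheckSketch`, the method `getOrRegister`, and the `allocations` counter are hypothetical and exist only to show the idea: a plain `get()` first, and an allocation plus `putIfAbsent` only on a miss.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of the "double check" idea; not code from this PR.
class DoubleCheckSketch {
    final ConcurrentHashMap<String, CompletableFuture<Void>> initMap = new ConcurrentHashMap<>();
    // Counts allocated futures so the effect of the first check is visible.
    final AtomicInteger allocations = new AtomicInteger();

    CompletableFuture<Void> getOrRegister(String namespace) {
        // First check: the common case, no allocation at all.
        CompletableFuture<Void> existing = initMap.get(namespace);
        if (existing != null) {
            return existing;
        }
        // Miss: allocate, then let putIfAbsent decide which future wins.
        allocations.incrementAndGet();
        CompletableFuture<Void> created = new CompletableFuture<>();
        existing = initMap.putIfAbsent(namespace, created);
        return existing != null ? existing : created;
    }
}
```

Under contention a few extra futures may still be allocated between the `get()` and the `putIfAbsent`, but repeated calls after registration allocate nothing.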
Besides, one case still exists: if closing the reader fails in cleanCacheAndCloseReader(), the closing reader may get a chance to reconnect and become an orphan reader. That is out of scope for this PR.
Verifying this change
- [ ] Make sure that the change passes the CI checks.
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] The metrics
- [ ] Anything that affects deployment
Documentation
- [ ] doc
- [ ] doc-required
- [x] doc-not-needed
- [ ] doc-complete
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 74.29%. Comparing base (6fdb4b9) to head (33ae945).
:warning: Report is 10 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #24980 +/- ##
=========================================
Coverage 74.29% 74.29%
- Complexity 34026 34066 +40
=========================================
Files 1920 1920
Lines 150252 150252
Branches 17428 17428
=========================================
+ Hits 111634 111636 +2
- Misses 29706 29735 +29
+ Partials 8912 8881 -31
| Flag | Coverage Δ | |
|---|---|---|
| inttests | 26.17% <75.00%> (-0.39%) | :arrow_down: |
| systests | 22.87% <67.85%> (-0.02%) | :arrow_down: |
| unittests | 73.84% <100.00%> (+0.02%) | :arrow_up: |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ | |
|---|---|---|
| .../service/SystemTopicBasedTopicPoliciesService.java | 77.86% <100.00%> (+0.19%) | :arrow_up: |
Which logic could cause this issue?
1. Request-1: policyCacheInitMap put future1
2. Request-1: create reader1
3. Request-1: readerCaches put reader1
4. reader1 read error
5. Request-1: first time cleanCacheAndCloseReader(), which includes:
   - remove reader1 in readerCaches
   - close reader1
   - remove future1 in policyCacheInitMap
6. Request-2: policyCacheInitMap put future2
7. Request-1: second time cleanCacheAndCloseReader(), which only removes future2 in policyCacheInitMap
8. Request-2: create reader2
9. Request-2: readerCaches put reader2
10. Request-3: policyCacheInitMap put future3
11. Request-3: create reader3
12. Request-3: readerCaches put reader3
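The interleaving above can be replayed deterministically in a single thread. The sketch below is a simplified model, not broker code: readers are plain string labels, `openReaders` is a hypothetical set tracking readers that were created and never closed, and `cleanup` stands in for cleanCacheAndCloseReader(). A reader counts as an orphan when it is still open but no longer referenced by readerCaches.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Simplified, single-threaded replay of the timeline; all names are illustrative.
class DoubleCleanupReplay {
    static List<String> replay(boolean cleanTwice) {
        ConcurrentHashMap<String, CompletableFuture<Void>> policyCacheInitMap = new ConcurrentHashMap<>();
        ConcurrentHashMap<String, String> readerCaches = new ConcurrentHashMap<>();
        Set<String> openReaders = new LinkedHashSet<>();
        String ns = "tenant/ns";

        // Request-1 registers future1, creates reader1 and caches it.
        policyCacheInitMap.putIfAbsent(ns, new CompletableFuture<>());
        openReaders.add("reader1");
        readerCaches.put(ns, "reader1");

        // reader1 read error: first cleanup closes reader1 and removes future1.
        cleanup(ns, policyCacheInitMap, readerCaches, openReaders);

        // Request-2 registers future2 and now owns the initialization.
        boolean request2Owns = policyCacheInitMap.putIfAbsent(ns, new CompletableFuture<>()) == null;

        // The buggy second cleanup from Request-1 removes future2.
        if (cleanTwice) {
            cleanup(ns, policyCacheInitMap, readerCaches, openReaders);
        }

        // Request-2 continues: creates reader2 and caches it.
        if (request2Owns) {
            openReaders.add("reader2");
            readerCaches.put(ns, "reader2");
        }

        // Request-3 only creates a reader if the init slot looks empty.
        if (policyCacheInitMap.putIfAbsent(ns, new CompletableFuture<>()) == null) {
            openReaders.add("reader3");
            readerCaches.put(ns, "reader3"); // overwrites reader2 without closing it
        }

        // Orphans: still open, but no longer tracked in readerCaches.
        List<String> orphans = new ArrayList<>(openReaders);
        orphans.removeAll(readerCaches.values());
        return orphans;
    }

    static void cleanup(String ns,
                        ConcurrentHashMap<String, CompletableFuture<Void>> policyCacheInitMap,
                        ConcurrentHashMap<String, String> readerCaches,
                        Set<String> openReaders) {
        String reader = readerCaches.remove(ns);
        if (reader != null) {
            openReaders.remove(reader); // models closing the reader
        }
        policyCacheInitMap.remove(ns); // unconditional: removes whatever future is there
    }
}
```

In this model, `replay(true)` leaves reader2 open and untracked, while `replay(false)` leaves no orphan because Request-3 finds future2 and does not create a second reader.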
Does this bug exist only in 3.0.x and not in the latest version?
@Technoboy- It happens when restarting brokers on version 3.0.x: restart broker-1, then restart broker-2 a short time later. When broker-1 loads a topic and calls getTopicPolicy, the corresponding __change_event topic on broker-2 is unloaded.
I am not using the latest version. In the latest version this concurrent case may be avoided by PR #24658, but the code still catches the exception and runs cleanCacheAndPolicyMap twice, which is dangerous.
How could the latest code cause the issue? I don't understand.
You can see the code in branch-3.0. The latest code is a bit different; the concurrent case was found on branch-3.0.
So it's better to fix it on branch-3.0. For the master branch, I don't think it's needed.
@Technoboy- I think it is better to fix it in the master branch as well. The current code in master is an improvement, not a true fix, and it still carries risk.
> As shown in the issue, this PR fixes two problems: 1. cleanCacheAndCloseReader() is executed twice, which causes a concurrency error and leaves many orphan readers in SystemTopicBasedTopicPoliciesService. 2. A double update in policyCacheInitMap causes a recursive update error.
I think that this problem also exists in the master branch, so merging this PR and cherry-picking it to the maintenance branches makes sense.
Depends on #24658 for branch-4.0 and branch-4.1
Flaky test #25081, please take a look