[improve][offloaders] Automatically evict Offloaded Ledgers from memory
Motivation
ManagedLedgerImpl retains eagerly a cache of all the BookKeeper ReadHandles. In case of Offloaded ReadHandler there is kind of a memory leak, as each BlobStoreBackedInputStreamImpl retains a DirectBuffer of 1MB, in the case of a topic with terabytes of data and thousands of ledger this leads to Direct OOM errors if something tries to open all the ledgers
Modifications
Add a new background activity that evicts from memory all the Offloaded ReadHandles and release memory.
The feature is controlled by the new configuration option managedLedgerInactiveOffloadedLedgerEvictionTimeSeconds=600
Unfortunately this fix cannot fully prevent a OODM error because there is no global count and limit of the memory retained by such Handles, it allows to mitigate the problem by releasing automatically unused Ledger Handlers. The default value, 10 minutes, is very conservative, but it should work with real-world ledgers.
The worst case scenario is a topic with tens of thousands of small ledgers with a consumer that reads from the topic from the beginning, in this case the broker will open the ReadHandlers probably more quickly than the eviction process pace.
Verifying this change
This change added tests.
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] The metrics
- [ ] Anything that affects deployment
Documentation
- [ ]
doc - [ ]
doc-required - [x]
doc-not-needed - [ ]
doc-complete
Matching PR in forked repository
PR in forked repository: https://github.com/eolivelli/pulsar/pull/22
@eolivelli We should start with a proposal for a new behavior that will introduced to Pulsar, and it also introduced a new configuration.
Since we will start the RC version of 3.0.0 on 2023-04-11, I will change the label/milestone of PR who have not been merged.
- The PR of type
featureis deferred to3.1.0 - The PR of type
fixis deferred to3.0.1
So drag this PR to 3.0.1
The pr had no activity for 30 days, mark with Stale label.
@eolivelli Is this something for the near term?
The pr had no activity for 30 days, mark with Stale label.
@eolivelli @dlg99 Any chance to rebase this PR?
I have rebased this PR by merging in origin/master to this PR and resolving the conflicts.
We already have
invalidateReadHandlemethod to invalidate the read handles. Why should we introduce a time-based invalidation for the offloaded read handles? Or, if the change wants to introduce time-based read handle invalidation, why it can't apply to the bookkeeper read handle?
Good point @codelipenghui
Codecov Report
Attention: Patch coverage is 81.03448% with 11 lines in your changes missing coverage. Please review.
Project coverage is 74.34%. Comparing base (
bbc6224) to head (0315714). Report is 685 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #19783 +/- ##
============================================
+ Coverage 73.57% 74.34% +0.77%
- Complexity 32624 34417 +1793
============================================
Files 1877 1944 +67
Lines 139502 147027 +7525
Branches 15299 16201 +902
============================================
+ Hits 102638 109314 +6676
- Misses 28908 29286 +378
- Partials 7956 8427 +471
| Flag | Coverage Δ | |
|---|---|---|
| inttests | 27.45% <48.07%> (+2.87%) |
:arrow_up: |
| systests | 24.44% <57.69%> (+0.11%) |
:arrow_up: |
| unittests | 73.72% <81.03%> (+0.87%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Files with missing lines | Coverage Δ | |
|---|---|---|
| ...apache/bookkeeper/mledger/ManagedLedgerConfig.java | 96.57% <100.00%> (+0.27%) |
:arrow_up: |
| ...org/apache/pulsar/broker/ServiceConfiguration.java | 98.98% <100.00%> (-0.42%) |
:arrow_down: |
| ...rg/apache/pulsar/broker/service/BrokerService.java | 83.89% <100.00%> (+3.11%) |
:arrow_up: |
| ...ache/bookkeeper/mledger/OffloadedLedgerHandle.java | 0.00% <0.00%> (ø) |
|
| ...oad/jcloud/impl/BlobStoreBackedReadHandleImpl.java | 77.18% <83.33%> (-2.97%) |
:arrow_down: |
| ...che/bookkeeper/mledger/impl/ManagedLedgerImpl.java | 81.72% <79.06%> (+1.06%) |
:arrow_up: |
@eolivelli @lhotari
If you are not working on this PR anymore, I will take over this fix tomorrow.
/pulsarbot rerun-failure-checks
@eolivelli @lhotari
If you are not working on this PR anymore, I will take over this fix tomorrow.
@poorbarcode Your review comments have been addressed. Please review again.
/pulsarbot rerun-failure-checks
@poorbarcode I'm merging this. You can create follow-up PRs to improve the situation after this has been merged if you find that important.
Another related fix to prevent OOM errors is #24655