pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[improve][offloaders] Automatically evict Offloaded Ledgers from memory

Open eolivelli opened this issue 2 years ago • 9 comments

Motivation

ManagedLedgerImpl retains eagerly a cache of all the BookKeeper ReadHandles. In case of Offloaded ReadHandler there is kind of a memory leak, as each BlobStoreBackedInputStreamImpl retains a DirectBuffer of 1MB, in the case of a topic with terabytes of data and thousands of ledger this leads to Direct OOM errors if something tries to open all the ledgers

Modifications

Add a new background activity that evicts from memory all the Offloaded ReadHandles and release memory.

The feature is controlled by the new configuration option managedLedgerInactiveOffloadedLedgerEvictionTimeSeconds=600

Unfortunately this fix cannot fully prevent a OODM error because there is no global count and limit of the memory retained by such Handles, it allows to mitigate the problem by releasing automatically unused Ledger Handlers. The default value, 10 minutes, is very conservative, but it should work with real-world ledgers.

The worst case scenario is a topic with tens of thousands of small ledgers with a consumer that reads from the topic from the beginning, in this case the broker will open the ReadHandlers probably more quickly than the eviction process pace.

Verifying this change

This change added tests.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • [ ] Dependencies (add or upgrade a dependency)
  • [ ] The public API
  • [ ] The schema
  • [ ] The default values of configurations
  • [ ] The threading model
  • [ ] The binary protocol
  • [ ] The REST endpoints
  • [ ] The admin CLI options
  • [ ] The metrics
  • [ ] Anything that affects deployment

Documentation

  • [ ] doc
  • [ ] doc-required
  • [x] doc-not-needed
  • [ ] doc-complete

Matching PR in forked repository

PR in forked repository: https://github.com/eolivelli/pulsar/pull/22

eolivelli avatar Mar 10 '23 11:03 eolivelli

@eolivelli We should start with a proposal for a new behavior that will introduced to Pulsar, and it also introduced a new configuration.

codelipenghui avatar Mar 13 '23 23:03 codelipenghui

Since we will start the RC version of 3.0.0 on 2023-04-11, I will change the label/milestone of PR who have not been merged.

  • The PR of type feature is deferred to 3.1.0
  • The PR of type fix is deferred to 3.0.1

So drag this PR to 3.0.1

poorbarcode avatar Apr 10 '23 15:04 poorbarcode

The pr had no activity for 30 days, mark with Stale label.

github-actions[bot] avatar May 11 '23 01:05 github-actions[bot]

@eolivelli Is this something for the near term?

dave2wave avatar Jul 16 '23 19:07 dave2wave

The pr had no activity for 30 days, mark with Stale label.

github-actions[bot] avatar Aug 31 '23 01:08 github-actions[bot]

@eolivelli @dlg99 Any chance to rebase this PR?

lhotari avatar Oct 10 '24 17:10 lhotari

I have rebased this PR by merging in origin/master to this PR and resolving the conflicts.

lhotari avatar Oct 16 '24 14:10 lhotari

We already have invalidateReadHandle method to invalidate the read handles. Why should we introduce a time-based invalidation for the offloaded read handles? Or, if the change wants to introduce time-based read handle invalidation, why it can't apply to the bookkeeper read handle?

Good point @codelipenghui

lhotari avatar Oct 16 '24 15:10 lhotari

Codecov Report

Attention: Patch coverage is 81.03448% with 11 lines in your changes missing coverage. Please review.

Project coverage is 74.34%. Comparing base (bbc6224) to head (0315714). Report is 685 commits behind head on master.

Files with missing lines Patch % Lines
...che/bookkeeper/mledger/impl/ManagedLedgerImpl.java 79.06% 5 Missing and 4 partials :warning:
...ache/bookkeeper/mledger/OffloadedLedgerHandle.java 0.00% 1 Missing :warning:
...oad/jcloud/impl/BlobStoreBackedReadHandleImpl.java 83.33% 1 Missing :warning:
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #19783      +/-   ##
============================================
+ Coverage     73.57%   74.34%   +0.77%     
- Complexity    32624    34417    +1793     
============================================
  Files          1877     1944      +67     
  Lines        139502   147027    +7525     
  Branches      15299    16201     +902     
============================================
+ Hits         102638   109314    +6676     
- Misses        28908    29286     +378     
- Partials       7956     8427     +471     
Flag Coverage Δ
inttests 27.45% <48.07%> (+2.87%) :arrow_up:
systests 24.44% <57.69%> (+0.11%) :arrow_up:
unittests 73.72% <81.03%> (+0.87%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...apache/bookkeeper/mledger/ManagedLedgerConfig.java 96.57% <100.00%> (+0.27%) :arrow_up:
...org/apache/pulsar/broker/ServiceConfiguration.java 98.98% <100.00%> (-0.42%) :arrow_down:
...rg/apache/pulsar/broker/service/BrokerService.java 83.89% <100.00%> (+3.11%) :arrow_up:
...ache/bookkeeper/mledger/OffloadedLedgerHandle.java 0.00% <0.00%> (ø)
...oad/jcloud/impl/BlobStoreBackedReadHandleImpl.java 77.18% <83.33%> (-2.97%) :arrow_down:
...che/bookkeeper/mledger/impl/ManagedLedgerImpl.java 81.72% <79.06%> (+1.06%) :arrow_up:

... and 646 files with indirect coverage changes

codecov-commenter avatar Oct 16 '24 18:10 codecov-commenter

@eolivelli @lhotari

If you are not working on this PR anymore, I will take over this fix tomorrow.

poorbarcode avatar May 21 '25 09:05 poorbarcode

/pulsarbot rerun-failure-checks

lhotari avatar May 21 '25 10:05 lhotari

@eolivelli @lhotari

If you are not working on this PR anymore, I will take over this fix tomorrow.

@poorbarcode Your review comments have been addressed. Please review again.

lhotari avatar May 21 '25 10:05 lhotari

/pulsarbot rerun-failure-checks

lhotari avatar May 30 '25 15:05 lhotari

@poorbarcode I'm merging this. You can create follow-up PRs to improve the situation after this has been merged if you find that important.

lhotari avatar Jun 02 '25 07:06 lhotari

Another related fix to prevent OOM errors is #24655

lhotari avatar Aug 25 '25 08:08 lhotari