InflightReadsLimiter - limit the memory used by reads end-to-end (from storage/cache to the write to the consumer channel)
Motivation
The broker can run out of memory due to many reads enqueued on the PersistentDispatcherMultipleConsumers dispatchMessagesThread (used when dispatcherDispatchMessagesInSubscriptionThread is set to true, which is the default value). The limit on the amount of memory retained by reads MUST also take into account the entries coming from the cache.
When dispatcherDispatchMessagesInSubscriptionThread is false (the behaviour of Pulsar 2.10), there is a kind of natural (but still unpredictable!) back-pressure mechanism, because the thread that receives the entries from BookKeeper or from the cache dispatches them immediately and synchronously to the consumer and then releases them.
Modifications
- Add a new component (InflightReadsLimiter) that keeps track of the overall amount of memory retained by inflight reads.
- Add a new configuration entry managedLedgerMaxReadsInFlightSizeInMB
- The feature is disabled by default
- Add new metrics to track the status of the broker
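To make the idea behind the new component concrete, here is a minimal sketch of a memory-budget limiter. The class name mirrors the component, but the fields, method names, and blocking semantics below are illustrative assumptions, not the actual implementation in this PR:

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of an inflight-reads memory limiter.
// A read reserves its estimated size before being issued and returns
// the budget once the entries have been written to the consumer channel.
public class InflightReadsLimiterSketch {
    private final long maxBytes;   // <= 0 disables the limiter (the default)
    private long remainingBytes;
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition released = lock.newCondition();

    public InflightReadsLimiterSketch(long maxBytes) {
        this.maxBytes = maxBytes;
        this.remainingBytes = maxBytes;
    }

    public boolean isDisabled() {
        return maxBytes <= 0;
    }

    // Wait until `bytes` of budget is available, then reserve it.
    public void acquire(long bytes) throws InterruptedException {
        if (isDisabled()) {
            return; // feature off: never block
        }
        lock.lock();
        try {
            while (remainingBytes < bytes) {
                released.await();
            }
            remainingBytes -= bytes;
        } finally {
            lock.unlock();
        }
    }

    // Return budget after the entries are delivered and released.
    public void release(long bytes) {
        if (isDisabled()) {
            return;
        }
        lock.lock();
        try {
            remainingBytes += bytes;
            released.signalAll();
        } finally {
            lock.unlock();
        }
    }

    public long getRemainingBytes() {
        lock.lock();
        try {
            return remainingBytes;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        InflightReadsLimiterSketch limiter = new InflightReadsLimiterSketch(100);
        limiter.acquire(60);   // reserve 60 bytes before issuing the read
        limiter.release(60);   // give it back after writing to the channel
        System.out.println(limiter.getRemainingBytes()); // prints 100
    }
}
```

Because every dispatcher on the broker draws from the same budget, the retained memory stays bounded no matter how many subscriptions are issuing reads concurrently.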
Documentation
- [ ] doc
- [ ] doc-required
- [x] doc-not-needed
- [ ] doc-complete
Matching PR in forked repository
PR in forked repository: https://github.com/eolivelli/pulsar/pull/19
> Broker can go out of memory due to many reads enqueued on the PersistentDispatcherMultipleConsumers dispatchMessagesThread
@eolivelli Does it only happen when the broker has many subscriptions? For one subscription, we always trigger the new read entries operation after sending messages to consumers, so we only have a single active read entries operation at a time. Is it possible to reproduce?
@codelipenghui you are correct, the problems arise when you have a broker with many subscriptions on the same topic (and many topics). There is no broker-level guardrail at the moment. With this patch, the memory used to handle outbound traffic is capped, and the limit is independent of the number of active subscriptions on the broker.
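As a concrete illustration of that broker-level guardrail, and assuming the new configuration entry behaves as described above (disabled by default, expressed in MB, applied broker-wide), enabling it could look like this in broker.conf; the value below is purely illustrative:

```properties
# Cap the total memory retained by inflight reads (storage/cache reads that
# have not yet been written to consumer channels) across the whole broker.
# Disabled by default; the cap does not depend on the number of subscriptions.
managedLedgerMaxReadsInFlightSizeInMB=64
```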
Codecov Report
Merging #18245 (9cc5757) into master (b31c5a6) will increase coverage by 0.98%. The diff coverage is 45.65%.
```diff
@@             Coverage Diff              @@
##             master   #18245      +/-  ##
============================================
+ Coverage     46.98%   47.97%   +0.98%
+ Complexity    10343     9370     -973
============================================
  Files           692      613      -79
  Lines         67766    58415    -9351
  Branches       7259     6087    -1172
============================================
- Hits          31842    28026    -3816
+ Misses        32344    27377    -4967
+ Partials       3580     3012     -568
```
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 47.97% <45.65%> (+0.98%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...lsar/broker/service/RedeliveryTrackerDisabled.java | 50.00% <ø> (ø) | |
| ...va/org/apache/pulsar/broker/service/ServerCnx.java | 49.20% <ø> (+0.51%) | :arrow_up: |
| ...ersistentStreamingDispatcherMultipleConsumers.java | 0.00% <0.00%> (ø) | |
| .../java/org/apache/pulsar/client/impl/ClientCnx.java | 30.16% <ø> (ø) | |
| ...a/org/apache/pulsar/client/impl/TableViewImpl.java | 0.00% <0.00%> (ø) | |
| ...ar/client/impl/conf/ProducerConfigurationData.java | 84.70% <ø> (-0.18%) | :arrow_down: |
| ...va/org/apache/pulsar/client/impl/ConsumerImpl.java | 15.09% <12.50%> (+0.05%) | :arrow_up: |
| ...sistent/PersistentDispatcherMultipleConsumers.java | 58.49% <66.66%> (+1.67%) | :arrow_up: |
| ...ache/pulsar/broker/ManagedLedgerClientFactory.java | 62.16% <100.00%> (+1.05%) | :arrow_up: |
| .../pulsar/broker/service/AbstractBaseDispatcher.java | 60.81% <100.00%> (+2.89%) | :arrow_up: |
| ... and 134 more | | |
/pulsarbot run-failure-checks