[fix][broker] Deprecated aggregatePublisherStatsByProducerName config
Motivation
The index-based publisher stat aggregation(configured by aggregatePublisherStatsByProducerName=false, default) can burst memory or wrongly aggregate publisher metrics if each partition stat returns a different size or order of the publisher stat list.
In the worst case, if there are many partitions and publishers created and closed concurrently, the current code can create PublisherStatsImpl objects exponentially, and this can cause a high GC time or OOM.
Issue Code reference:
https://github.com/apache/pulsar/commit/2c428f7fcc740525e8120566c0272fc863ffebf1#diff-02e50674125a597f8ae3405a884590759f2fdaa10104cea511d5ea44b6ff6490R224-R247
Modifications
- Deprecated the
aggregatePublisherStatsByProducerNamebroker config because the default, the index-based aggregation is inherently wrong in a highly concurrent producer environment(where the order and size of the publisher stat list are not guaranteed to be the same). The publisher stats need to be aggregated by a unique key, the producer name(aggregatePublisherStatsByProducerName=true). - If PublisherStatsImpl.publisherName happens to be null, it will aggregate it with the stat object with "null" name. (However, this is unlikely because the latest version of the brokers generates a globally unique name If not given.)
- Use HashMap instead List to aggregate publisher stats.
- Removed sorting from getPublisher(), as this can be expensive for a large number of publishers.
- Simplified the stat aggregation code.
Verifying this change
- [x] Make sure that the change passes the CI checks.
This change added unit tests.
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
- [ ] Dependencies (add or upgrade a dependency)
- [x] The public API
- [ ] The schema
- [x] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] Anything that affects deployment
Documentation
- [ ]
doc - [x]
doc-required - []
doc-not-needed - [ ]
doc-complete
Matching PR in forked repository
PR in forked repository: https://github.com/heesung-sn/pulsar/pull/13
Codecov Report
Merging #18254 (6228a5c) into master (6c65ca0) will increase coverage by
3.96%. The diff coverage is35.90%.
@@ Coverage Diff @@
## master #18254 +/- ##
============================================
+ Coverage 34.91% 38.88% +3.96%
- Complexity 5707 8307 +2600
============================================
Files 607 685 +78
Lines 53396 67330 +13934
Branches 5712 7215 +1503
============================================
+ Hits 18644 26180 +7536
- Misses 32119 38132 +6013
- Partials 2633 3018 +385
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 38.88% <35.90%> (+3.96%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...org/apache/bookkeeper/mledger/LedgerOffloader.java | 0.00% <ø> (ø) |
|
| ...che/bookkeeper/mledger/LedgerOffloaderFactory.java | 0.00% <ø> (ø) |
|
| ...pache/bookkeeper/mledger/LedgerOffloaderStats.java | 0.00% <ø> (ø) |
|
| ...ookkeeper/mledger/LedgerOffloaderStatsDisable.java | 0.00% <ø> (ø) |
|
| ...a/org/apache/bookkeeper/mledger/ManagedCursor.java | 85.71% <ø> (ø) |
|
| ...che/bookkeeper/mledger/ManagedLedgerException.java | 41.17% <ø> (ø) |
|
| ...bookkeeper/mledger/ManagedLedgerFactoryConfig.java | 86.66% <ø> (ø) |
|
| ...g/apache/bookkeeper/mledger/ManagedLedgerInfo.java | 22.22% <ø> (ø) |
|
| ...he/bookkeeper/mledger/OffloadedLedgerMetadata.java | 0.00% <ø> (ø) |
|
| ...ava/org/apache/bookkeeper/mledger/ScanOutcome.java | 0.00% <ø> (ø) |
|
| ... and 729 more |
/pulsarbot rerun-failure-checks
In my understanding, old partitioned producer clients are not aggregated by producerName completely. So, this PR introduce breaking changes to them. Partitioned producers have the same producerName between partitions as default from https://github.com/apache/pulsar/pull/10279 commit. Therefore, I kept a list-indexed aggregation for backward compatibility. https://github.com/apache/pulsar/pull/12401 I think we should not introduce breaking changes, at least between the same major versions.
In addition, I think this approach doesn't deprecate a feature but removes a feature.
Users can't disable aggregatePublisherStatsByProducerName from config.
Yes, this pr is not backward-compatible to deprecate the broken feature.
For example, when enabling aggregatePublisherStatsByProducerName in the v2.10.1 server
% grep 'aggregatePublisherStatsByProducerName' conf/standalone.conf
aggregatePublisherStatsByProducerName=true
and create the partitioned producer to the partitioned topic by the v2.8.4 java client,
./bin/pulsar-perf produce -r 1 -s 10 persistent://public/default/pt
...
11:28:34.053 [main] INFO org.apache.pulsar.testclient.PerformanceProducer - Starting Pulsar perf producer with config: {
...
"producerName" : null,
...
11:28:34.969 [pulsar-client-io-2-1] INFO org.apache.pulsar.client.impl.ProducerImpl - [persistent://public/default/pt-partition-0] [standalone-0-1] Created producer on cnx [id: 0x0e5ab2e4, L:/127.0.0.1:49990 - R:localhost/127.0.0.1:6650]
11:28:34.992 [pulsar-client-io-2-1] INFO org.apache.pulsar.client.impl.ProducerImpl - [persistent://public/default/pt-partition-1] [standalone-0-0] Created producer on cnx [id: 0x32a3290d, L:/127.0.0.1:49989 - R:localhost/127.0.0.1:6650]
...
then the broker returns partitioned topic stats like the one below.
% ./bin/pulsar-admin topics partitioned-stats persistent://public/default/pt
...
"publishers" : [ {
"msgRateIn" : 0.0,
"msgThroughputIn" : 0.0,
"averageMsgSize" : 0.0,
"chunkedMessageRate" : 0.0,
"producerId" : 0,
"supportsPartialProducer" : false
} ],
...
However, when running with your fixes on the server side and creating a producer to a partitioned topic by the v2.8.4 java client,
% ./bin/pulsar-perf produce -r 1 -s 10 persistent://public/default/pt
...
"producerName" : null,
...
11:30:37.289 [pulsar-client-io-2-1] INFO org.apache.pulsar.client.impl.ProducerImpl - [persistent://public/default/pt-partition-1] [standalone-0-1] Created producer on cnx [id: 0x393b0652, L:/127.0.0.1:50035 - R:localhost/127.0.0.1:6650]
11:30:37.307 [pulsar-client-io-2-1] INFO org.apache.pulsar.client.impl.ProducerImpl - [persistent://public/default/pt-partition-0] [standalone-0-0] Created producer on cnx [id: 0x9b73aa95, L:/127.0.0.1:50034 - R:localhost/127.0.0.1:6650]
...
then the broker returns partitioned topic stats like the one below. As you can see, publishers can't be aggregated by producerName even if the partitioned producer is alone in the partitioned topic. It is breaking changes.
% ./bin/pulsar-admin topics partitioned-stats persistent://public/default/pt
...
"publishers" : [ {
"msgRateIn" : 0.0,
"msgThroughputIn" : 0.0,
"averageMsgSize" : 0.0,
"chunkedMessageRate" : 0.0,
"producerId" : 0,
"supportsPartialProducer" : true,
"producerName" : "standalone-0-1"
}, {
"msgRateIn" : 0.0,
"msgThroughputIn" : 0.0,
"averageMsgSize" : 0.0,
"chunkedMessageRate" : 0.0,
"producerId" : 0,
"supportsPartialProducer" : true,
"producerName" : "standalone-0-0"
} ],
...
I understand the current index-based aggregation procedure has an issue you mentioned in this PR but are these braking changes allowed on this project?
@equanz,
Thank you for sharing the example.
Although it is not desirable, I think we could be lenient on this change in the stat API response(assuming this publishers stat struct is used for human admins only for ad-hoc checks).
If this change is not acceptable, I guess we could push this fix to the next major version.
Also, when there are thousands of producers(consumers) for a (partitioned-)topic, it is expensive to aggregate each publisher(subscriptions)'s stats on-the-fly across the brokers.
So, alternatively for the next major version, I think we could further define producers(subscriptions)' API like the below and drop the publishers and subscriptions structs from topics (partitioned-)stats returns.
pulsar-admin publishers list my-topic --last-pagination-key xyz
pulsar-admin publishers stats my-producer
# similarly for subscriptions
@equanz,
To fix the issue, do you have any suggestions other than what I mentioned above?
I have no other suggestions. Why don't you discuss or vote on the dev@ ML before introducing this feature?
Opened a pulsar community discussion thread here.
https://lists.apache.org/thread/vofv1oz0wvzlwk4x9vk067rhkscn8bqo