pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[fix][broker] Deprecated aggregatePublisherStatsByProducerName config

Open heesung-sohn opened this issue 3 years ago • 2 comments

Motivation

The index-based publisher stat aggregation(configured by aggregatePublisherStatsByProducerName=false, default) can burst memory or wrongly aggregate publisher metrics if each partition stat returns a different size or order of the publisher stat list. In the worst case, if there are many partitions and publishers created and closed concurrently, the current code can create PublisherStatsImpl objects exponentially, and this can cause a high GC time or OOM.

Issue Code reference:

https://github.com/apache/pulsar/commit/2c428f7fcc740525e8120566c0272fc863ffebf1#diff-02e50674125a597f8ae3405a884590759f2fdaa10104cea511d5ea44b6ff6490R224-R247

Modifications

  • Deprecated the aggregatePublisherStatsByProducerName broker config because the default, the index-based aggregation is inherently wrong in a highly concurrent producer environment(where the order and size of the publisher stat list are not guaranteed to be the same). The publisher stats need to be aggregated by a unique key, the producer name(aggregatePublisherStatsByProducerName=true).
  • If PublisherStatsImpl.publisherName happens to be null, it will aggregate it with the stat object with "null" name. (However, this is unlikely because the latest version of the brokers generates a globally unique name If not given.)
  • Use HashMap instead List to aggregate publisher stats.
  • Removed sorting from getPublisher(), as this can be expensive for a large number of publishers.
  • Simplified the stat aggregation code.

Verifying this change

  • [x] Make sure that the change passes the CI checks.

This change added unit tests.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • [ ] Dependencies (add or upgrade a dependency)
  • [x] The public API
  • [ ] The schema
  • [x] The default values of configurations
  • [ ] The threading model
  • [ ] The binary protocol
  • [ ] The REST endpoints
  • [ ] The admin CLI options
  • [ ] Anything that affects deployment

Documentation

  • [ ] doc
  • [x] doc-required
  • [] doc-not-needed
  • [ ] doc-complete

Matching PR in forked repository

PR in forked repository: https://github.com/heesung-sn/pulsar/pull/13

heesung-sohn avatar Oct 28 '22 19:10 heesung-sohn

Codecov Report

Merging #18254 (6228a5c) into master (6c65ca0) will increase coverage by 3.96%. The diff coverage is 35.90%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #18254      +/-   ##
============================================
+ Coverage     34.91%   38.88%   +3.96%     
- Complexity     5707     8307    +2600     
============================================
  Files           607      685      +78     
  Lines         53396    67330   +13934     
  Branches       5712     7215    +1503     
============================================
+ Hits          18644    26180    +7536     
- Misses        32119    38132    +6013     
- Partials       2633     3018     +385     
Flag Coverage Δ
unittests 38.88% <35.90%> (+3.96%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...org/apache/bookkeeper/mledger/LedgerOffloader.java 0.00% <ø> (ø)
...che/bookkeeper/mledger/LedgerOffloaderFactory.java 0.00% <ø> (ø)
...pache/bookkeeper/mledger/LedgerOffloaderStats.java 0.00% <ø> (ø)
...ookkeeper/mledger/LedgerOffloaderStatsDisable.java 0.00% <ø> (ø)
...a/org/apache/bookkeeper/mledger/ManagedCursor.java 85.71% <ø> (ø)
...che/bookkeeper/mledger/ManagedLedgerException.java 41.17% <ø> (ø)
...bookkeeper/mledger/ManagedLedgerFactoryConfig.java 86.66% <ø> (ø)
...g/apache/bookkeeper/mledger/ManagedLedgerInfo.java 22.22% <ø> (ø)
...he/bookkeeper/mledger/OffloadedLedgerMetadata.java 0.00% <ø> (ø)
...ava/org/apache/bookkeeper/mledger/ScanOutcome.java 0.00% <ø> (ø)
... and 729 more

codecov-commenter avatar Nov 01 '22 11:11 codecov-commenter

/pulsarbot rerun-failure-checks

poorbarcode avatar Nov 02 '22 04:11 poorbarcode

In my understanding, old partitioned producer clients are not aggregated by producerName completely. So, this PR introduce breaking changes to them. Partitioned producers have the same producerName between partitions as default from https://github.com/apache/pulsar/pull/10279 commit. Therefore, I kept a list-indexed aggregation for backward compatibility. https://github.com/apache/pulsar/pull/12401 I think we should not introduce breaking changes, at least between the same major versions.

In addition, I think this approach doesn't deprecate a feature but removes a feature. Users can't disable aggregatePublisherStatsByProducerName from config.

equanz avatar Nov 07 '22 10:11 equanz

Yes, this pr is not backward-compatible to deprecate the broken feature.

heesung-sohn avatar Nov 08 '22 01:11 heesung-sohn

For example, when enabling aggregatePublisherStatsByProducerName in the v2.10.1 server

% grep 'aggregatePublisherStatsByProducerName' conf/standalone.conf
aggregatePublisherStatsByProducerName=true

and create the partitioned producer to the partitioned topic by the v2.8.4 java client,

 ./bin/pulsar-perf produce -r 1 -s 10 persistent://public/default/pt
...
11:28:34.053 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Starting Pulsar perf producer with config: {
...
  "producerName" : null,
...
11:28:34.969 [pulsar-client-io-2-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://public/default/pt-partition-0] [standalone-0-1] Created producer on cnx [id: 0x0e5ab2e4, L:/127.0.0.1:49990 - R:localhost/127.0.0.1:6650]
11:28:34.992 [pulsar-client-io-2-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://public/default/pt-partition-1] [standalone-0-0] Created producer on cnx [id: 0x32a3290d, L:/127.0.0.1:49989 - R:localhost/127.0.0.1:6650]
...

then the broker returns partitioned topic stats like the one below.

% ./bin/pulsar-admin topics partitioned-stats persistent://public/default/pt
...
  "publishers" : [ {
    "msgRateIn" : 0.0,
    "msgThroughputIn" : 0.0,
    "averageMsgSize" : 0.0,
    "chunkedMessageRate" : 0.0,
    "producerId" : 0,
    "supportsPartialProducer" : false
  } ],
...

However, when running with your fixes on the server side and creating a producer to a partitioned topic by the v2.8.4 java client,

% ./bin/pulsar-perf produce -r 1 -s 10 persistent://public/default/pt
...
  "producerName" : null,
...
11:30:37.289 [pulsar-client-io-2-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://public/default/pt-partition-1] [standalone-0-1] Created producer on cnx [id: 0x393b0652, L:/127.0.0.1:50035 - R:localhost/127.0.0.1:6650]
11:30:37.307 [pulsar-client-io-2-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://public/default/pt-partition-0] [standalone-0-0] Created producer on cnx [id: 0x9b73aa95, L:/127.0.0.1:50034 - R:localhost/127.0.0.1:6650]
...

then the broker returns partitioned topic stats like the one below. As you can see, publishers can't be aggregated by producerName even if the partitioned producer is alone in the partitioned topic. It is breaking changes.

% ./bin/pulsar-admin topics partitioned-stats persistent://public/default/pt
...
  "publishers" : [ {
    "msgRateIn" : 0.0,
    "msgThroughputIn" : 0.0,
    "averageMsgSize" : 0.0,
    "chunkedMessageRate" : 0.0,
    "producerId" : 0,
    "supportsPartialProducer" : true,
    "producerName" : "standalone-0-1"
  }, {
    "msgRateIn" : 0.0,
    "msgThroughputIn" : 0.0,
    "averageMsgSize" : 0.0,
    "chunkedMessageRate" : 0.0,
    "producerId" : 0,
    "supportsPartialProducer" : true,
    "producerName" : "standalone-0-0"
  } ],
...

I understand the current index-based aggregation procedure has an issue you mentioned in this PR but are these braking changes allowed on this project?

equanz avatar Nov 11 '22 02:11 equanz

@equanz,

Thank you for sharing the example.

Although it is not desirable, I think we could be lenient on this change in the stat API response(assuming this publishers stat struct is used for human admins only for ad-hoc checks).

If this change is not acceptable, I guess we could push this fix to the next major version.

Also, when there are thousands of producers(consumers) for a (partitioned-)topic, it is expensive to aggregate each publisher(subscriptions)'s stats on-the-fly across the brokers.

So, alternatively for the next major version, I think we could further define producers(subscriptions)' API like the below and drop the publishers and subscriptions structs from topics (partitioned-)stats returns.

pulsar-admin publishers list my-topic --last-pagination-key xyz
pulsar-admin publishers stats my-producer

# similarly for subscriptions

heesung-sohn avatar Nov 11 '22 19:11 heesung-sohn

@equanz,

To fix the issue, do you have any suggestions other than what I mentioned above?

heesung-sohn avatar Nov 14 '22 22:11 heesung-sohn

I have no other suggestions. Why don't you discuss or vote on the dev@ ML before introducing this feature?

equanz avatar Nov 15 '22 02:11 equanz

Opened a pulsar community discussion thread here.

https://lists.apache.org/thread/vofv1oz0wvzlwk4x9vk067rhkscn8bqo

heesung-sohn avatar Nov 15 '22 03:11 heesung-sohn