pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[improve] [broker] apply topK machanism to ModularLoadManagerImpl.

Open thetumbled opened this issue 1 year ago • 4 comments

Motivation

ModularLoadManagerImpl rely on zk to store and synchronize metadata about load, which pose greate pressure on zk, threatening the stability of system. Every broker will upload its LocalBrokerData to zk, and leader broker will retrieve all LocalBrokerData from zk, generate all BundleData from each LocalBrokerData, and update all BundleData to zk.

As every bundle in the cluster corresponds to a zk node, it is common that there are thousands of zk nodes in a cluster, which results into thousands of read/update operations to zk. This will cause a lot of pressure on zk.

As All Load Shedding Algorithm pick bundles from top to bottom based on throughput/msgRate, bundles with low throughput/msgRate are rarely be selected for shedding. So that we don't need to contain these bundles in the bundle load report.

Modifications

Reuse the configuration loadBalancerMaxNumberOfBundlesInBundleLoadReport in ExtensibleLoadManager, apply the topK mechanism to ModularLoadManagerImpl.

Verifying this change

  • [x] Make sure that the change passes the CI checks.

(Please pick either of the following options) This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • [ ] Dependencies (add or upgrade a dependency)
  • [ ] The public API
  • [ ] The schema
  • [ ] The default values of configurations
  • [ ] The threading model
  • [ ] The binary protocol
  • [ ] The REST endpoints
  • [ ] The admin CLI options
  • [ ] The metrics
  • [ ] Anything that affects deployment

Documentation

  • [ ] doc
  • [ ] doc-required
  • [x] doc-not-needed
  • [ ] doc-complete

Matching PR in forked repository

PR in forked repository: https://github.com/thetumbled/pulsar/pull/52

thetumbled avatar May 21 '24 11:05 thetumbled

List some experimetal data to show the improvement. I deploy a cluster with 5 nodes, and there are 1k+ bundles in the cluster. Set topK=20, so that there will be 20*5=100 bundles to be uploaded to zk, which means we filter out 90% bundles. It is ok that we filter so many bundles as most of the them are small bundles. Following graph show the effects: image image Write latency has decreased from 200ms to single digits!

Not only do write latency and throughput decrease, but read latency and throughput also decrease significantly! Because as long as there is less writing, there is also less reading. image image

thetumbled avatar May 21 '24 11:05 thetumbled

PTAL, thanks. @Demogorgon314 @heesung-sn @codelipenghui @BewareMyPower @lhotari

thetumbled avatar May 21 '24 11:05 thetumbled

/pulsarbot rerun-failure-checks

thetumbled avatar May 22 '24 09:05 thetumbled

Codecov Report

Attention: Patch coverage is 75.00000% with 11 lines in your changes are missing coverage. Please review.

Project coverage is 73.18%. Comparing base (bbc6224) to head (45057cb). Report is 297 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #22753      +/-   ##
============================================
- Coverage     73.57%   73.18%   -0.40%     
+ Complexity    32624     2511   -30113     
============================================
  Files          1877     1889      +12     
  Lines        139502   141482    +1980     
  Branches      15299    15526     +227     
============================================
+ Hits         102638   103542     +904     
- Misses        28908    29931    +1023     
- Partials       7956     8009      +53     
Flag Coverage Δ
inttests 27.18% <2.27%> (+2.60%) :arrow_up:
systests 24.74% <65.90%> (+0.42%) :arrow_up:
unittests 72.19% <75.00%> (-0.65%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...org/apache/pulsar/broker/ServiceConfiguration.java 99.40% <ø> (+<0.01%) :arrow_up:
...ker/loadbalance/extensions/models/TopKBundles.java 90.78% <ø> (ø)
...roker/loadbalance/impl/ModularLoadManagerImpl.java 84.56% <90.47%> (-0.42%) :arrow_down:
.../pulsar/policies/data/loadbalancer/BundleData.java 92.85% <60.00%> (-7.15%) :arrow_down:
...cies/data/loadbalancer/TimeAverageMessageData.java 88.13% <61.11%> (-11.87%) :arrow_down:

... and 351 files with indirect coverage changes

codecov-commenter avatar May 22 '24 10:05 codecov-commenter

As the proposal PR https://github.com/apache/pulsar/pull/22765 has been merged, could you help to review and approve this PR? thanks. @BewareMyPower @heesung-sn

thetumbled avatar May 27 '24 08:05 thetumbled

/pulsarbot rerun-failure-checks

thetumbled avatar May 27 '24 10:05 thetumbled