[fix][load-balancer] skip mis-configured resource usage(>100%) in load balancer

Open heesung-sohn opened this issue 3 years ago • 3 comments

Motivation

Incorrectly scaled resource usage (CPU, memory, or network usage above 100%) can skew the load computation in the load balancer, because that computation expects every resource usage to be normalized to a 0-100% scale.

Also, we need more logs to debug load balancing issues in production, for example to investigate why the load balancer does not unload bundles to underloaded brokers.

Modifications

  • Added fallback logic in ThresholdShedder to ignore any incorrectly scaled resource usage in the load balance computation.
  • Also updated LeastResourceUsageWithWeight.java (used for broker assignment) to ignore such invalid resource usage values.
  • Added more logs to the load balancing logic (ThresholdShedder and Load Report) for easier debugging.
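
For reviewers who want the idea in isolation, the fallback described above could look roughly like the sketch below. It is only an illustration under assumptions, not the code in this PR: the class UsageFilterSketch, its methods, the resource names, and the choice of "maximum weighted usage" as the load metric are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; this is NOT the actual ThresholdShedder code. */
final class UsageFilterSketch {

    /** Returns true when a usage percentage lies in the expected 0..100 range. */
    static boolean isValidPercent(double usagePercent) {
        return !Double.isNaN(usagePercent) && usagePercent >= 0.0 && usagePercent <= 100.0;
    }

    /**
     * Returns the highest weighted usage among the valid readings, as a fraction.
     * Readings outside 0..100 are logged and skipped so that one mis-scaled
     * resource cannot dominate the result. Missing weights default to 1.0.
     */
    static double maxWeightedUsage(Map<String, Double> usagePercents, Map<String, Double> weights) {
        double max = 0.0;
        for (Map.Entry<String, Double> entry : usagePercents.entrySet()) {
            double percent = entry.getValue();
            if (!isValidPercent(percent)) {
                // Assumed log format; the real log messages may differ.
                System.err.printf("Skipping mis-configured %s usage: %.1f%%%n",
                        entry.getKey(), percent);
                continue;
            }
            double weight = weights.getOrDefault(entry.getKey(), 1.0);
            max = Math.max(max, percent / 100.0 * weight);
        }
        return max;
    }

    public static void main(String[] args) {
        Map<String, Double> usage = new LinkedHashMap<>();
        usage.put("cpu", 250.0);        // mis-scaled reading, ignored
        usage.put("memory", 40.0);
        usage.put("bandwidthIn", 65.0);
        // With the cpu reading skipped, the result reflects bandwidthIn only.
        System.out.println(maxWeightedUsage(usage, Map.of()));
    }
}
```

Without the validity check, the 250% CPU reading in the example would make the broker look heavily overloaded regardless of its other resources. Skipping the reading, rather than clamping it to 100%, avoids treating a configuration error as a genuine overload signal.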

Verifying this change

  • [x] Make sure that the change passes the CI checks.

This change is already covered by existing tests, such as ThresholdShedderTest.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API: (yes / no)
  • The schema: (yes / no / don't know)
  • The default values of configurations: (yes / no)
  • The wire protocol: (yes / no)
  • The rest endpoints: (yes / no)
  • The admin cli options: (yes / no)
  • Anything that affects deployment: (yes / no / don't know)

Documentation

Check the box below or label this PR directly.

Need to update docs?

  • [ ] doc-required (Your PR needs to update docs and you will update later)

  • [x] doc-not-needed (Please explain why) This fix covers the load computation edge case from the Load Balancer.

  • [ ] doc (Your PR contains doc changes)

  • [ ] doc-complete (Docs have been already added)

heesung-sohn avatar Aug 03 '22 22:08 heesung-sohn

/pulsarbot run-failure-checks

codelipenghui avatar Aug 05 '22 06:08 codelipenghui

/pulsarbot run-failure-checks

heesung-sohn avatar Aug 10 '22 03:08 heesung-sohn

/pulsarbot rerun-failure-checks

heesung-sohn avatar Aug 10 '22 08:08 heesung-sohn

/pulsarbot rerun-failure-checks

heesung-sohn avatar Aug 10 '22 15:08 heesung-sohn

Hi @heesung-sn, it looks like we have many conflicts when cherry-picking. Could you please help push a PR to branch-2.9?

mattisonchao avatar Aug 25 '22 09:08 mattisonchao

Raised a PR: https://github.com/apache/pulsar/pull/17285

heesung-sohn avatar Aug 25 '22 19:08 heesung-sohn