[fix][load-balancer] skip mis-configured resource usage(>100%) in load balancer
Motivation
Incorrectly scaled resource usage (CPU, memory, or network usage greater than 100%) can skew the load computation in the load balancer, because the computation expects every resource usage to be normalized to a 0-100% scale.
We also need more logs to debug load-balancing issues in production, for example to investigate why the load balancer does not unload bundles to underloaded brokers.
Modifications
- Added fallback logic in ThresholdShedder that ignores any incorrectly scaled resource usage in the load-balancing computation.
- Updated LeastResourceUsageWithWeight.java (used for broker assignment) to ignore such invalid resource usages as well.
- Added more logs to the load-balancing logic (ThresholdShedder and Load Report) for easier debugging.
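The fallback described above can be illustrated with a minimal sketch. This is not the code from this PR: the class `ResourceUsageSketch` and the method `maxValidUsagePercent` are hypothetical names, and the sketch assumes usages are expressed as percentages, so anything outside 0-100 is treated as mis-scaled and skipped.

```java
// Hypothetical illustration only; not the actual ThresholdShedder or
// LeastResourceUsageWithWeight implementation.
final class ResourceUsageSketch {

    private ResourceUsageSketch() {
    }

    /**
     * Returns the largest usage among the given values, ignoring any value
     * that is NaN, negative, or greater than 100 (assumed to be mis-scaled).
     * Returns 0.0 when no valid value is present.
     */
    static double maxValidUsagePercent(double... usagesPercent) {
        double max = 0.0;
        for (double usage : usagesPercent) {
            if (Double.isNaN(usage) || usage < 0.0 || usage > 100.0) {
                // Skip the invalid reading instead of letting it dominate the result.
                continue;
            }
            max = Math.max(max, usage);
        }
        return max;
    }
}
```

For example, `ResourceUsageSketch.maxValidUsagePercent(35.0, 250.0, 60.0)` skips the 250% reading and evaluates to 60.0, so a single mis-reported resource no longer makes the broker look overloaded.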
Verifying this change
- [x] Make sure that the change passes the CI checks.
This change is already covered by existing tests, such as ThresholdShedderTest.
Does this pull request potentially affect one of the following parts:
If yes was chosen, please highlight the changes
- Dependencies (does it add or upgrade a dependency): (yes / no)
- The public API: (yes / no)
- The schema: (yes / no / don't know)
- The default values of configurations: (yes / no)
- The wire protocol: (yes / no)
- The rest endpoints: (yes / no)
- The admin cli options: (yes / no)
- Anything that affects deployment: (yes / no / don't know)
Documentation
Check the box below or label this PR directly.
Need to update docs?
- [ ] doc-required (Your PR needs to update docs and you will update later)
- [x] doc-not-needed (Please explain why) This fix covers the load computation edge case from the Load Balancer.
- [ ] doc (Your PR contains doc changes)
- [ ] doc-complete (Docs have been already added)
/pulsarbot run-failure-checks
/pulsarbot rerun-failure-checks
Hi @heesung-sn, it looks like we have many conflicts when cherry-picking. Could you please help push a PR to branch-2.9?
Raised a PR: https://github.com/apache/pulsar/pull/17285