
Help wanted: Consistent rebalance proposal

Open RossierFl opened this issue 2 years ago • 3 comments

We currently use Cruise Control to "manually" rebalance our clusters from time to time. I'm saying manually because we don't use the self-healing capabilities and other features, so we simply go to the cruise-control UI and start a rebalance from there. We actually want to increase the usage of Cruise Control and potentially enable the automatic features.

By default, we use this set of goals, in this order:

  • com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareDistributionGoal
  • com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal
  • com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal
  • com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal
  • com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal
  • com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal
  • com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderBytesInDistributionGoal
  • com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundUsageDistributionGoal
  • com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundUsageDistributionGoal
  • com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderReplicaDistributionGoal
  • com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal
  • com.linkedin.kafka.cruisecontrol.analyzer.goals.TopicReplicaDistributionGoal
  • com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal
  • com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuUsageDistributionGoal
  • com.linkedin.kafka.cruisecontrol.analyzer.goals.PotentialNwOutGoal

Besides this list and some fine-tuning of the executor configuration (to slow it down a bit), the Cruise Control configuration is left at its defaults.

Three days ago I rebalanced our cluster. The execution finished correctly, and the following was done in the process:

  • ~5 TB of data moved
  • ~550 replica movements
  • around 50 leader changes (the number was much higher when the executor reached the leader change phase, if I recall correctly)

Today, the proposal suggests a rebalance quite similar to the one we ran a couple of days ago:

  • again ~5 TB of data to move
  • 570 replica movements
  • 47 leader changes

I didn't actually start it, but the summary provides these numbers:

  • onDemandBalancednessScoreBefore: 69.2199911711607
  • onDemandBalancednessScoreAfter: 94.34779197073794
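To put the two proposals side by side, here is a small Python sketch that uses only the figures quoted above. The dictionary keys and the `relative_diff` helper are made up for illustration and are not part of Cruise Control's API; the "~" values are treated as exact for the sake of the comparison.

```python
# Figures copied from this post; the field names are illustrative,
# not Cruise Control's own.
executed = {"data_tb": 5.0, "replica_moves": 550, "leader_changes": 50}
proposed = {"data_tb": 5.0, "replica_moves": 570, "leader_changes": 47}

score_before = 69.2199911711607
score_after = 94.34779197073794


def relative_diff(a: float, b: float) -> float:
    """Relative difference between two values, as a fraction of the larger one."""
    return abs(a - b) / max(a, b)


score_gain = score_after - score_before
print(f"balancedness gain promised by the new proposal: {score_gain:.2f} points")
for key in executed:
    print(f"{key}: {relative_diff(executed[key], proposed[key]):.1%} apart")
```

Every metric of the new proposal differs from the executed one by well under 10%, which is what makes the second proposal look like a repeat of the first.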

The cluster load didn't change in between, and neither did the load on existing topics. Perhaps fewer than 10 new, really small topics were created; nothing unusual happened during this timeframe, and the cluster was stable throughout. The monitor state says that almost 100% of the partitions are monitored: {"MonitorState":{"trainingPct":20.0,"trained":false,"numFlawedPartitions":0,"state":"RUNNING","numTotalPartitions":1316,"numMonitoredWindows":5,"monitoringCoveragePct":99.77203607559204,"reasonOfLatestPauseOrResume":"N/A","numValidPartitions":1312},"version":1} If I understood correctly from other issues opened here, the trainingPct and trained properties are not really used and are not taken into account.
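For anyone who wants to check the same fields on their own cluster, here is a minimal Python sketch that parses the monitor state pasted above. The valid/total ratio it computes is a derived figure added for illustration; it is an assumption that this ratio is meaningful on its own, and it is not claimed to be how Cruise Control derives monitoringCoveragePct (the two numbers differ slightly).

```python
import json

# Monitor state exactly as pasted in this post.
raw = (
    '{"MonitorState":{"trainingPct":20.0,"trained":false,'
    '"numFlawedPartitions":0,"state":"RUNNING","numTotalPartitions":1316,'
    '"numMonitoredWindows":5,"monitoringCoveragePct":99.77203607559204,'
    '"reasonOfLatestPauseOrResume":"N/A","numValidPartitions":1312},"version":1}'
)

monitor = json.loads(raw)["MonitorState"]

total = monitor["numTotalPartitions"]
valid = monitor["numValidPartitions"]
# Illustrative ratio only; not necessarily Cruise Control's own coverage formula.
valid_pct = 100.0 * valid / total

print(f"state: {monitor['state']}")
print(f"valid partitions: {valid}/{total} ({valid_pct:.2f}%)")
print(f"reported coverage: {monitor['monitoringCoveragePct']:.2f}%")
print(f"monitored windows: {monitor['numMonitoredWindows']}")
```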

Now the question is: is this expected, or did I miss something in the configuration that may cause this behaviour? I would expect that after the first rebalance, the proposal would be either a really small rebalance or nothing at all. Is there a way to get "consistent" rebalance proposals when the load hasn't changed? Do you have a rule of thumb for deciding when a rebalance should be triggered (either manually or automatically)?

I first thought that it could be related to the load of the cluster, but we have 2 other clusters and the behaviour seems to be identical, even though the load model is totally different on those two clusters.

RossierFl avatar Jan 27 '23 08:01 RossierFl

@RossierFl I came across similar behaviour. Have you found a solution?

adyach avatar Jun 01 '23 13:06 adyach

@efeg maybe you could clarify that as well

adyach avatar Jun 02 '23 10:06 adyach

@adyach

No, we still haven't found a solution for that. In all honesty, we also haven't had time to dig deep into it. For the moment, we still trigger the rebalance manually when we feel there is a need for it.

But more frequent, smaller, and consistent rebalances would be much preferable for us. If there is a solution for that, I would be all in to try it.

RossierFl avatar Jun 06 '23 08:06 RossierFl