cruise-control Help wanted : Consistent rebalance proposal

We currently use Cruise Control to "manually" rebalance our clusters from time to time. I say "manually" because we don't use self-healing or the other automatic features; we simply go to the Cruise Control UI and start a rebalance from there. We would like to use Cruise Control more and potentially enable the automatic features.

By default, we have this set of goals, in this order:

com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareDistributionGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderBytesInDistributionGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundUsageDistributionGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundUsageDistributionGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderReplicaDistributionGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.TopicReplicaDistributionGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuUsageDistributionGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.PotentialNwOutGoal

Besides this list and some fine-tuning of the executor configuration (to slow it down a bit), the Cruise Control configuration is left at its defaults.

3 days ago, I rebalanced our cluster. The execution finished correctly and did the following in the process:

~5 TB of data moved
~550 replica movements
around 50 leader changes (the number was much higher when the executor reached the leader change phase, if I recall correctly)

Today, Cruise Control is proposing another rebalance, relatively similar to the one we did a couple of days ago:

again ~5 TB of data moved
570 replica movements
47 leader changes

I didn't actually start it, but the summary provides these numbers:

onDemandBalancednessScoreBefore: 69.2199911711607
onDemandBalancednessScoreAfter: 94.34779197073794

The cluster load didn't change in between, and neither did the load on existing topics. At most, perhaps fewer than 10 new, really small topics were created; nothing unusual happened during this timeframe, and the cluster was stable throughout. The monitor state says that almost 100% of the partitions are monitored:

{"MonitorState":{"trainingPct":20.0,"trained":false,"numFlawedPartitions":0,"state":"RUNNING","numTotalPartitions":1316,"numMonitoredWindows":5,"monitoringCoveragePct":99.77203607559204,"reasonOfLatestPauseOrResume":"N/A","numValidPartitions":1312},"version":1}

If I understood correctly from other issues opened here, the trainingPct and trained properties are not really used, so those values can be ignored.
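For anyone who wants to check the same fields on their own cluster, here is a minimal sketch that parses the monitor state quoted above and pulls out the values relevant to proposal quality. It assumes the JSON has exactly the shape pasted in this post; the helper name `summarize_monitor_state` is hypothetical and not part of Cruise Control.

```python
import json

# Monitor state exactly as pasted in this post.
MONITOR_STATE_JSON = (
    '{"MonitorState":{"trainingPct":20.0,"trained":false,'
    '"numFlawedPartitions":0,"state":"RUNNING","numTotalPartitions":1316,'
    '"numMonitoredWindows":5,"monitoringCoveragePct":99.77203607559204,'
    '"reasonOfLatestPauseOrResume":"N/A","numValidPartitions":1312},"version":1}'
)


def summarize_monitor_state(raw: str) -> dict:
    """Hypothetical helper: extract the fields discussed in this post."""
    state = json.loads(raw)["MonitorState"]
    return {
        "state": state["state"],
        "coverage_pct": state["monitoringCoveragePct"],
        # Partitions counted in the total but not reported as valid.
        "invalid_partitions": state["numTotalPartitions"] - state["numValidPartitions"],
        "monitored_windows": state["numMonitoredWindows"],
    }


if __name__ == "__main__":
    print(summarize_monitor_state(MONITOR_STATE_JSON))
```

With the values above, only 4 of the 1316 partitions are not reported as valid, so missing metrics seem an unlikely explanation for the large proposal.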

Now the question is: is this expected, or did I miss something in the configuration that may cause this behaviour? I would expect that, after the first rebalance, the proposal would be either a very small rebalance or nothing at all. Is there a way to get "consistent" rebalance proposals when the load hasn't changed? Do you have a rule of thumb for deciding when a rebalance should be triggered (either manually or automatically)?

I first thought that it could be related to the load of the cluster, but we have 2 other clusters and the behaviour seems to be identical, even though the load model is totally different on those two clusters.
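To make the rule-of-thumb question concrete, here is a minimal sketch of one possible trigger: start a rebalance only when the balancedness score gain reported in the proposal summary exceeds a threshold. The function names and the 10-point threshold are hypothetical illustrations, not Cruise Control settings or a recommendation from its maintainers; the scores are the ones from the summary above.

```python
def balancedness_gain(score_before: float, score_after: float) -> float:
    """Points of balancedness score the proposal claims to gain."""
    return score_after - score_before


def worth_rebalancing(score_before: float, score_after: float,
                      min_gain: float = 10.0) -> bool:
    """Hypothetical rule: rebalance only if the gain reaches min_gain points.

    min_gain is an arbitrary placeholder and would need tuning per cluster.
    """
    return balancedness_gain(score_before, score_after) >= min_gain


if __name__ == "__main__":
    # Scores from the proposal summary quoted in this post.
    before = 69.2199911711607
    after = 94.34779197073794
    print(f"gain: {balancedness_gain(before, after):.2f} points")
    print("rebalance" if worth_rebalancing(before, after) else "skip")
```

A rule like this would still trigger here, since the proposal claims a gain of about 25 points only days after a full rebalance. That is exactly the inconsistency I am asking about.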

Jan 27 '23 08:01 RossierFl

@RossierFl I came across similar behaviour. Have you found any solution?

Jun 01 '23 13:06 adyach

@efeg maybe you could clarify that as well

Jun 02 '23 10:06 adyach

@adyach

No, we still haven't found a solution for that. In all honesty, we also haven't had time to dig deep enough to find one. For the moment, we still trigger it manually when we feel there is a need for it.

But more frequent, small and consistent rebalances would be really preferable for us. If there is a solution for that, I would be all in to try it.

Jun 06 '23 08:06 RossierFl
