FederatedHPA: What happens when a member cluster gets disconnected?
Hi,
As stated in the title: when a member cluster in Karmada is disconnected, does FederatedHPA compute new replicas by treating the pods on the disconnected cluster as missing? If so, how does FederatedHPA handle the missing metrics? I went through the code and it seems that FederatedHPA uses the same logic as the Kubernetes HPA. Is this correct?
cc @jwcesign
does FederatedHPA compute new replicas by treating the pod on the disconnected cluster as missing?
Yes. As in native Kubernetes, when the metrics of some pods are missing (for example, pods on a specific node), it fills in the missing pods' data and then calculates the desired replicas again.
https://github.com/karmada-io/karmada/blob/9250219d2c98d875847383c1d2cd12d78b7bb26c/pkg/controllers/federatedhpa/replica_calculator.go#L202C1-L214C3
if len(missingPods) > 0 {
	if usageRatio < 1.0 {
		// on a scale-down, treat missing pods as using 100% of the resource request
		for podName := range missingPods {
			metrics[podName] = metricsclient.PodMetric{Value: targetUsage}
		}
	} else if usageRatio > 1.0 {
		// on a scale-up, treat missing pods as using 0% of the resource request
		for podName := range missingPods {
			metrics[podName] = metricsclient.PodMetric{Value: 0}
		}
	}
}
If the existing pods' usage is above the target, the missing metrics are filled with 0; if it is below the target, they are filled with the target value. This keeps scaling stable and avoids severe fluctuations caused by lost metrics.
But there is a problem: if too many pod metrics are missing (for example, over 90%, which can easily happen in a multi-cluster scenario), the calculation for that round may be unreasonable and ultimately lead to abnormal behavior. We may provide a way to address this defect in the future.
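To make the effect concrete, here is a rough, self-contained Go sketch of the fill-in rule for a plain per-pod metric. It is not the Karmada code: `desiredReplicas`, `usageRatio`, and `repeat` are hypothetical names, unready pods are ignored, the current replica count is assumed to equal reported plus missing pods, and the tolerance of 0.1 is assumed to match the HPA default.

```go
package main

import (
	"fmt"
	"math"
)

// tolerance is an assumption for this sketch (the HPA default is 0.1).
const tolerance = 0.1

// usageRatio returns average usage divided by the per-pod target.
func usageRatio(values []int64, target int64) float64 {
	var sum int64
	for _, v := range values {
		sum += v
	}
	return float64(sum) / (float64(target) * float64(len(values)))
}

// desiredReplicas is a simplified, hypothetical model of the calculation:
// reported holds the metric of each pod that answered, missing is the
// number of pods without metrics (e.g. on a disconnected member cluster).
func desiredReplicas(reported []int64, missing int, target int64) int {
	current := len(reported) + missing
	if len(reported) == 0 {
		return current // nothing to base a decision on in this sketch
	}
	ratio := usageRatio(reported, target)
	if missing == 0 {
		if math.Abs(1.0-ratio) <= tolerance {
			return current
		}
		return int(math.Ceil(ratio * float64(len(reported))))
	}

	// Fill in missing pods conservatively, as in the quoted snippet.
	filled := append([]int64{}, reported...)
	for i := 0; i < missing; i++ {
		if ratio < 1.0 {
			filled = append(filled, target) // scale-down: assume at target
		} else if ratio > 1.0 {
			filled = append(filled, 0) // scale-up: assume zero usage
		}
	}

	newRatio := usageRatio(filled, target)
	// Do nothing if within tolerance or if the fill-in flipped the direction.
	if math.Abs(1.0-newRatio) <= tolerance ||
		(ratio < 1.0 && newRatio > 1.0) || (ratio > 1.0 && newRatio < 1.0) {
		return current
	}
	newReplicas := int(math.Ceil(newRatio * float64(len(filled))))
	// Never scale against the direction the filled-in ratio indicates.
	if (newRatio < 1.0 && newReplicas > current) || (newRatio > 1.0 && newReplicas < current) {
		return current
	}
	return newReplicas
}

// repeat builds a slice of n identical metric values.
func repeat(v int64, n int) []int64 {
	out := make([]int64, n)
	for i := range out {
		out[i] = v
	}
	return out
}

func main() {
	// 10 pods, target 100 per pod, every reporting pod is at 250.
	fmt.Println(desiredReplicas(repeat(250, 10), 0, 100)) // all metrics present
	fmt.Println(desiredReplicas(repeat(250, 6), 4, 100))  // 40% missing
	fmt.Println(desiredReplicas(repeat(250, 1), 9, 100))  // 90% missing
}
```

Under these assumptions the three calls give 25, 15, and 10 replicas: the more metrics are missing, the more the overload is damped, and at 90% missing the workload is not scaled at all even though every reporting pod is at 2.5x the target.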
Yes, that was my concern: the way HPA handles missing metrics in a multi-cluster setting can lead to unpredictable results and potentially to cluster safety/liveness issues.
Do you have any preliminary ideas about this issue?
We welcome any suggestions.
This is part of an ongoing research project and I will get back to you as soon as we find an answer.