FederatedHPA: What happens when a member cluster gets disconnected?

Open kapilagrawal95 opened this issue 1 year ago • 5 comments

Hi,

As the title says: when a member cluster in Karmada is disconnected, does FederatedHPA compute the new replica count by treating the pods on the disconnected cluster as missing? If so, how does FederatedHPA handle the missing metrics? I went through the code, and it seems that FederatedHPA uses the same logic as the Kubernetes HPA. Is this correct?

kapilagrawal95 • Feb 20 '24 18:02

cc @jwcesign

RainbowMango • Feb 21 '24 01:02

> does FederatedHPA compute new replicas by treating the pod on the disconnected cluster as missing?

Yes. This works like native K8s when the metrics of pods on a specific node are missing: FederatedHPA fills in data for the missing pods and then calculates the desired replicas again.

https://github.com/karmada-io/karmada/blob/9250219d2c98d875847383c1d2cd12d78b7bb26c/pkg/controllers/federatedhpa/replica_calculator.go#L202C1-L214C3

	if len(missingPods) > 0 {
		if usageRatio < 1.0 {
			// on a scale-down, treat missing pods as using 100% of the resource request
			for podName := range missingPods {
				metrics[podName] = metricsclient.PodMetric{Value: targetUsage}
			}
		} else if usageRatio > 1.0 {
			// on a scale-up, treat missing pods as using 0% of the resource request
			for podName := range missingPods {
				metrics[podName] = metricsclient.PodMetric{Value: 0}
			}
		}
	}

If the existing pods' usage is above the target (a scale-up), the missing metrics are filled with 0; if it is below the target (a scale-down), they are filled with the target value. This keeps scaling stable and avoids severe fluctuations caused by lost metrics.

The problem is that if too many pod metrics are missing (for example, over 90%, which can easily happen in a multi-cluster scenario), this round's calculation may be unreasonable and ultimately lead to abnormal scaling behavior. We may provide a way to address this defect in the future.
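
To make the effect concrete, here is a minimal, self-contained sketch of the fill-in step and the recalculation around it. It is not the Karmada implementation: it uses plain `float64` values instead of `metricsclient.PodMetric`, ignores unready pods, and assumes a 10% tolerance. The names `desiredReplicas`, `ratio`, `pods`, and `names` are hypothetical helpers, and the direction-flip check is modeled on my reading of the upstream HPA calculator.

```go
package main

import (
	"fmt"
	"math"
)

// tolerance is an assumption: 10%, the usual HPA default.
const tolerance = 0.1

// pods builds n hypothetical pod metrics named <prefix>-<i>, each reporting usage.
func pods(prefix string, n int, usage float64) map[string]float64 {
	m := make(map[string]float64, n)
	for i := 0; i < n; i++ {
		m[fmt.Sprintf("%s-%d", prefix, i)] = usage
	}
	return m
}

// names lists n hypothetical pods whose metrics could not be fetched.
func names(prefix string, n int) []string {
	out := make([]string, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, fmt.Sprintf("%s-%d", prefix, i))
	}
	return out
}

// ratio returns the average usage divided by the per-pod target.
func ratio(metrics map[string]float64, target float64) float64 {
	sum := 0.0
	for _, v := range metrics {
		sum += v
	}
	return sum / float64(len(metrics)) / target
}

// desiredReplicas is a simplified sketch of the calculation with missing pods.
func desiredReplicas(current int32, reported map[string]float64, missing []string, target float64) int32 {
	if len(reported) == 0 {
		return current // simplification: a real controller would report an error
	}
	usageRatio := ratio(reported, target)
	if len(missing) == 0 {
		if math.Abs(1.0-usageRatio) <= tolerance {
			return current
		}
		return int32(math.Ceil(usageRatio * float64(len(reported))))
	}

	// Copy the reported metrics, then fill in conservative values for missing pods.
	filled := make(map[string]float64, len(reported)+len(missing))
	for name, v := range reported {
		filled[name] = v
	}
	for _, name := range missing {
		if usageRatio < 1.0 {
			filled[name] = target // scale-down: assume missing pods sit at the target
		} else if usageRatio > 1.0 {
			filled[name] = 0 // scale-up: assume missing pods use nothing
		}
	}

	// Recompute; keep the current count if within tolerance or the direction flipped.
	newRatio := ratio(filled, target)
	flipped := (usageRatio < 1.0 && newRatio > 1.0) || (usageRatio > 1.0 && newRatio < 1.0)
	if math.Abs(1.0-newRatio) <= tolerance || flipped {
		return current
	}
	newReplicas := int32(math.Ceil(newRatio * float64(len(filled))))
	if (newRatio < 1.0 && newReplicas > current) || (newRatio > 1.0 && newReplicas < current) {
		return current
	}
	return newReplicas
}

func main() {
	const target = 100.0
	// All 10 pods report 3x the target.
	fmt.Println(desiredReplicas(10, pods("a", 10, 300), nil, target))
	// Half of the pods are unreachable.
	fmt.Println(desiredReplicas(10, pods("a", 5, 300), names("b", 5), target))
	// 90% of the pods are unreachable.
	fmt.Println(desiredReplicas(10, pods("a", 1, 300), names("b", 9), target))
}
```

With every pod reporting three times the target, this sketch asks for 30 replicas. With half of the metrics missing it asks for 15, and with 90% missing it stays at 10, even though the only reachable pod is heavily overloaded. The fill-in that stabilizes scaling when a few metrics are lost suppresses scaling almost entirely when most of them are.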

jwcesign • Feb 21 '24 01:02

Yes, that was my concern: the way HPA handles missing metrics can lead to unpredictable results in a multi-cluster setting and could cause cluster safety/liveness issues.

kapilagrawal95 • Feb 21 '24 02:02

Do you have any preliminary ideas about this issue?

We welcome any suggestions.

jwcesign • Feb 21 '24 02:02

This is part of an ongoing research project and I will get back to you as soon as we find an answer.

kapilagrawal95 • Feb 21 '24 02:02