
Use non-VPA/HPA autoscaling for metrics-server

Open amshuman-kr opened this issue 4 years ago • 24 comments

How to categorize this issue?

/area control-plane auto-scaling /kind enhancement /priority 2

What would you like to be added:

Use non-VPA/HPA autoscaling for metrics-server

Why is this needed:

We are using VPA to scale metrics-server. But VPA relies on the metrics API to make its recommendations. If for any reason metrics-server goes into a CrashLoopBackOff, the metrics API also goes down and VPA is powerless to scale up the metrics-server.

We should avoid this circular dependency by using something like the addon-resizer or the cluster-proportional-autoscaler, both of which scale based on the number of nodes rather than on resource usage metrics.

amshuman-kr avatar May 10 '21 14:05 amshuman-kr

/assign @harishmanasa

amshuman-kr avatar May 10 '21 17:05 amshuman-kr

Nice idea, I was thinking about moving the metrics server next to the control plane in the seed cluster. It can have a VPN sidecar like the prometheus/kube-apiserver pods and talk to the shoot network. This way we can use the well-known VPA/H(V)PA for the shoot's metrics server, as the circular dependency no longer exists. Could you also evaluate this approach?

A known limitation is that the APIService does not support a URL/ExternalName service, but a ClusterIP service without a pod selector can be used, with the address controlled manually via an Endpoints resource. An additional controller to sync the cluster IP of the in-seed service to the in-shoot Endpoints would be required, but this can be skipped in the evaluation phase and implemented only if we decide to go this way.
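The selector-less Service plus manually managed Endpoints pattern described above can be sketched as follows (names and the IP address are illustrative, not taken from an actual Gardener setup):

```yaml
# In-shoot Service without a pod selector; Kubernetes will not manage
# its Endpoints automatically, so they can be set by hand/controller.
apiVersion: v1
kind: Service
metadata:
  name: metrics-server
  namespace: kube-system
spec:
  ports:
  - port: 443
    targetPort: 443
---
# Manually managed Endpoints with the same name as the Service,
# pointing at the in-seed metrics-server (address is an example;
# a small controller would keep it in sync).
apiVersion: v1
kind: Endpoints
metadata:
  name: metrics-server
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.243.0.10
  ports:
  - port: 443
```

Because the Endpoints object shares the Service's name, traffic to the Service's cluster IP is forwarded to the listed address, which is what would let the in-shoot APIService reach an in-seed metrics-server.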

vpnachev avatar May 10 '21 20:05 vpnachev

By the way, metrics server cannot benefit from horizontal scaling, as each replica always scrapes the resource usage of all pods and nodes; see https://github.com/kubernetes-sigs/metrics-server/issues/552 for more details.

vpnachev avatar May 10 '21 20:05 vpnachev

By the way, metrics server cannot benefit from horizontal scaling, as each replica always scrapes the resource usage of all pods and nodes; see kubernetes-sigs/metrics-server#552 for more details.

Thanks! This is good information.

My inclination was towards addon-resizer, which does vertical scaling. When @harishmanasa tried it out, it worked well, and the delta resources per node can be configured as well. But she found one limitation: addon-resizer always sets requests and limits to the same value, regardless of the initial requests and limits configuration.

Because of this limitation, we started evaluating cluster-proportional-autoscaler. But based on the information you gave, that seems like a dead-end.

We could still live with the equal requests and limits of addon-resizer, but then we have to configure the delta resources in such a way that they are enough for the metrics-server. BTW, addon-resizer also has an acceptance and recommendation range concept that provides some buffer for the scaled target.

I was thinking about moving the metrics server next to the control plane in the seed cluster. It can have a VPN sidecar like the prometheus/kube-apiserver pods and talk to the shoot network. This way we can use the well-known VPA/H(V)PA for the shoot's metrics server, as the circular dependency no longer exists. Could you also evaluate this approach?

Though workable, this is a larger change and would be hard to take up as part of the current evaluation. To be clear, I am not against this approach. So if someone else can evaluate it and it turns out to be better, we could go with it. I was just going for a quick solution.

amshuman-kr avatar May 11 '21 06:05 amshuman-kr

Nice idea, I was thinking about moving the metrics server next to the control plane in the seed cluster.

One thing to keep in mind with this approach is that the metrics-server constantly talks to all kubelets. By moving it to the Seed, we would cause some constant network traffic, but I'm not sure how much / how problematic it would be...

timebertt avatar May 12 '21 09:05 timebertt

One thing to keep in mind with this approach is that the metrics-server constantly talks to all kubelets. By moving it to the Seed, we would cause some constant network traffic, but I'm not sure how much / how problematic it would be...

Considering the complexity of moving metrics-server (including the VPN and connectivity issues), in an out-of-band discussion (cross-skipper meeting) it was decided to try the addon-resizer approach. @harishmanasa has started work on this.

amshuman-kr avatar May 12 '21 09:05 amshuman-kr

Agreed that you should not use VPA to autoscale MS, as this creates a circular dependency. The Kubernetes main repo and GKE have used addon-resizer from the very beginning. Using addon-resizer has some limitations: you need to commit to certain scaling dimensions, like max pods per node. For example, the manifests I linked support only 30 pods per node.

As for the official MS distribution, I have started looking into integrating the same solution as an alternative manifest. Reaching full autoscaling automation is in our future plans, see https://github.com/kubernetes-sigs/metrics-server/issues/627 (scalability section). But it still requires some work to make the scale-up process non-disruptive. That's why we want to work on scalability testing, probing, and SLOs first.

The official repo already has an alternative configuration with addon-resizer enabled in https://github.com/kubernetes-sigs/metrics-server/blob/master/manifests/autoscale/patch.yaml. It can be installed using Kustomize: `kubectl apply -k ./manifests/autoscale`.

If you are planning to work on improving MS scaling with addon-resizer it would be great to collaborate and have those improvements also available as part of official release.

serathius avatar May 14 '21 14:05 serathius

Nice idea, I was thinking about moving the metrics server next to the control plane in the seed cluster.

One thing to keep in mind with this approach is that the metrics-server constantly talks to all kubelets. By moving it to the Seed, we would cause some constant network traffic, but I'm not sure how much / how problematic it would be...

This is true, but Prometheus is doing the same. I guess in the worst case it would double the network traffic. But as said, for now this approach will not be evaluated.

vpnachev avatar May 14 '21 14:05 vpnachev

Thanks a lot for the suggestions and for reaching out @serathius! The confirmation about the approach of using addon-resizer for MS is reassuring :-)

The official repo already has an alternative configuration with addon-resizer enabled in https://github.com/kubernetes-sigs/metrics-server/blob/master/manifests/autoscale/patch.yaml. It can be installed using Kustomize: `kubectl apply -k ./manifests/autoscale`.

We were trying to come up with the configuration based on addon-resizer documentation and the MS documentation that @vpnachev shared above. This alternative configuration helps as a concrete base.

If you are planning to work on improving MS scaling with addon-resizer it would be great to collaborate and have those improvements also available as part of official release.

We are still making the changes. We will observe this kind of scaling of MS in the landscapes and if we need to make any changes, we will reach out.

amshuman-kr avatar May 17 '21 04:05 amshuman-kr

While testing the changes in gardener to use the addon-resizer, we found that there is a race between the gardener-resource-manager and the addon-resizer in updating the metrics-server. The metrics-server keeps flapping because of this.

The gardener-resource-manager already considers targets scaled by HVPA when preserving resources. But the addon-resizer would just be a sidecar container in the pod template spec of the target being scaled, which is probably not a good signal for the gardener-resource-manager to check when deciding whether to preserve resources.

Would an annotation-based contract be more suitable? I.e., in this case, the metrics-server Deployment is annotated with gardener-resource-manager.gardener.cloud/preserveResources: 'true', and gardener-resource-manager looks at this (perhaps in addition to the current HVPA check) to preserve resources while reconciling.
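The proposed contract would look roughly like this on the metrics-server Deployment (a sketch of the proposal in this thread, not an implemented Gardener API):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
  annotations:
    # Proposed marker telling gardener-resource-manager not to revert
    # resource requests/limits set by an external autoscaler such as
    # the addon-resizer sidecar.
    gardener-resource-manager.gardener.cloud/preserveResources: "true"
```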

WDYT?

cc @timebertt @vpnachev @rfranzke

amshuman-kr avatar May 27 '21 11:05 amshuman-kr

the metrics-server Deployment is annotated gardener-resource-manager.gardener.cloud/preserveResources: 'true' and gardener-resource-manager looks at this (perhaps, in addition to the current HVPA check) to preserve resources while reconciling.

Yes, sounds good to me. I don't have any better idea for this currently.

timebertt avatar May 28 '21 06:05 timebertt

@ialidzhikov Why did you reopen this issue after it was closed via #4216?

rfranzke avatar Jan 14 '22 14:01 rfranzke

Because #4216 was reverted with https://github.com/gardener/gardener/pull/4405.

ialidzhikov avatar Jan 14 '22 14:01 ialidzhikov

Thanks!

rfranzke avatar Jan 14 '22 14:01 rfranzke

The agreement back then was to introduce the feature behind a feature gate to give more control to Gardener operators. With #4216 we saw that it actually needed a lot of fixes/improvements after merging the PR (a few examples: https://github.com/gardener/gardener/pull/4386, https://github.com/gardener/gardener/pull/4402). IMO it would definitely make sense to have this feature behind a feature gate and roll it out gradually.

ialidzhikov avatar Jan 14 '22 14:01 ialidzhikov

https://github.com/gardener/gardener/pull/4402 looks strange; 14MiB per node seems absurd. Could you maybe describe the scalability dimensions that you are targeting?

For example MS targets https://github.com/kubernetes-sigs/metrics-server#scaling:

  • 100 nodes
  • 70 pods per node
  • 100 Deployments
  • ~7 HPA QPS

With such pod density and QPS we need:

  • 1mCore CPU per node
  • 2MiB Memory per node

serathius avatar Jan 14 '22 14:01 serathius

The Gardener project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close

/lifecycle stale

gardener-ci-robot avatar Apr 14 '22 17:04 gardener-ci-robot

The Gardener project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close

/lifecycle rotten

gardener-ci-robot avatar May 14 '22 17:05 gardener-ci-robot

The Gardener project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten

/close

gardener-ci-robot avatar Jun 13 '22 17:06 gardener-ci-robot

@gardener-ci-robot: Closing this issue.

In response to this:

The Gardener project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

gardener-prow[bot] avatar Jun 13 '22 17:06 gardener-prow[bot]

/reopen

ialidzhikov avatar Jul 20 '22 15:07 ialidzhikov

@ialidzhikov: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

gardener-prow[bot] avatar Jul 20 '22 15:07 gardener-prow[bot]

/assign

dimitar-kostadinov avatar Jul 25 '22 12:07 dimitar-kostadinov

Currently, the recommended version of addon-resizer is 1.8.15, and it scales container resource requests and limits up and down. The values are calculated based on the number of Nodes; the recommended metrics-server configuration in Kubernetes is:

  • requests & limits for CPU = 40m + 0.5m per node
  • requests & limits for Memory = 40Mi + 4Mi per node
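Expressed as addon-resizer (pod_nanny) sidecar flags, that formula would look roughly like this (a sketch; the image tag and container/deployment names follow the usual upstream metrics-server setup and are illustrative):

```yaml
# addon-resizer sidecar implementing the per-node formula above (sketch).
- name: metrics-server-nanny
  image: registry.k8s.io/autoscaling/addon-resizer:1.8.15
  command:
  - /pod_nanny
  - --container=metrics-server
  - --deployment=metrics-server
  - --cpu=40m          # base CPU requests & limits
  - --extra-cpu=0.5m   # additional CPU per node
  - --memory=40Mi      # base memory requests & limits
  - --extra-memory=4Mi # additional memory per node
```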

As the memory usage of metrics-server depends on the number of containers, the addon-resizer may under- or overestimate requests and limits and potentially cause OOM kills or redundant memory reservations. In existing clusters from our landscapes the following can be observed:

| Number of nodes | Number of containers | Memory usage | Addon-resizer memory target |
|----------------:|---------------------:|-------------:|----------------------------:|
| 385             | 12585                | 398Mi        | 1580Mi                      |
| 334             | 16761                | 493Mi        | 1376Mi                      |
| 270             | 10002                | 310Mi        | 1120Mi                      |
| 2               | 124                  | 51Mi         | 48Mi                        |
| 3               | 83                   | 56Mi         | 52Mi                        |

On the other hand, when the metrics-server is under VPA and only the requests values are controlled, with no CPU limit and a fixed memory limit of 1Gi, the recommended memory target is much more precise, avoiding the risk of under-/overestimation. The cyclic dependency problem described in this issue is not very likely to occur, because there is no CPU limit and the memory limit is set to 1Gi. In the past, the metrics-server VPA was running with controlledValues: RequestsAndLimits, which means that it was also downscaling the limits. Since https://github.com/gardener/gardener/pull/5638 we run metrics-server with controlledValues: RequestsOnly, which means that VPA won't modify the limits.

In short, with addon-resizer it is not easy to find a configuration that would perfectly match each and every cluster size: the metrics-server resource usage depends on the number of containers rather than the number of Nodes. That's why, for now, we tend to keep the current scaling mechanism with VPA.
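The setup described above corresponds to a VPA object roughly like the following (a sketch with illustrative names; only the controlledValues field is confirmed by this thread):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: metrics-server
  namespace: kube-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: metrics-server
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
    - containerName: metrics-server
      # Only scale the requests; the fixed 1Gi memory limit and the
      # absence of a CPU limit stay untouched.
      controlledValues: RequestsOnly
```

With RequestsOnly, even a mispredicted request cannot tighten the limits, which is what keeps the circular-dependency risk low in practice.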

Feel free to comment if you have suggestions/objections.

dimitar-kostadinov avatar Aug 31 '22 12:08 dimitar-kostadinov

/close as the reasoning is provided in https://github.com/gardener/gardener/issues/4018#issuecomment-1232887290

ialidzhikov avatar Oct 04 '22 06:10 ialidzhikov

@ialidzhikov: Closing this issue.

In response to this:

/close as the reasoning is provided in https://github.com/gardener/gardener/issues/4018#issuecomment-1232887290

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

gardener-prow[bot] avatar Oct 04 '22 06:10 gardener-prow[bot]