
Make VPA maxAllowed configurable in the Seed and Garden resource specs

ialidzhikov opened this pull request 1 year ago • 17 comments

How to categorize this PR?

/area auto-scaling /kind enhancement

What this PR does / why we need it: This PR is a rework of https://github.com/gardener/gardener/pull/10208.

For components running in the Seed cluster, we would like to prevent VPA from recommending more than the largest Node's allocatable resources. This is a known limitation of VPA.

This PR is a first step towards the final goal of having all VPAs covered. It currently covers the components at highest risk: etcd, the API servers, and prometheus.
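Based on the release note below, the new setting could look roughly like the following in a Seed manifest. This is an illustrative sketch only: the Seed name and resource values are made up, and the exact field placement may differ from the final API.

```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Seed
metadata:
  name: my-seed  # hypothetical Seed name
spec:
  settings:
    verticalPodAutoscaler:
      enabled: true
      # Cap VPA recommendations at the largest Node's allocatable
      # resources (minus DaemonSet Pod requests); values are examples.
      maxAllowed:
        cpu: "8"
        memory: 32Gi
```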

As a next step, we would like to evaluate different approaches for covering all VPAs:

  • Leverage a webhook in the Seed that mutates maxAllowed of all VPAs
  • vpa-recommender: make the maximum allowed recommendation configurable
  • Raise an upstream feature request again and implement upstream what GKE has: vpa-updater and vpa-admission-controller talking to cluster-autoscaler via its ProvisioningRequest API.

Which issue(s) this PR fixes: N/A

Special notes for your reviewer: N/A

Release note:

The VerticalPodAutoscaler's maximum allowed recommendation is now configurable in the Seed and Garden resources. The reasoning for this feature is a [known limitation](https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/README.md#known-limitations) where VPA is unaware of the largest Node's allocatable resources and can wrongly recommend resource requests which make the Pod unschedulable. To mitigate this issue, the Seed and Garden resources now support configuring VPA maxAllowed. Currently the effect of this field is limited to the components at highest risk, such as etcd, kube-apiserver, and prometheus. In the future, we would like all VPAs in the Seed and Garden runtime clusters to have a configurable maxAllowed, which we recommend setting to the largest cluster Node's allocatable resources (minus the DaemonSet Pods' resource requests).
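The sizing recommendation in the last sentence (largest Node's allocatable minus DaemonSet requests) can be sketched as simple arithmetic. This is not Gardener code; the function name and all numbers are hypothetical, and real code would parse Kubernetes resource quantities instead of using plain numbers.

```python
# Illustrative sketch: derive a sensible maxAllowed value as the largest
# Node's allocatable resources minus the summed resource requests of
# DaemonSet Pods that run on every Node.

def derive_max_allowed(node_allocatable, daemonset_requests):
    """Subtract summed DaemonSet requests from the Node's allocatable.

    Values are plain numbers (CPU in millicores, memory in MiB) to keep
    the sketch simple.
    """
    total_ds = {}
    for req in daemonset_requests:
        for resource, amount in req.items():
            total_ds[resource] = total_ds.get(resource, 0) + amount
    return {
        resource: allocatable - total_ds.get(resource, 0)
        for resource, allocatable in node_allocatable.items()
    }

# Largest Node in the Seed: 8 CPU cores, 32 GiB memory.
allocatable = {"cpu": 8000, "memory": 32768}
# DaemonSet Pods present on every Node (e.g. a node agent and a CNI agent).
ds_requests = [{"cpu": 100, "memory": 256}, {"cpu": 250, "memory": 512}]

print(derive_max_allowed(allocatable, ds_requests))
# → {'cpu': 7650, 'memory': 32000}
```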

ialidzhikov avatar Aug 28 '24 15:08 ialidzhikov

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from ialidzhikov. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

gardener-prow[bot] avatar Aug 28 '24 15:08 gardener-prow[bot]

pull-gardener-unit fails with:

2024/08/28 15:19:25 Failed executing local generator: some packages had errors:
type github.com/gardener/gardener/pkg/apis/core/v1beta1.NginxIngress cannot be converted to protobuf: unable to embed type "*k8s.io/api/core/v1.ServiceExternalTrafficPolicyType" as field "externalTrafficPolicy" in "github.com/gardener/gardener/pkg/apis/core/v1beta1.NginxIngress": did not recognize the provided type
make[1]: *** [/home/prow/go/src/github.com/gardener/gardener/Makefile:198: generate] Error 1
make[1]: Leaving directory '/home/prow/go/src/github.com/gardener/gardener'
make: *** [Makefile:142: check-generate] Error 1

Locally make check-generate works for me:

% make check-generate
> Generate
>> make generate
>> make tidy

🤔

ialidzhikov avatar Aug 29 '24 06:08 ialidzhikov

/test pull-gardener-unit

ialidzhikov avatar Aug 29 '24 06:08 ialidzhikov

Ok, it is not failing only for this PR. Same failure in https://github.com/gardener/gardener/pull/10400#issuecomment-2316859765

@oliver-goetz can it be related to go1.23 or any other recent change?

ialidzhikov avatar Aug 29 '24 07:08 ialidzhikov

cc @voelzmo @plkokanov @vlerenc

ialidzhikov avatar Aug 29 '24 07:08 ialidzhikov

/assign

plkokanov avatar Aug 29 '24 07:08 plkokanov

Ok, it is not failing only for this PR. Same failure in #10400 (comment)

@oliver-goetz can it be related to go1.23 or any other recent change?

It seems to be related to a new version of Protocol Buffers v27.4. The maintainers recreated the release yesterday (see https://github.com/gardener/gardener/pull/10416).

oliver-goetz avatar Aug 29 '24 09:08 oliver-goetz

/test all

ialidzhikov avatar Aug 29 '24 12:08 ialidzhikov

{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:264","func":"sigs.k8s.io/prow/pkg/entrypoint.gracefullyTerminate","level":"error","msg":"Process gracefully exited before 10m0s grace period","severity":"error","time":"2024-08-29T13:33:28Z"}

ialidzhikov avatar Aug 29 '24 14:08 ialidzhikov

I rebased to latest master.

ialidzhikov avatar Aug 29 '24 14:08 ialidzhikov

/assign

rfranzke avatar Aug 30 '24 06:08 rfranzke

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

gardener-prow[bot] avatar Sep 04 '24 14:09 gardener-prow[bot]

The Gardener project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:

  • After 15d of inactivity, lifecycle/stale is applied
  • After 15d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 7d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Mark this PR as rotten with /lifecycle rotten
  • Close this PR with /close

/lifecycle stale

gardener-ci-robot avatar Sep 19 '24 15:09 gardener-ci-robot

/remove-lifecycle stale

ialidzhikov avatar Sep 24 '24 17:09 ialidzhikov

The Gardener project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:

  • After 15d of inactivity, lifecycle/stale is applied
  • After 15d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 7d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Mark this PR as rotten with /lifecycle rotten
  • Close this PR with /close

/lifecycle stale

gardener-ci-robot avatar Oct 09 '24 17:10 gardener-ci-robot

I am wondering about the approach of "hard-coding" this into each and every component. Currently, it's only done for etcd, kube-apiserver, and prometheus, but I guess we would like to have all components running in the seed configured with this maxAllowed setting, right? This is at least what the TODO statement suggests. In this light, wouldn't it make more sense to introduce a webhook in gardener-resource-manager that handles this across the components in the seed? This way, we could easily handle components deployed via extensions as well.

The implementation plan was already prepared. In this PR I simply reworked/reopened https://github.com/gardener/gardener/pull/10208. I think we were "in a hurry" for this feature as it was causing issues on our side. I am not sure if that is still the case now that we have the worker pools with the larger machine sizes. For the GRM webhook approach or the upstream feature request (https://github.com/kubernetes/autoscaler/issues/7147) - I didn't look into all the details, but the open problem there is rather conceptual: we want the maxAllowed value to be per Pod, and for Pods with many containers the question is how to distribute the Pod-level maxAllowed value to the containers.
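One possible answer to the distribution question above would be to split the Pod-level budget across containers in proportion to their current requests. The following is an illustrative sketch of that idea only, not anything implemented in Gardener or upstream VPA; the function, container names, and numbers are all hypothetical.

```python
# Illustrative sketch: distribute a Pod-level maxAllowed budget across
# containers, proportional to each container's current resource requests.

def distribute_max_allowed(pod_max, container_requests):
    """Split a Pod-level cap across containers, proportional to requests."""
    result = {}
    for resource, budget in pod_max.items():
        total = sum(req.get(resource, 0) for req in container_requests.values())
        for name, req in container_requests.items():
            # Fall back to an even split if no container requests the resource.
            share = req.get(resource, 0) / total if total else 1 / len(container_requests)
            result.setdefault(name, {})[resource] = budget * share
    return result

# Pod-level cap of 4000m CPU; two containers requesting 300m and 100m.
caps = distribute_max_allowed(
    {"cpu": 4000},
    {"app": {"cpu": 300}, "sidecar": {"cpu": 100}},
)
print(caps)
# → {'app': {'cpu': 3000.0}, 'sidecar': {'cpu': 1000.0}}
```

A real implementation would also have to decide how to round the shares back to valid Kubernetes quantities and how to treat containers without requests.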

ialidzhikov avatar Oct 14 '24 09:10 ialidzhikov

OK, but what now? Shall we close this PR (it doesn't seem to be of much interest given its staleness) and wait for the upstream improvement? If you want to continue with the PR, I still vote for a more generic approach as explained above.

rfranzke avatar Oct 15 '24 14:10 rfranzke

The Gardener project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:

  • After 15d of inactivity, lifecycle/stale is applied
  • After 15d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 7d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close

/lifecycle rotten

gardener-ci-robot avatar Oct 30 '24 15:10 gardener-ci-robot

@ialidzhikov: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-gardener-check-renovate-config
Commit: 3c952d8b2bc690aff378112d66e0fa69087c5ddf
Required: true
Rerun command: /test pull-gardener-check-renovate-config

Full PR test history. Your PR dashboard. Command help for this repository. Please help us cut down on flakes by linking this test failure to an open flake report or filing a new flake report if you can't find an existing one. Also see our testing guideline for how to avoid and hunt flakes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

gardener-prow[bot] avatar Oct 31 '24 10:10 gardener-prow[bot]

The Gardener project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:

  • After 15d of inactivity, lifecycle/stale is applied
  • After 15d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 7d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten

/close

gardener-ci-robot avatar Nov 07 '24 10:11 gardener-ci-robot

@gardener-ci-robot: Closed this PR.

In response to this:

The Gardener project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:

  • After 15d of inactivity, lifecycle/stale is applied
  • After 15d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 7d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

gardener-prow[bot] avatar Nov 07 '24 10:11 gardener-prow[bot]