Make VPA maxAllowed configurable in the Seed and Garden resource specs
How to categorize this PR?
/area auto-scaling /kind enhancement
What this PR does / why we need it: This PR is a rework of https://github.com/gardener/gardener/pull/10208.
For components running in the Seed cluster, we would like to prevent VPA from recommending more than the largest Node's allocatable resources. This is a known limitation of VPA.
This PR is a first step towards the final goal of having all VPAs covered. It currently covers the components at highest risk: etcd, the API servers, and prometheus.
As a next step, we would like to evaluate different approaches for covering all VPAs:
- Leverage a webhook in the Seed that mutates maxAllowed of all VPAs
- vpa-recommender: Make max allowed recommendation configurable
- File an upstream feature request again: implement in upstream VPA what GKE has, i.e. vpa-updater and vpa-admission-controller talking to cluster-autoscaler via cluster-autoscaler's ProvisioningRequest API.
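Whichever approach is chosen, the underlying computation is the same: derive maxAllowed from the largest Node's allocatable resources minus the DaemonSet Pods' requests, and cap recommendations at that value. A minimal sketch in Go; the types and numbers are illustrative stand-ins, not gardener code:

```go
package main

import "fmt"

// Resources holds per-Pod resource quantities in base units
// (CPU in millicores, memory in bytes). Hypothetical type for illustration.
type Resources struct {
	CPUMilli int64
	MemBytes int64
}

// maxAllowedFor derives a recommended maxAllowed from the largest Node's
// allocatable resources minus the DaemonSet Pods' requests on that Node.
func maxAllowedFor(largestNodeAllocatable, daemonSetRequests Resources) Resources {
	return Resources{
		CPUMilli: largestNodeAllocatable.CPUMilli - daemonSetRequests.CPUMilli,
		MemBytes: largestNodeAllocatable.MemBytes - daemonSetRequests.MemBytes,
	}
}

// clamp caps a VPA recommendation at maxAllowed, per resource.
func clamp(recommended, maxAllowed Resources) Resources {
	out := recommended
	if out.CPUMilli > maxAllowed.CPUMilli {
		out.CPUMilli = maxAllowed.CPUMilli
	}
	if out.MemBytes > maxAllowed.MemBytes {
		out.MemBytes = maxAllowed.MemBytes
	}
	return out
}

func main() {
	node := Resources{CPUMilli: 16000, MemBytes: 64 << 30} // 16 CPU, 64Gi node
	ds := Resources{CPUMilli: 500, MemBytes: 1 << 30}      // DaemonSet overhead
	max := maxAllowedFor(node, ds)

	rec := Resources{CPUMilli: 20000, MemBytes: 32 << 30} // VPA over-recommends CPU
	capped := clamp(rec, max)
	fmt.Println(capped.CPUMilli, capped.MemBytes) // CPU capped at 15500m, memory unchanged
}
```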
Which issue(s) this PR fixes: N/A
Special notes for your reviewer: N/A
Release note:
The VerticalPodAutoscaler's maximum allowed recommendation is now configurable in the Seed and Garden resources. The reasoning for this feature is a [known limitation](https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/README.md#known-limitations) where VPA is unaware of the largest Node's allocatable resources and can wrongly recommend resource requests which make the Pod unschedulable. To mitigate this issue, the Seed and Garden resources now support configuring VPA maxAllowed. Currently the effect of this field is limited to the components at highest risk, such as etcd, kube-apiserver, and prometheus. In the future, we would like all VPAs in the Seed and Garden runtime clusters to have a configurable maxAllowed, which we recommend setting to the largest cluster Node's allocatable resources (minus the DaemonSet Pods' resource requests).
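As a sketch of how such a configuration could look, based on the release note above; the exact field path under `spec.settings` is an assumption, not verified against the final API:

```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Seed
metadata:
  name: my-seed   # hypothetical Seed name
spec:
  settings:
    verticalPodAutoscaler:
      enabled: true
      # Cap VPA recommendations; ideally the largest Node's
      # allocatable resources minus the DaemonSet Pods' requests.
      maxAllowed:
        cpu: "8"
        memory: 32Gi
```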
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has not been approved by anyone yet. Once this PR has been reviewed and has the lgtm label, please ask for approval from ialidzhikov. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
pull-gardener-unit fails with:
```
2024/08/28 15:19:25 Failed executing local generator: some packages had errors:
type github.com/gardener/gardener/pkg/apis/core/v1beta1.NginxIngress cannot be converted to protobuf: unable to embed type "*k8s.io/api/core/v1.ServiceExternalTrafficPolicyType" as field "externalTrafficPolicy" in "github.com/gardener/gardener/pkg/apis/core/v1beta1.NginxIngress": did not recognize the provided type
make[1]: *** [/home/prow/go/src/github.com/gardener/gardener/Makefile:198: generate] Error 1
make[1]: Leaving directory '/home/prow/go/src/github.com/gardener/gardener'
make: *** [Makefile:142: check-generate] Error 1
```
Locally, `make check-generate` works for me:

```
% make check-generate
> Generate
>> make generate
>> make tidy
```
🤔
/test pull-gardener-unit
OK, it is not failing for this PR only. The same failure occurs in https://github.com/gardener/gardener/pull/10400#issuecomment-2316859765
@oliver-goetz can it be related to go1.23 or any other recent change?
cc @voelzmo @plkokanov @vlerenc
/assign
It seems to be related to a new version of Protocol Buffers v27.4. The maintainers recreated the release yesterday (see https://github.com/gardener/gardener/pull/10416).
/test all
```
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:264","func":"sigs.k8s.io/prow/pkg/entrypoint.gracefullyTerminate","level":"error","msg":"Process gracefully exited before 10m0s grace period","severity":"error","time":"2024-08-29T13:33:28Z"}
```
I rebased to latest master.
/assign
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
The Gardener project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
- After 15d of inactivity, `lifecycle/stale` is applied
- After 15d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 7d of inactivity since `lifecycle/rotten` was applied, the PR is closed

You can:
- Mark this PR as fresh with `/remove-lifecycle stale`
- Mark this PR as rotten with `/lifecycle rotten`
- Close this PR with `/close`
/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
I am wondering about the approach of "hard-coding" this into each and every component. Currently, it's only done for `etcd`, `kube-apiserver`, and `prometheus`, but I guess we would like to have all components running in the seed configured with this `maxAllowed` setting, right? This is at least what the TODO statement suggests. In this light, wouldn't it make more sense to introduce a webhook in `gardener-resource-manager` that handles this across the components in the seed? This way, we could easily handle components deployed via extensions as well.
The implementation plan was already prepared. In this PR I simply reworked/reopened https://github.com/gardener/gardener/pull/10208. I think we were "in a hurry" for this feature as it was causing issues on our side. I am not sure if that is still the case after we got the worker pools with the larger machine sizes. As for the GRM webhook approach or the upstream feature request (https://github.com/kubernetes/autoscaler/issues/7147): I didn't look into all the details, but the open problem there is rather theoretical, namely that we want the maxAllowed value to be per Pod, and for Pods with many containers the question is how to distribute the Pod-level maxAllowed value across the containers.
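To illustrate that open question, here is one possible policy, distributing a Pod-level maxAllowed across containers proportionally to their recommended requests. This is purely a hypothetical sketch of one option, not something proposed in this PR:

```go
package main

import "fmt"

// distribute splits a Pod-level maxAllowed (in CPU millicores here) across
// containers proportionally to their recommended requests. Other policies
// (equal split, priority-based, etc.) are equally conceivable.
func distribute(podMax int64, recommended []int64) []int64 {
	var total int64
	for _, r := range recommended {
		total += r
	}
	out := make([]int64, len(recommended))
	if total == 0 {
		return out
	}
	for i, r := range recommended {
		out[i] = podMax * r / total
	}
	return out
}

func main() {
	// Three containers recommended 6000m, 3000m, 1000m; Pod-level cap 8000m.
	fmt.Println(distribute(8000, []int64{6000, 3000, 1000}))
}
```

Integer division makes the shares sum to at most the Pod-level cap; a real implementation would also have to decide where any rounding remainder goes.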
OK, but what now? Shall we close this PR (it doesn't seem to be of much interest given its staleness) and wait for the upstream improvement? If you want to continue with the PR, I still vote for a more generic approach as explained above.
The Gardener project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
- After 15d of inactivity, `lifecycle/stale` is applied
- After 15d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 7d of inactivity since `lifecycle/rotten` was applied, the PR is closed

You can:
- Mark this PR as fresh with `/remove-lifecycle rotten`
- Close this PR with `/close`
/lifecycle rotten
@ialidzhikov: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-gardener-check-renovate-config | 3c952d8b2bc690aff378112d66e0fa69087c5ddf | link | true | /test pull-gardener-check-renovate-config |
Full PR test history. Your PR dashboard. Command help for this repository. Please help us cut down on flakes by linking this test failure to an open flake report or filing a new flake report if you can't find an existing one. Also see our testing guideline for how to avoid and hunt flakes.
The Gardener project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
- After 15d of inactivity, `lifecycle/stale` is applied
- After 15d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 7d of inactivity since `lifecycle/rotten` was applied, the PR is closed

You can:
- Reopen this PR with `/reopen`
- Mark this PR as fresh with `/remove-lifecycle rotten`
/close
@gardener-ci-robot: Closed this PR.