kpt-config-sync
[WIP] Configure VPA for reconciler, if enabled
- Configure VPA for the reconciler Deployment using an annotation on the RootSync/RepoSync: `configsync.gke.io/reconciler-autoscaling-strategy`
  - Auto - evict and recreate pods to apply recommended resource values, as needed.
  - Recommend - monitor and record recommended resource values for each reconciler, but don't automatically apply them.
  - Disabled - do not apply any VPA config, and delete any existing VPA with the same name as the reconciler.
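As a sketch, opting a RootSync into autoscaling with this annotation might look like the following (the repo URL and spec fields are illustrative, not part of this change):

```yaml
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
  annotations:
    # Strategy from this PR: Auto, Recommend, or Disabled
    configsync.gke.io/reconciler-autoscaling-strategy: Auto
spec:
  sourceFormat: unstructured
  git:
    repo: https://github.com/example/repo  # illustrative
    branch: main
    auth: none
```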
- VPA disabled by default (opt-in for preview and testing)
- When VPA is enabled, set smaller resource requests/limits for smaller footprint on initial install. Adding limits helps hasten VPA adjustments by causing OOMKills, instead of waiting for the VPA to evict the pod.
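For reference, a VPA object the reconciler-manager might generate for a root reconciler could look roughly like this (object names are assumptions based on the reconciler Deployment naming; only the `autoscaling.k8s.io/v1` API shape is standard):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: root-reconciler  # assumed: same name as the reconciler Deployment
  namespace: config-management-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: root-reconciler
  updatePolicy:
    # "Auto" evicts and recreates pods to apply recommendations;
    # "Off" corresponds to the Recommend strategy (record only).
    updateMode: "Auto"
```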
- Move regular (non-VPA) defaults out of a ConfigMap and into the reconciler-manager code, next to the new VPA resource defaults. This should make them easier to keep in sync.
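A minimal sketch of what moving the defaults into reconciler-manager code could look like; the type, function name, and all resource values here are hypothetical, chosen only to show both sets of defaults living side by side in code:

```go
package main

import "fmt"

// containerDefaults holds resource defaults as quantity strings.
// Hypothetical type; values below are illustrative, not the actual
// Config Sync defaults.
type containerDefaults struct {
	CPURequest    string
	MemoryRequest string
	CPULimit      string
	MemoryLimit   string
}

// reconcilerDefaults returns the reconciler container defaults.
// With VPA enabled, requests are small for a small initial footprint,
// and limits are set so OOMKills trigger VPA adjustments sooner.
func reconcilerDefaults(vpaEnabled bool) containerDefaults {
	if vpaEnabled {
		return containerDefaults{
			CPURequest:    "10m",
			MemoryRequest: "64Mi",
			CPULimit:      "500m",
			MemoryLimit:   "128Mi",
		}
	}
	// Regular (non-VPA) defaults, previously sourced from a ConfigMap.
	return containerDefaults{
		CPURequest:    "50m",
		MemoryRequest: "200Mi",
	}
}

func main() {
	fmt.Println(reconcilerDefaults(true).MemoryLimit)
}
```

Keeping both sets in one function makes it harder for them to drift apart than when one lives in a ConfigMap and the other in code.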
- test: Install VPA on kind when --vpa is specified
- test: Enable the VPA addon when creating GKE clusters, if --vpa is specified
- test: Rewrite some e2e tests to handle resource defaults
- test: Log reconciler pod resources on test failure to help debug VPA.
Design: go/config-sync-reconciler-autoscaling
Bug: b/289388701
Depends On:
- https://github.com/GoogleContainerTools/kpt-config-sync/pull/763
- https://github.com/GoogleContainerTools/kpt-config-sync/pull/776
- https://github.com/GoogleContainerTools/kpt-config-sync/pull/777
- https://github.com/GoogleContainerTools/kpt-config-sync/pull/778
- https://github.com/GoogleContainerTools/kpt-config-sync/pull/780
- https://github.com/GoogleContainerTools/kpt-config-sync/pull/872
- https://github.com/GoogleContainerTools/kpt-config-sync/pull/876
Notes:
- The helm-sync container doesn't seem to handle OOMKills very well. The helm CLI is executed and exits with a killed message, but the container doesn't seem to recover and scale up fast enough to avoid e2e test timeouts.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from karlkfi. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/retest
/hold
This PR is on hold until the metrics-server is fixed to be highly available. When the metrics-server is unhealthy, Config Sync's API discovery breaks, which makes tests fail.
Curious whether this is still WIP or even relevant. If not, consider closing the PR and keeping the private branch, or converting it to a draft.