metrics-server feat(helm chart): Make env vars configurable and auto configure go runtime

What this PR does / why we need it: This PR extends the metrics-server Helm chart with support for configuring environment variables. It also automatically configures the Go runtime (GOMAXPROCS and GOMEMLIMIT) so that the runtime is aware of the resources assigned to the metrics-server and addon-resizer containers. This reduces the likelihood of CPUThrottlingHigh paging and OOM crashes.
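To check whether the two variables actually reach a container, one option is to ask the Go runtime what it ended up with. The following is a minimal sketch and not part of the chart; `effectiveSettings` is a hypothetical helper name. It relies on `runtime.GOMAXPROCS(0)` and `debug.SetMemoryLimit(-1)`, both of which only query the current value.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/debug"
)

// effectiveSettings reports what the Go runtime is actually using.
// runtime.GOMAXPROCS(0) and debug.SetMemoryLimit(-1) only read the
// current values; neither call changes them.
func effectiveSettings() (procs int, memLimit int64) {
	return runtime.GOMAXPROCS(0), debug.SetMemoryLimit(-1)
}

func main() {
	procs, memLimit := effectiveSettings()

	// Print the raw environment variables next to the effective values so
	// that a missing or ignored variable is easy to spot.
	fmt.Printf("GOMAXPROCS env=%q effective=%d (NumCPU=%d)\n",
		os.Getenv("GOMAXPROCS"), procs, runtime.NumCPU())
	fmt.Printf("GOMEMLIMIT env=%q effective=%d bytes\n",
		os.Getenv("GOMEMLIMIT"), memLimit)
}
```

Running it once with and once without `GOMAXPROCS` and `GOMEMLIMIT` set shows the difference between the runtime defaults and the tuned values.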

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #

Jan 26 '24 12:01 sslavic

Welcome @sslavic!

It looks like this is your first PR to kubernetes-sigs/metrics-server 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/metrics-server has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. :smiley:

Jan 26 '24 12:01 k8s-ci-robot

Hi @sslavic. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Jan 26 '24 12:01 k8s-ci-robot

@stevehipwell PTAL

Jan 29 '24 22:01 sslavic

Thanks @sslavic, this looks good to me in principle.

@serathius can you see any issue with setting GOMAXPROCS & GOMEMLIMIT by default?

/ok-to-test

Jan 30 '24 10:01 stevehipwell

Could we please move this forward to unblock 0.7.0 chart release?

Feb 07 '24 14:02 asychev

/triage accepted
/assign @stevehipwell

Feb 08 '24 18:02 dgrisonnet

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sslavic

Once this PR has been reviewed and has the lgtm label, please ask for approval from stevehipwell. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

charts/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Feb 08 '24 19:02 k8s-ci-robot

@dgrisonnet I'm happy from the Helm perspective but I'd like a second opinion from one of the core maintainers.

@sslavic could you add an entry under the [UNRELEASED] section in the Helm chart CHANGELOG covering what you've done?

Feb 14 '24 10:02 stevehipwell

Chart CHANGELOG has been updated, @stevehipwell PTAL

Feb 14 '24 12:02 sslavic

@serathius can you see any issue with setting GOMAXPROCS & GOMEMLIMIT by default?

Hard to say. I haven't seen those variables used anywhere in the Kubernetes ecosystem, maybe because of a lack of awareness, maybe because they don't bring any tangible benefit. I would not recommend making this a default setting without prior testing.

Feb 15 '24 09:02 serathius

Looking at https://github.com/golang/go/issues/33803, it proposes setting GOMAXPROCS=max(1, floor(cpu_quota)), which implies that what this PR proposes is not very good.
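For reference, the rule quoted from that issue can be written out as a small Go function. This is only a sketch of the formula itself; `maxProcsFromQuota` is a hypothetical name, and how the quota is obtained (cgroup files, pod spec) is left out.

```go
package main

import (
	"fmt"
	"math"
)

// maxProcsFromQuota applies max(1, floor(cpu_quota)).
// cpuQuota is the CPU quota expressed in cores, for example 2.5.
func maxProcsFromQuota(cpuQuota float64) int {
	procs := int(math.Floor(cpuQuota))
	if procs < 1 {
		return 1 // never go below one P, even for fractional quotas
	}
	return procs
}

func main() {
	// Illustrative quota values, not measurements.
	for _, quota := range []float64{0.25, 1, 2.5, 4} {
		fmt.Printf("quota=%.2f cores -> GOMAXPROCS=%d\n", quota, maxProcsFromQuota(quota))
	}
}
```

Note that a fractional quota below one core still yields GOMAXPROCS=1, and anything above is rounded down.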

Feb 15 '24 09:02 serathius

what PR proposes is not very good

@serathius can you please expand on this? Please also consider and compare the tradeoffs against the current state, where the Go runtime for metrics-server (and its sidecar) is left at its defaults, which e.g. for GKE-managed metrics-server results in a lot of CPUThrottlingHigh paging.

This PR is a workaround for the Go runtime issue https://github.com/golang/go/issues/33803

Many projects use https://github.com/uber-go/automaxprocs as a workaround.

automaxprocs uses CPU limits. CPU limits are typically not set on containers in k8s (and rightfully so, e.g. see https://home.robusta.dev/blog/stop-using-cpu-limits). The default resources assigned to metrics-server don't set limits either. Therefore, this PR auto-configures the Go runtime based on CPU requests.
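As a rough illustration of deriving GOMAXPROCS from a CPU request instead of a limit, here is a sketch in Go. The rounding direction (up to a whole core) is an assumption made for this example, not something stated in the thread, and `maxProcsFromRequest` is a hypothetical helper, not code from the chart.

```go
package main

import "fmt"

// maxProcsFromRequest converts a CPU request in millicores into a GOMAXPROCS
// value. Rounding up to a whole core is an assumption of this sketch.
func maxProcsFromRequest(requestMilli int64) int {
	if requestMilli <= 0 {
		return 1 // no usable request: fall back to a single P
	}
	return int((requestMilli + 999) / 1000) // integer ceiling of millicores/1000
}

func main() {
	// Illustrative request values in millicores.
	for _, milli := range []int64{100, 1000, 1500, 4000} {
		fmt.Printf("request=%dm -> GOMAXPROCS=%d\n", milli, maxProcsFromRequest(milli))
	}
}
```

With rounding up, a small fractional request still maps to one P, while a request such as 1500m maps to two. A floor-based rule would map the same 1500m to one.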

Using automaxprocs would be even more invasive than the approach this PR proposes.

GOMEMLIMIT is very useful too, but relatively new, so we can't expect many projects to be using it at this point.
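GOMEMLIMIT has a programmatic counterpart, `debug.SetMemoryLimit` (available since Go 1.19), which makes the effect easy to try out. The sketch below sets a soft limit as a percentage of a container memory limit. The 90% headroom, the 200Mi value and the `softLimitBytes` helper are all assumptions chosen for illustration, not values taken from this PR.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// softLimitBytes returns percent% of the container memory limit.
// Leaving headroom below the hard limit is an assumption of this sketch;
// integer arithmetic avoids floating-point rounding surprises.
func softLimitBytes(containerLimitBytes, percent int64) int64 {
	return containerLimitBytes / 100 * percent
}

func main() {
	containerLimit := int64(200 * 1024 * 1024) // illustrative 200Mi limit
	limit := softLimitBytes(containerLimit, 90)

	// SetMemoryLimit returns the previous limit; a negative argument
	// only queries the current one without changing it.
	previous := debug.SetMemoryLimit(limit)
	fmt.Printf("previous limit=%d bytes, new limit=%d bytes\n",
		previous, debug.SetMemoryLimit(-1))
}
```

The limit is soft: the garbage collector works harder as the heap approaches it, but the runtime can still exceed it, so it complements rather than replaces the container memory limit.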

The new defaults can be opted out of completely, or tuned, by

adjusting CPU requests / memory limits, and/or by
disabling automatic tuning and configuring additional environment variables to one's liking.

IMO these new defaults make metrics-server better out of the box, reducing the chance of CPUThrottlingHigh paging. The hope is that GKE-managed metrics-server will have this change propagated to it too 🤞🏻

Feb 15 '24 10:02 sslavic

@serathius is there a reason why MS couldn't use uber-go/automaxprocs?

Feb 15 '24 19:02 stevehipwell

automaxprocs is based on CPU limits, so for metrics-server, which has no limits by default, it wouldn't change anything. That is good from a backward compatibility perspective, but not for the goal, which is to reduce the chance of the CPU throttling high issue by default, out of the box. Btw, there are articles devoted to this issue, see https://github.com/robusta-dev/alert-explanations/wiki/CPUThrottlingHigh-on-metrics-server-(Prometheus-alert) by @aantn. IMO it's not good that e.g. on GKE the only option is to silence the alert and let metrics-server misbehave, living with its Go runtime not being configured.

Using automaxprocs has another downside compared to the solution proposed in this PR: it can't be opted out of as easily, and we'd need at least extra env vars support for that. Even then it wouldn't be as effective, e.g. when it comes to the ease of propagating the high CPU throttling fix by default to managed metrics-server services like the one on GKE.

Feb 16 '24 13:02 sslavic

Uber may open-source an automaxprocs equivalent for GOMEMLIMIT: https://github.com/uber-go/automaxprocs/issues/56#issuecomment-1381005231

I still think env vars calculated in the infra code from the assigned resources are more transparent, lighter weight, less invasive and more flexible than using the libraries.

Feb 16 '24 14:02 sslavic

@serathius is there a reason why MS couldn't use uber-go/automaxprocs?

No, just someone needs to test it, compile results, show improvement, and send a PR.

@serathius can you please expand on this?

Just that the proposed solution is not a complete fix, and without tests showing an improvement we should not enable it by default.

My suggestion would be to keep MS components and helm releases consistent. If we want to add envs in helm, I would recommend not making the Go envs default, but waiting for the binary to test and adopt https://github.com/uber-go/automaxprocs

Feb 16 '24 15:02 serathius

PR needs rebase.

Apr 04 '24 23:04 k8s-ci-robot

I ran into this while proposing similar changes in other Helm charts, inspired by https://github.com/traefik/traefik-helm-chart/pull/1029. I'm definitely not the expert on whether or not these changes are actually beneficial, but on that PR there are a series of references that look quite promising to me. Maybe they can help in understanding the possible benefits of merging this?

Apr 30 '24 15:04 jnoordsij

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Jul 29 '24 16:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Aug 28 '24 16:08 k8s-triage-robot