VPA - The ratio between CPU and memory should be maintained
Which component are you using?: vertical-pod-autoscaler
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.: On applications using garbage-collected memory, an increase in CPU usage can be the consequence of intensive GC activity. This happens in Go and Java when using memory primitives and settings like SetMemoryLimit or -Xmx: the CPU usage increases while the memory stays close to the limit. The value used to limit memory usage is usually a significant fraction of what the application can read from the cgroup. For this reason it is interesting to grow the memory in the same proportion as the CPU.
Describe the solution you'd like.:
Be able to constrain the VPA to maintain the ratio between the CPU and the memory. More generally, allow the user to define a constraint so that the ratio between resource type A and resource type B is maintained (the same proportion as in the initial user request and limit).
API proposition: add one more field to ContainerResourcePolicy that would contain the definition of the ratios to be maintained:
maintainRatios *[][2]v1.ResourceName
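For illustration, the addition could look roughly like this in the API types (a sketch only; ProposedRatioPolicy is a stand-in name for the real ContainerResourcePolicy, which also carries ContainerName, Mode, MinAllowed, MaxAllowed, and so on):

```go
// Illustrative sketch only: this is not existing upstream code, just a possible
// shape for the proposed field.
package v1

import corev1 "k8s.io/api/core/v1"

// ProposedRatioPolicy stands in for ContainerResourcePolicy, trimmed down to
// show only the proposed addition.
type ProposedRatioPolicy struct {
	// MaintainRatios lists ordered {source, derived} resource pairs. For each
	// {A, B} pair, the recommendation for B is derived from the recommendation
	// for A, preserving the ratio of the original container request.
	// +optional
	MaintainRatios *[][2]corev1.ResourceName `json:"maintainRatios,omitempty"`
}
```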
For example, a user will be able to define:
maintainRatios: {{"cpu","memory"}}
In that case the memory recommendation is calculated from the CPU recommendation, by applying the ratio found in the original pod spec.
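A minimal sketch of how that derivation could work (illustrative only, not recommender code; it only assumes the k8s.io/apimachinery resource package):

```go
// Minimal sketch: derive the memory recommendation from the CPU recommendation
// by preserving the memory/CPU ratio found in the original pod spec request.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// derivedMemory applies the original memory/CPU ratio to the new CPU
// recommendation. Quantities are compared through their milli-values to keep
// the arithmetic simple in this sketch.
func derivedMemory(origCPU, origMem, recommendedCPU resource.Quantity) resource.Quantity {
	ratio := float64(origMem.MilliValue()) / float64(origCPU.MilliValue())
	mem := int64(float64(recommendedCPU.MilliValue()) * ratio)
	return *resource.NewMilliQuantity(mem, resource.BinarySI)
}

func main() {
	origCPU := resource.MustParse("500m")
	origMem := resource.MustParse("1Gi")
	newCPU := resource.MustParse("1") // the recommender doubled the CPU

	// Memory follows in the same proportion: ~2Gi.
	fmt.Println(derivedMemory(origCPU, origMem, newCPU).String())
}
```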
Since the user will be able to add multiple constraints, we will have to ensure that the set of constraints can be represented as a directed acyclic graph. For example, validation would reject the following because it introduces a cycle:
maintainRatios: {{"cpu","memory"},{"memory","storage"},{"storage","cpu"}}
If this feature is used, the resource measurements will be ignored for all resources that are not roots of the graph. For example, with:
maintainRatios: {{"cpu","memory"}}
the memory measurements will be ignored; the value will be calculated instead.
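To make the validation concrete, here is a small sketch of the cycle check (illustrative only, not existing VPA code; it uses Kahn's algorithm over the proposed pairs, with a local ResourceName type standing in for v1.ResourceName):

```go
// Illustrative sketch of the proposed validation: reject a set of
// maintainRatios pairs that cannot be ordered as a DAG.
package main

import "fmt"

type ResourceName string // stand-in for v1.ResourceName in this sketch

// hasCycle returns true if the (source, derived) ratio pairs contain a cycle.
func hasCycle(pairs [][2]ResourceName) bool {
	inDegree := map[ResourceName]int{}
	edges := map[ResourceName][]ResourceName{}
	for _, p := range pairs {
		from, to := p[0], p[1]
		edges[from] = append(edges[from], to)
		inDegree[to]++
		if _, ok := inDegree[from]; !ok {
			inDegree[from] = 0
		}
	}
	// Kahn's algorithm: repeatedly remove nodes that have no incoming edges.
	queue := []ResourceName{}
	for n, d := range inDegree {
		if d == 0 {
			queue = append(queue, n)
		}
	}
	visited := 0
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		visited++
		for _, m := range edges[n] {
			inDegree[m]--
			if inDegree[m] == 0 {
				queue = append(queue, m)
			}
		}
	}
	// If some nodes were never removed, they sit on a cycle.
	return visited != len(inDegree)
}

func main() {
	fmt.Println(hasCycle([][2]ResourceName{{"cpu", "memory"}}))                                            // false: accepted
	fmt.Println(hasCycle([][2]ResourceName{{"cpu", "memory"}, {"memory", "storage"}, {"storage", "cpu"}})) // true: rejected
}
```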
Describe any alternative solutions you've considered.:
If the memory limit is set as described in the feature description, the VPA does not give any acceptable result for the application, so the application owner has to detect and understand the case and manually bump the memory request.
Additional context.:
We would like to use this feature at Datadog on top of several applications. We are happy to come and contribute to the project to integrate that feature if it makes sense for the community.
I'd like to make sure I understand the issue correctly.
The goal here is to support Java / Golang applications (or at least some of their configurations), and keeping a constant memory/CPU ratio is the idea for how to reach that goal?
I think we did some thinking on support for Java applications already. If that's what you need I'll look up older discussions about supporting Java.
#5029 is another issue about improving support for Java
This could help any language that allows the developer to put constraints on memory usage (not only Java).
We also wanted to make this feature generic enough to work with resources other than CPU and memory. We (at Datadog) are looking at making reservations for other resources like network and storage. I am perhaps anticipating a bit, but I think some processes will need to grow their network reservation linearly with their CPU reservation. With this feature the user would be able to define a "ratio" between the CPU and network bandwidth requests and ask the VPA to maintain that ratio as the CPU is scaled up/down.
A very common pattern for applications is to have a worker pool sized in relation to the number of available CPUs (for example a thread pool with CPU * N threads). Memory (or storage) then usually acts as working space for those workers. Such applications usually have back-pressure mechanisms that make sure you don't have too many requests queued per worker, which effectively caps the memory per worker.
If an application like that (which, again, is pretty common) moves from 1 CPU to 2 CPUs (that's going to be another need: being able to round up recommendations), it will also automatically require more memory, and the developers know it... but the VPA will have no idea, since the back-pressure will artificially limit the amount of used memory (which is a good thing! no OOMs!). This is a good example of why "cloud native" high-performance applications won't really give hints about their memory requirements.
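A toy sketch of that pattern, just to make it concrete (purely illustrative, not taken from any real application): the pool is sized from the CPU count, and a bounded queue provides the back-pressure that caps queued memory:

```go
// Toy sketch: a worker pool sized from the available CPUs, with a bounded
// queue that blocks producers and therefore caps the memory held in flight.
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	workers := runtime.GOMAXPROCS(0) * 4 // "CPU * N threads"
	jobs := make(chan []byte, workers*2) // bounded queue: back-pressure caps queued memory

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobs {
				_ = job // per-worker working space is bounded by pool and queue size
			}
		}()
	}

	for i := 0; i < 100; i++ {
		jobs <- make([]byte, 1<<20) // producers block once the queue is full
	}
	close(jobs)
	wg.Wait()
	fmt.Println("done with", workers, "workers")
}
```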
Now it's worse with:
- GC'd languages that can further limit their memory requirements (SetMemoryLimit() in Go, -Xmx in Java). The used memory will always be limited to a fraction of what has been requested (which is good, and necessary because you might have some off-heap allocations too, plus some VM overhead), and when you get close to this limit, it's the CPU load that will increase (to GC more and keep memory down)! A minimal sketch of this setup is included below.
- Applications that manage a local cache will do the reverse: they will always want to fill up all the memory, but empty the cache under memory pressure (same as the kernel buffer cache, but managed by the application). Such applications would always tell the VPA that they want more memory, but that isn't the case! Here again, if you have real memory pressure it's the CPU load that's going to increase (less cache = more CPU).
This feature would be necessary for all these applications that give no clear signal about their RAM requirements, but still require the pod shape to stay the same as we vertically upscale or downscale the CPU.
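For the GC'd-language case mentioned above, a minimal sketch of that setup (it assumes cgroup v2 and Go 1.19+; the path and the 90% headroom factor are illustrative choices, not something prescribed by this proposal):

```go
// Sketch: read the container's memory limit from the cgroup v2 file and hand
// a fraction of it to the Go runtime via debug.SetMemoryLimit (the
// programmatic form of GOMEMLIMIT).
package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

func main() {
	data, err := os.ReadFile("/sys/fs/cgroup/memory.max") // cgroup v2 limit
	if err != nil {
		return // not in a cgroup v2 container; keep runtime defaults
	}
	raw := strings.TrimSpace(string(data))
	if raw == "max" {
		return // no memory limit set
	}
	limit, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return
	}
	// Leave headroom for off-heap allocations and runtime overhead: use ~90%.
	debug.SetMemoryLimit(limit * 9 / 10)
	fmt.Println("GC memory limit set to", limit*9/10, "bytes")
}
```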
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.