VPA - The ratio between CPU and memory should be maintained
Which component are you using?: vertical-pod-autoscaler
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.: On applications using garbage-collected memory, an increase in CPU usage can be the consequence of intensive GC activity. This happens in Go and Java when using memory primitives and settings like SetMemoryLimit or -Xmx: the CPU usage increases while the memory stays close to the limit. The value used to limit memory usage is usually a significant fraction of what the application can read from the cgroup. For this reason it is interesting to grow the memory in the same proportion as the CPU.
Describe the solution you'd like.:
Be able to constrain the VPA to maintain the ratio between the CPU and the memory. More generally, allow the user to define a constraint so that the ratio between resource type A and resource type B is maintained (the same proportion as in the initial user request and limit).
API proposition: add one more field to ContainerResourcePolicy that would contain the definition of the ratios to be maintained:
maintainRatios *[][2]v1.ResourceName
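For illustration, the addition could look roughly like this in the API types (a sketch only; ProposedRatioPolicy is a stand-in name for the real ContainerResourcePolicy, which also carries ContainerName, Mode, MinAllowed, MaxAllowed, and so on):

```go
// Illustrative sketch only: this is not existing upstream code, just a possible
// shape for the proposed field.
package v1

import corev1 "k8s.io/api/core/v1"

// ProposedRatioPolicy stands in for ContainerResourcePolicy, trimmed down to
// show only the proposed addition.
type ProposedRatioPolicy struct {
	// MaintainRatios lists ordered {source, derived} resource pairs. For each
	// {A, B} pair, the recommendation for B is derived from the recommendation
	// for A, preserving the ratio of the original container request.
	// +optional
	MaintainRatios *[][2]corev1.ResourceName `json:"maintainRatios,omitempty"`
}
```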
For example, a user will be able to define:
maintainRatios: {{"cpu","memory"}}
In that case the memory recommendation is calculated from the CPU recommendation, by applying the ratio found in the original pod spec.
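A minimal sketch of how that derivation could work (illustrative only, not recommender code; it only assumes the k8s.io/apimachinery resource package):

```go
// Minimal sketch: derive the memory recommendation from the CPU recommendation
// by preserving the memory/CPU ratio found in the original pod spec request.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// derivedMemory applies the original memory/CPU ratio to the new CPU
// recommendation. Quantities are compared through their milli-values to keep
// the arithmetic simple in this sketch.
func derivedMemory(origCPU, origMem, recommendedCPU resource.Quantity) resource.Quantity {
	ratio := float64(origMem.MilliValue()) / float64(origCPU.MilliValue())
	mem := int64(float64(recommendedCPU.MilliValue()) * ratio)
	return *resource.NewMilliQuantity(mem, resource.BinarySI)
}

func main() {
	origCPU := resource.MustParse("500m")
	origMem := resource.MustParse("1Gi")
	newCPU := resource.MustParse("1") // the recommender doubled the CPU

	// Memory follows in the same proportion: ~2Gi.
	fmt.Println(derivedMemory(origCPU, origMem, newCPU).String())
}
```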
Since the user will be able to add multiple constraints, we will have to ensure that the set of constraints can be represented as a directed acyclic graph. For example, validation would reject the following because it introduces a cycle:
maintainRatios: {{"cpu","memory"},{"memory","storage"},{"storage","cpu"}}
If this feature is used, the resource measurements will be ignored for all resources that are not roots of the graph. For example, with:
maintainRatios: {{"cpu","memory"}}
the memory measurements will be ignored; the value will be calculated instead.
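To make the validation concrete, here is a small sketch of the cycle check (illustrative only, not existing VPA code; it uses Kahn's algorithm over the proposed pairs, with a local ResourceName type standing in for v1.ResourceName):

```go
// Illustrative sketch of the proposed validation: reject a set of
// maintainRatios pairs that cannot be ordered as a DAG.
package main

import "fmt"

type ResourceName string // stand-in for v1.ResourceName in this sketch

// hasCycle returns true if the (source, derived) ratio pairs contain a cycle.
func hasCycle(pairs [][2]ResourceName) bool {
	inDegree := map[ResourceName]int{}
	edges := map[ResourceName][]ResourceName{}
	for _, p := range pairs {
		from, to := p[0], p[1]
		edges[from] = append(edges[from], to)
		inDegree[to]++
		if _, ok := inDegree[from]; !ok {
			inDegree[from] = 0
		}
	}
	// Kahn's algorithm: repeatedly remove nodes that have no incoming edges.
	queue := []ResourceName{}
	for n, d := range inDegree {
		if d == 0 {
			queue = append(queue, n)
		}
	}
	visited := 0
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		visited++
		for _, m := range edges[n] {
			inDegree[m]--
			if inDegree[m] == 0 {
				queue = append(queue, m)
			}
		}
	}
	// If some nodes were never removed, they sit on a cycle.
	return visited != len(inDegree)
}

func main() {
	fmt.Println(hasCycle([][2]ResourceName{{"cpu", "memory"}}))                                            // false: accepted
	fmt.Println(hasCycle([][2]ResourceName{{"cpu", "memory"}, {"memory", "storage"}, {"storage", "cpu"}})) // true: rejected
}
```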
Describe any alternative solutions you've considered.:
If the memory limit is set as described in the feature description, the VPA does not give any acceptable result for the application, so the application owner has to detect and understand the case and manually bump the memory request.
Additional context.:
We would like to use this feature at Datadog on top of several applications. We are happy to come and contribute to the project to integrate that feature if it makes sense for the community.
I'd like to make sure I understand the issue correctly.
The goal here is to support Java / Golang applications (or at least some of their configurations), and keeping a constant memory/CPU ratio is the idea for how to reach that goal?
I think we did some thinking on support for Java applications already. If that's what you need I'll look up older discussions about supporting Java.
#5029 is another issue about improving support for Java
This could help any language that allows the developer to put constraints on memory usage (not only Java).
We also wanted to make this feature generic enough to work with resources other than CPU and memory. We (at Datadog) are looking at making reservations for other resources like network and storage. I am perhaps anticipating a bit, but I think some processes will need to grow their network reservation linearly with their CPU reservation. With this feature the user would be able to define a "ratio" between the CPU and network bandwidth requests and ask the VPA to maintain that ratio as the CPU is scaled up/down.
A very common pattern for applications is to have a worker pool sized in relation to the number of available CPUs (for example a thread pool with CPU * N threads). Memory (or storage) then usually acts as working space for those workers. Such applications usually have back-pressure mechanisms that make sure you don't have too many requests queued per worker, which effectively caps the memory per worker.
If an application like that (which, again, is pretty common) moves from 1 CPU to 2 CPUs (that's going to be another need: being able to round up recommendations), it will also automatically require more memory, and the developers know it... but the VPA will have no idea, since the back-pressure will artificially limit the amount of used memory (which is a good thing! no OOMs!). This is a good example of why "cloud native" high-performance applications won't really give hints about their memory requirements.
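A toy sketch of that pattern, just to make it concrete (purely illustrative, not taken from any real application): the pool is sized from the CPU count, and a bounded queue provides the back-pressure that caps queued memory:

```go
// Toy sketch: a worker pool sized from the available CPUs, with a bounded
// queue that blocks producers and therefore caps the memory held in flight.
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	workers := runtime.GOMAXPROCS(0) * 4 // "CPU * N threads"
	jobs := make(chan []byte, workers*2) // bounded queue: back-pressure caps queued memory

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobs {
				_ = job // per-worker working space is bounded by pool and queue size
			}
		}()
	}

	for i := 0; i < 100; i++ {
		jobs <- make([]byte, 1<<20) // producers block once the queue is full
	}
	close(jobs)
	wg.Wait()
	fmt.Println("done with", workers, "workers")
}
```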
Now it's worse with:
- GC'd languages that can further limit their memory requirements (SetMemoryLimit() in Go, -Xmx in Java). The used memory will always be limited to a fraction of what has been requested (which is good, and necessary because you might have some off-heap allocations too, plus some VM overhead), and when you get close to this limit, it's the CPU load that will increase (to GC more and keep memory down)! A minimal sketch of this setup is included below.
- Applications that manage a local cache will do the reverse: they will always want to fill up all the memory, but empty the cache under memory pressure (same as the kernel buffer cache, but managed by the application). Such applications would always tell the VPA that they want more memory, but that isn't the case! Here again, if you have real memory pressure it's the CPU load that's going to increase (less cache = more CPU).
This feature would be necessary for all these applications that give no clear signal about their RAM requirements, but still require the pod shape to stay the same as we vertically upscale or downscale the CPU.
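For the GC'd-language case mentioned above, a minimal sketch of that setup (it assumes cgroup v2 and Go 1.19+; the path and the 90% headroom factor are illustrative choices, not something prescribed by this proposal):

```go
// Sketch: read the container's memory limit from the cgroup v2 file and hand
// a fraction of it to the Go runtime via debug.SetMemoryLimit (the
// programmatic form of GOMEMLIMIT).
package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

func main() {
	data, err := os.ReadFile("/sys/fs/cgroup/memory.max") // cgroup v2 limit
	if err != nil {
		return // not in a cgroup v2 container; keep runtime defaults
	}
	raw := strings.TrimSpace(string(data))
	if raw == "max" {
		return // no memory limit set
	}
	limit, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return
	}
	// Leave headroom for off-heap allocations and runtime overhead: use ~90%.
	debug.SetMemoryLimit(limit * 9 / 10)
	fmt.Println("GC memory limit set to", limit*9/10, "bytes")
}
```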
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.