
Add awareness in VPA recommender/admission plugin about worker node specs

iamahgoub opened this issue 1 year ago · 8 comments

Which component are you using?: vertical-pod-autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

A pod can be scheduled on a wide range of EC2 instance types from different families/generations, i.e. the underlying processors of these instance types vary in performance. The VPA recommender/admission plugin does not take this into account: the recommended value for a container's CPU request is the same regardless of the specs of the hosting worker node. This leads to waste or performance degradation when the pod is scheduled on an EC2 instance type different from the one it was running on earlier.

More details about the environment:

  • EKS/AWS
  • Karpenter for provisioning worker nodes; the environment runs almost entirely on Spot instances, and a high degree of instance diversification is applied to reduce the chance of falling back to on-demand instances.
  • VPA is not currently in use, but it is being evaluated; VPA mode being considered is Initial.

Describe the solution you'd like.:

The VPA recommender needs to calculate the recommended CPU based on the worker node specs/EC2 instance type (e.g. for m5 the recommended CPU is 1, for m7i it is 750m, etc.). The VPA admission plugin should check the hosting worker node before mutating the CPU request, and use the corresponding value calculated by the recommender.
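
Something like the following illustrates the desired lookup. This is purely hypothetical and not an existing VPA API: a per-instance-type CPU target, keyed by the value of the node's well-known `node.kubernetes.io/instance-type` label, applied to the container spec with a default fallback (instance types and values are illustrative only).

```go
// Hypothetical per-instance-type CPU targets; not an existing VPA API.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// cpuTargets would be produced by the recommender, keyed by the value of the
// node's node.kubernetes.io/instance-type label.
var cpuTargets = map[string]string{
	"m5.xlarge":  "1",
	"m7i.xlarge": "750m",
}

const defaultCPUTarget = "900m" // used when an instance type has no entry

// applyCPUTarget writes the per-instance-type CPU target (or the default)
// into the container's resource requests.
func applyCPUTarget(c *corev1.Container, instanceType string) {
	target, ok := cpuTargets[instanceType]
	if !ok {
		target = defaultCPUTarget
	}
	if c.Resources.Requests == nil {
		c.Resources.Requests = corev1.ResourceList{}
	}
	c.Resources.Requests[corev1.ResourceCPU] = resource.MustParse(target)
}

func main() {
	c := corev1.Container{Name: "app"}
	applyCPUTarget(&c, "m7i.xlarge")
	fmt.Println(c.Resources.Requests.Cpu().String()) // 750m
}
```

As discussed further down in this thread, at admission time the node is not yet known, so this only shows the desired lookup, not where it could actually run.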

Describe any alternative solutions you've considered.:

  • Using VPA in Auto mode to adjust the allocated CPU after scheduling, compensating for the difference in processor performance. However, adjusting after scheduling leads to a pod restart/re-creation, and by then the pod may be running on a different worker node. In-place pod resize would solve this problem, but it is yet to graduate to beta in k8s 1.32.
  • Building a custom controller (a rough sketch of the node lookup such a controller could rely on is shown after this list)
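
As a rough sketch of the custom-controller alternative: the controller would wait until the pod has a node assigned and then read the node's well-known `node.kubernetes.io/instance-type` label. What it does with the result (in-place resize, eviction, annotation) is left open; all names below are illustrative.

```go
// Sketch of the "custom controller" alternative: once a Pod is scheduled,
// look up its node's instance type via the well-known
// node.kubernetes.io/instance-type label.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func instanceTypeForPod(ctx context.Context, cs kubernetes.Interface, namespace, podName string) (string, error) {
	pod, err := cs.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	if pod.Spec.NodeName == "" {
		return "", fmt.Errorf("pod %s/%s is not scheduled yet", namespace, podName)
	}
	node, err := cs.CoreV1().Nodes().Get(ctx, pod.Spec.NodeName, metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	return node.Labels["node.kubernetes.io/instance-type"], nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	it, err := instanceTypeForPod(context.Background(), cs, "default", "my-app-0")
	if err != nil {
		panic(err)
	}
	fmt.Println("instance type:", it)
}
```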

Additional context.:

iamahgoub avatar Oct 13 '24 10:10 iamahgoub

/area vertical-pod-autoscaler

adrianmoisey avatar Oct 14 '24 07:10 adrianmoisey

Just want to clarify: are you asking that the VPA change its recommendation based on the type of processor in the VM?

adrianmoisey avatar Oct 14 '24 07:10 adrianmoisey

Just want to clarify: are you asking that the VPA change its recommendation based on the type of processor in the VM?

Yes.

iamahgoub avatar Oct 14 '24 08:10 iamahgoub

I'm not sure if that's possible. The recommendation at the moment is based on history.

I don't think it's possible to predict what the type of processor will do to workloads. I also think this may change significantly based on the programming language used and what the workload is doing.

adrianmoisey avatar Oct 14 '24 08:10 adrianmoisey

How about the following: the recommender groups the historical data by processor type and maintains a recommendation for each. At admission time, the admission plugin identifies the processor type of the hosting worker node and mutates the CPU request with the corresponding recommendation. A default recommendation is used when there is no historical data for the identified processor type.
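
A toy illustration of that grouping idea (the real recommender uses decaying histograms and percentiles; this only shows per-instance-type bookkeeping with a global fallback, and all numbers are made up):

```go
// Toy sketch of grouping usage samples by instance type and keeping a
// per-group recommendation with a global fallback.
package main

import "fmt"

type groupedRecommender struct {
	samples map[string][]float64 // CPU usage in cores, keyed by instance type
	all     []float64            // all samples, used for the default recommendation
}

func (g *groupedRecommender) add(instanceType string, cpuCores float64) {
	if g.samples == nil {
		g.samples = map[string][]float64{}
	}
	g.samples[instanceType] = append(g.samples[instanceType], cpuCores)
	g.all = append(g.all, cpuCores)
}

// peak returns the largest sample; a stand-in for the real percentile logic.
func peak(xs []float64) float64 {
	m := 0.0
	for _, x := range xs {
		if x > m {
			m = x
		}
	}
	return m
}

// recommend returns a target for the given instance type, or the global
// target when no history exists for that type.
func (g *groupedRecommender) recommend(instanceType string) float64 {
	if xs, ok := g.samples[instanceType]; ok && len(xs) > 0 {
		return peak(xs)
	}
	return peak(g.all)
}

func main() {
	var g groupedRecommender
	g.add("m5.xlarge", 0.95)
	g.add("m5.xlarge", 1.02)
	g.add("m7i.xlarge", 0.70)
	fmt.Println(g.recommend("m7i.xlarge")) // 0.7
	fmt.Println(g.recommend("c6a.xlarge")) // 1.02 (no history: global fallback)
}
```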

iamahgoub avatar Oct 14 '24 08:10 iamahgoub

At admission time, the admission plugin identifies the processor type of the hosting worker node

Is this possible? My understanding is that webhooks run before persisting the Pod to etcd, meaning that the recommendation will be applied before a Pod is scheduled to a node, and therefore before knowing which CPU it will land on.

Effectively what is needed is a way to adjust requests after the Pod is scheduled, and that seems to be where In-place pod resize comes in.

adrianmoisey avatar Oct 14 '24 09:10 adrianmoisey

Effectively what is needed is a way to adjust requests after the Pod is scheduled, and that seems to be where In-place pod resize comes in.

Yes, in-place resize would help once it graduates (it is currently an alpha feature). It is a bit reactive, though; some time needs to elapse for the recommender to capture new data points and update the recommendation.
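
For reference, once in-place resize is usable, adjusting a request after scheduling could look roughly like this with client-go. This assumes the InPlacePodVerticalScaling feature gate is enabled and the cluster version exposes the Pod `resize` subresource; pod and container names are made up.

```go
// Sketch: lower a running Pod's cpu request via the resize subresource,
// without recreating the Pod (requires in-place pod resize support).
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Lower the cpu request of container "app" to 750m in place.
	patch := []byte(`{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"750m"}}}]}}`)
	_, err = cs.CoreV1().Pods("default").Patch(
		context.Background(), "my-app-0",
		types.StrategicMergePatchType, patch,
		metav1.PatchOptions{}, "resize", // the Pod resize subresource
	)
	if err != nil {
		panic(err)
	}
}
```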

iamahgoub avatar Oct 14 '24 09:10 iamahgoub

some time needs to elapse for the recommender to capture new data points and update the recommendation

I think this is the reality of the situation though. I don't see a way to predict a workload unless up-to-date history is provided.

adrianmoisey avatar Oct 14 '24 10:10 adrianmoisey

@iamahgoub I'm not entirely sure I understood the scenario you're describing here. My understanding is that you have different VM types for your nodes and that your workload could be scheduled on any of those VM types. That would also mean that when a Pod reaches the admission-controller, it is not yet clear where it is going to be scheduled – it is even the other way around: the amount of resources that the admission-controller ends up putting into the PodSpec will influence where the Pod will be scheduled.

So no matter what the internals of the recommender might look like and how it computes a recommendation: You don't know upfront where the Pod is going to end up and therefore you won't be able to adjust your resources accordingly.

/close

feel free to re-open with additional information if I got things fundamentally wrong here.

voelzmo avatar Oct 28 '24 10:10 voelzmo

@voelzmo: Closing this issue.

In response to this:

@iamahgoub I'm not entirely sure I understood the scenario you're describing here. My understanding is that you have different VM types for your nodes and that your workload could be scheduled on any of those VM types. That would also mean that when a Pod reaches the admission-controller, it is not yet clear where it is going to be scheduled – it is even the other way around: the amount of resources that the admission-controller ends up putting into the PodSpec will influence where the Pod will be scheduled.

So no matter what the internals of the recommender might look like and how it computes a recommendation: You don't know upfront where the Pod is going to end up and therefore you won't be able to adjust your resources accordingly.

/close

feel free to re-open with additional information if I got things fundamentally wrong here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Oct 28 '24 10:10 k8s-ci-robot