[Feature] Support multiple machine types for a single worker pool
How to categorize this issue?
/area control-plane
/area usability
/kind enhancement
/priority 3
What would you like to be added:
Today it is only possible to define a single machine type per worker pool. This is translated to one or more `MachineDeployment`s (one per zone per worker pool), where each `MachineDeployment` inherits the same machine type. If consumers wish to have a fallback machine type, they need to create another worker pool with a different machine type, and can optionally also define a priority expander to set priorities for machine types. This presents some challenges today:
- Creating additional worker pools just to define fallback VMs is wasteful w.r.t. capacity planning [impacts gardener]. It was also pointed out that this is not very convenient for consumers.
- @dguendisch mentioned that consumers get the regular expressions wrong in the priority expander, leading to unexpected decisions by the autoscaler (as reported by the customer) [impacts the customer]. -> We could allow customers to provide regular expressions only through the shoot YAML, and not create the `cluster-autoscaler-priority-expander` ConfigMap themselves (see the sketch after this list).
- There is an open issue which causes large backoffs by the cluster autoscaler when a specific machine quota is exhausted. We have identified a fix and will work on it with priority.
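For context, this is the shape of the `cluster-autoscaler-priority-expander` ConfigMap that the cluster autoscaler consumes today (the node-group name patterns below are made up for illustration). A single mistyped or overly broad regex silently changes which group wins, which is exactly the failure mode described above:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  # Keys are priorities (higher wins); values are regexes matched
  # against node-group names.
  priorities: |-
    50:
      - .*worker-pool-a.*   # preferred pool
    10:
      - .*worker-pool-b.*   # fallback pool
```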
There is a new ask to allow consumers to specify multiple machine types per worker pool (ordered by priority) and let MCM handle the responsibility of ensuring that a fallback machine type (next in the ordered list) is selected if none of the machine types above it are available.
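A minimal sketch of what such a worker-pool spec could look like; the `machineTypes` list is a hypothetical field and not part of today's Gardener worker API, which only supports a single `machineType` string:

```yaml
spec:
  provider:
    workers:
      - name: worker-a
        minimum: 1
        maximum: 10
        zones:
          - eu-west-1a
          - eu-west-1b
        # Hypothetical field: ordered by priority, first entry preferred.
        machineTypes:
          - m5.xlarge    # primary
          - m5a.xlarge   # fallback if m5.xlarge is unavailable
```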
Why is this needed: Improves usability from the consumers' perspective.
**MOM (16 March)** Attendees: @dkistner, @himanshu-kun, @dguendisch, @unmarshall
The requirement and the need for it were discussed and are now captured as part of this ticket (see above). Several approaches were discussed. A few points that should be kept in mind / thought about (from the discussion):
- Cluster autoscaler requires that each node group has a single machine type associated with it. How to translate multiple machine types per worker pool into node groups is therefore an open point.
- Cluster autoscaler already provides a priority expander, so one should be careful not to duplicate this functionality in MCM. There is already quite an overlap between MCM and CA, which today results in a lot of race conditions leading to non-deterministic outcomes.
- If we disallow creation of the priority-expander ConfigMap (via a validating webhook) and only create it via MCM, then it is possible that the next-in-line machine type never satisfies the resource requirements of a pending pod. The pod then remains unscheduled while MCM keeps launching new machines; the under-utilised machines are scaled down again by CA, resulting in a cycle.
- Leveraging CA makes more sense, as it will not launch machines of a (fallback) type if they cannot meet the resource requirements of unscheduled pods.
  - This can be done by allowing the customer to add multiple machine types per worker pool in the shoot YAML. The mapping then changes from the way it is now: one `MachineDeployment` corresponds to a tuple of (workerPool, zone, machineType) instead of (workerPool, zone). Priorities could then be defined per the ordered list of machine types per worker pool and a priority ConfigMap generated from them (see the sketch after this list).
- If fallback machines are launched for a customer and the desired machine type (first in the ordered list) later becomes available again, would rebalancing be done (an ask from @dkistner)? This was identified as complicated, and CA also does not offer it out of the box today. We still need to think about the merits of this requirement.
- A way to disallow the customer from using the priority ConfigMap was discussed: the ConfigMap could live only in the shoot namespace of the seed where the autoscaler is deployed (unlike today, where it is present in the kube-system namespace of the shoot and the customer is free to create or edit one). One possible enforcement mechanism is sketched after this list.
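To make the (workerPool, zone, machineType) mapping from the point above concrete, here is a sketch of what the generated objects could look like; the naming scheme and namespace are assumptions for illustration, not current MCM behaviour:

```yaml
# A pool "worker-a" with zones z1/z2 and machine types [m5.xlarge, m5a.xlarge]
# would expand into one MachineDeployment per (pool, zone, type) tuple, e.g.:
#   worker-a-z1-m5-xlarge,  worker-a-z1-m5a-xlarge
#   worker-a-z2-m5-xlarge,  worker-a-z2-m5a-xlarge
# ...and the priority ConfigMap could be generated from the ordered list:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: shoot--my-project--my-cluster   # shoot namespace in the seed
data:
  priorities: |-
    20:
      - .*worker-a-.*-m5-xlarge$    # first entry in machineTypes
    10:
      - .*worker-a-.*-m5a-xlarge$   # second entry (fallback)
```

Since the ConfigMap would be generated, consumers would never write regexes by hand, which addresses the pitfall @dguendisch raised above.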
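The notes above mention a validating webhook as one way to keep customers from writing the ConfigMap themselves. As an alternative sketch, assuming the shoot runs a Kubernetes version with `ValidatingAdmissionPolicy` available (admissionregistration.k8s.io/v1), the same restriction could be expressed declaratively; this is an illustration, not an agreed design:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: deny-priority-expander-writes
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["configmaps"]
  matchConditions:
    - name: is-priority-expander
      expression: "object.metadata.name == 'cluster-autoscaler-priority-expander'"
  validations:
    # Reject all user writes in the shoot; the managed copy lives in the seed.
    - expression: "false"
      message: "cluster-autoscaler-priority-expander is managed by Gardener; configure machine type priorities via the shoot spec."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: deny-priority-expander-writes
spec:
  policyName: deny-priority-expander-writes
  validationActions:
    - Deny
```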
Please add anything that I might have missed.