WorkerGroup update behavior in ReplicaSet pattern
/area control-plane /area auto-scaling /area scalability /area usability
What would you like to be added:
We would like Gardener to optionally allow workerGroup updates without the updates being applied immediately (i.e. no rolling update like a Deployment). Instead, the new configuration would be stored as a new version or generation next to the current one, initially scaled to 0, while the older generation stays available at its current scale (ReplicaSet update behavior).
With an optional taint it might be possible to prevent new workload from being scheduled on older nodes; otherwise the existing nodes could simply continue to be used. However, every new node should use only the newest configuration generation, and node groups offering older configurations might be excluded from scale-up by the cluster-autoscaler. It should also be possible to leave the older nodes untainted and uncordoned so that workload can float freely across the nodes, as managed by the services layer. One possible shape of such a configuration is sketched below.
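The following is a purely hypothetical sketch of how this could surface in the Shoot worker spec; the `rolloutBehavior` and `outdatedGenerationTaint` fields do not exist in the Gardener API and only illustrate the requested semantics:

```yaml
# Hypothetical sketch only – the update-behavior fields below do not exist
# in the Gardener API today; they merely illustrate the requested semantics.
spec:
  provider:
    workers:
      - name: worker-a
        machine:
          type: m5.xlarge
          image:
            name: gardenlinux
            version: 1312.3.0      # changing this would create a new generation
        minimum: 3
        maximum: 30
        # Keep existing nodes on their configuration generation instead of
        # rolling them; new nodes always use the newest generation.
        rolloutBehavior: ReplicaSet          # hypothetical; default: RollingUpdate
        # Optionally taint nodes of older generations so that restarted
        # workload is pushed to the newest generation (blue/green-like).
        outdatedGenerationTaint:             # hypothetical
          enabled: true
          effect: NoSchedule
```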
From the application side, outdated nodes need to be removed from the system after a while. This could be handled by the application infrastructure and explicitly left open from Gardener's side. We think this would be fine, since the user explicitly disables the rolling behavior and can enable it again at any time. The user can remove nodes from the system with kubectl drain, for example, or simply turn the rolling update behavior back on.
Why is this needed:
We have to manage and update several configuration dimensions on a regular basis: kubelet version, OS version, and worker configuration at the IaaS level (like IMDSv2). Today, every update is applied immediately. That means that when the configuration is changed outside of a maintenance window, the system might encounter unwanted, customer-visible unavailability.
To manage updates today, we manually create a new workerGroup for every update and host it next to the current configuration (blue/green, see the abridged example below). This either limits us to two versions, or we have to create as many workerGroups as we want distinct versions in the system. This is then multiplied by the rather static configuration dimensions of the workerGroups.
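For illustration, the current workaround looks roughly like the following abridged Shoot worker section (pool names, machine types, and versions are made up): every update adds a "green" pool next to the "blue" one, and the old pool is optionally tainted and scaled down over time.

```yaml
# Abridged, illustrative sketch of the current blue/green workaround
# (pool names, machine types, and versions are made up).
spec:
  provider:
    workers:
      - name: pool-blue          # old configuration, still serving workload
        machine:
          type: m5.xlarge
          image:
            name: gardenlinux
            version: 1312.2.0
        minimum: 3
        maximum: 30
        taints:                  # optional: push restarted workload to the new pool
          - key: deprecated
            value: "true"
            effect: NoSchedule
      - name: pool-green         # new configuration, scaled up over time
        machine:
          type: m5.xlarge
          image:
            name: gardenlinux
            version: 1312.3.0
        minimum: 0
        maximum: 30
```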
For the static configuration dimensions (worker type, worker size, etc.) we maintain 30+ variants, because we cannot forecast which combination will actually be used at runtime. All those variants are offered as node groups to the cluster-autoscaler, most of them scaled to 0 and only available as fallbacks or for corner cases. Multiplying this by the number of versions would easily overwhelm the cluster-autoscaler.
The system would scale better and would be easier to use if the ReplicaSet behavior from above were applied. We would only have to maintain the 30+ static configurations, and every version update of the kubelet or OS would create a new generation. Existing workers keep their respective configuration generation, so the customer would not see downtime, and we could apply the change at any time. The majority of the static workerGroup configurations are fallbacks and might be at scale 0 at the moment of the update; those older configurations can be removed from the system directly, which keeps the set of node groups offered to the cluster-autoscaler small.
With an optional taint on the older workerGroup configurations we can enforce the current blue/green pattern and move any restarted workload directly to the new configuration to push the update forward. This mode would not be active all the time, as it can create holes in the utilization.
To handle security updates that affect only specific versions, e.g. the oldest one or one in the middle, the application layer could implement its own logic to drain the affected nodes from the system.
Figure 1: "Rolling Update Deployment for WorkerGroups"

```mermaid
sequenceDiagram
    actor deploymentUpdater
    deploymentUpdater ->> Shoot: update worker.kubernetes.version
    activate Shoot
    participant MachineSet gen1
    participant node gen1
    create participant MachineSet gen2
    Shoot ->> MachineSet gen2: create
    create participant node gen2
    MachineSet gen2 ->> node gen2: create
    destroy node gen1
    MachineSet gen1 ->> node gen1: delete
    destroy MachineSet gen1
    MachineSet gen1 -->> Shoot: finished update
    Shoot -->> deploymentUpdater: shoot healthy
    deactivate Shoot
```
Figure 2: "ReplicaSet update for WorkerGroups"

```mermaid
sequenceDiagram
    actor deploymentUpdater
    deploymentUpdater ->> Shoot: update worker.kubernetes.version
    activate Shoot
    participant MachineSet gen1
    participant node gen1
    create participant MachineSet gen2
    Shoot ->> MachineSet gen2: create
    create participant node gen2
    MachineSet gen2 ->> node gen2: create
    MachineSet gen1 -->> Shoot: finished update
    Shoot -->> deploymentUpdater: shoot healthy
    deactivate Shoot
    actor updateService
    destroy node gen1
    updateService ->> node gen1: drain
    destroy MachineSet gen1
    Shoot ->> MachineSet gen1: delete empty outdated
```