cluster-api icon indicating copy to clipboard operation
cluster-api copied to clipboard

MachineSet with wrong MachineSetNameLabel value in .spec.selector spawns infinite Machines

Open damdo opened this issue 7 months ago • 8 comments

What steps did you take and what happened?

Scaling up a CAPI MachineSet with a .spec.selector.matchLabels that has a k/v clusterv1.MachineSetNameLabel (cluster.x-k8s.io/set-name) with value that doesn't match the .metadata.name of the MachineSet, causes a infinite-loop of child Machines being created.

Reproducing steps:

  • a CAPI MachineSet is created with matching .spec.selector.matchLabels and .spec.template.metadata.labels 's cluster.x-k8s.io/set-name but with its .metadata.name differing from them
  • the CAPI MachineSet is scaled up to 1 replica by the user
  • the CAPI MachineSet controller goes to create a machine, it sets the .metadata.labels value for the key clusterv1.MachineSetNameLabel in the Machine by setting it to the parent MachineSet .metadata.name, ignoring what was specified in the .spec.template.metadata.labels of the parent MachineSet
  • the CAPI Machine that gets created from the scale up has .metadata.labels differing from the one its parent MachineSet uses to select/find its children in the .spec.selector.matchingLabels
  • the CAPI machineset controller as such is unable to keep track if its children, and keeps creating new ones as the diff between owned machines and spec.replicas is always < 0

What did you expect to happen?

Only one Machine being created.

Cluster API version

Reproduced in v1.9.z, might be in all supported versions.

Kubernetes version

v1.32

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug

/area machineset /area machine

damdo avatar May 07 '25 16:05 damdo

The MachineSet webhook defaults an empty selector to include the MachineSetNameLabel with the m.Name value, but there is currently no validation that the value of that label matches m.Name

I think the fix for this issue is just to check that any label that the MachineSet would override to a specific value when creating Machines, is only allowed to be set to that specific value in the selector.

JoelSpeed avatar May 07 '25 16:05 JoelSpeed

@chrischdi @sbueringer who should I tag for triaging this? Thanks!

damdo avatar May 12 '25 16:05 damdo

The MachineSet webhook defaults an empty selector to include the MachineSetNameLabel with the m.Name value, but there is currently no validation that the value of that label matches m.Name

I think the fix for this issue is just to check that any label that the MachineSet would override to a specific value when creating Machines, is only allowed to be set to that specific value in the selector.

That sounds reasonable.

We should then also add a corresponding validation for the MachineDeployments. As addition: we should propably also check the .metadata.labels on the machines are properly enforced (for MD, MS, KCP machines)

/priority important-longterm /triage accepted

chrischdi avatar May 14 '25 13:05 chrischdi

Wondering if we should move a step further and do not expose selectors in the API 🤔

fabriziopandini avatar May 14 '25 13:05 fabriziopandini

Wondering if we should move a step further and do not expose selectors in the API 🤔

Would that mean relying solely on owner references to find the Machines owned by the MachineSet?

JoelSpeed avatar May 14 '25 14:05 JoelSpeed

/help

chrischdi avatar May 14 '25 14:05 chrischdi

@chrischdi: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar May 14 '25 14:05 k8s-ci-robot

Wondering if we should move a step further and do not expose selectors in the API 🤔

Would that mean relying solely on owner references to find the Machines owned by the MachineSet?

I was thinking about only relying on the labels that we always add on the Machines

sbueringer avatar May 14 '25 15:05 sbueringer

We have similar problem. If we create MachineDeployment with empty selector, and mutating webhook does not work (for example, someone deleted mutatingwebhookconfigurations.admissionregistration.k8s.io), then CAPI starts to create infinite number of MachineSet.

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  namespace: d8-cloud-instance-manager
  name: "bob-worker"
spec:
  clusterName: "bob"
  selector: {}
  minReadySeconds: 0
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        node-group: "worker"
    spec:
      clusterName: "bob"
      bootstrap:
        dataSecretName: "worker-99612327"
      infrastructureRef:
        apiVersion: "infrastructure.cluster.x-k8s.io/v1beta2"
        kind: "VCDMachineTemplate"
        name: worker-99612327
      nodeDrainTimeout: 10m
      nodeDeletionTimeout: 10m
      nodeVolumeDetachTimeout: 10m

borg-z avatar Jul 18 '25 12:07 borg-z

and mutating webhook does not work (for example, someone deleted mutatingwebhookconfigurations.admissionregistration.k8s.io)

This is something that we do not support

fabriziopandini avatar Jul 18 '25 14:07 fabriziopandini