
Updating MachinePool, AWSMachinePool, and KubeadmConfig resources does not trigger an ASG instanceRefresh

wmgroot opened this issue 2 years ago · 10 comments

/kind bug

What steps did you take and what happened: Updating the MachinePool, AWSMachinePool, or KubeadmConfig resources does not trigger an instanceRefresh on the AWS ASG. Currently, I must start instance refreshes manually through the AWS console or CLI for instances to be replaced when my specs change.

What did you expect to happen: With awsmachinepool.spec.refreshPreferences.disable left at its default value of false, I expected changes to the MachinePool, AWSMachinePool, and KubeadmConfig to automatically trigger an instance refresh that rotates the nodes in the pool onto the updated settings.

These are the MachinePool, AWSMachinePool, and KubeadmConfig resources I'm working with.

---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  annotations:
    cluster.x-k8s.io/replicas-managed-by: external-autoscaler
  name: worker-private-efficient
spec:
  clusterName: awscmhdev2
  replicas: 2
  template:
    metadata: {}
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfig
          name: worker-private-efficient
      clusterName: awscmhdev2
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSMachinePool
        name: worker-private-efficient
      version: v1.24.7
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSMachinePool
metadata:
  annotations:
    cluster-api-provider-aws: "true"
  name: worker-private-efficient
spec:
  additionalTags:
    k8s.io/cluster-autoscaler/awscmhdev2: owned
    k8s.io/cluster-autoscaler/enabled: "true"
  awsLaunchTemplate:
    additionalSecurityGroups:
    - filters:
      - name: tag:Name
        values:
        - capi-hostports
      - name: tag:network-zone
        values:
        - qa
      - name: tag:region
        values:
        - us-east-2
    ami:
      id: ami-0f6b6efcd422b9d85
    iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
    instanceType: c5a.2xlarge
    rootVolume:
      deviceName: /dev/sda1
      encrypted: true
      iops: 16000
      size: 100
      throughput: 1000
      type: gp3
    sshKeyName: ""
  capacityRebalance: true
  defaultCoolDown: 5m0s
  maxSize: 5
  minSize: 0
  mixedInstancesPolicy:
    overrides:
    - instanceType: c5a.2xlarge
    - instanceType: m5a.2xlarge
  subnets:
  - filters:
    - name: availability-zone-id
      values:
      - use2-az1
    - name: tag:type
      values:
      - nodes
  - filters:
    - name: availability-zone-id
      values:
      - use2-az2
    - name: tag:type
      values:
      - nodes
  - filters:
    - name: availability-zone-id
      values:
      - use2-az3
    - name: tag:type
      values:
      - nodes
---
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfig
metadata:
  name: worker-private-efficient
spec:
  files:
  - content: |
      vm.max_map_count=262144
    path: /etc/sysctl.d/90-vm-max-map-count.conf
  - content: |
      fs.inotify.max_user_instances=256
    path: /etc/sysctl.d/91-fs-inotify.conf
  format: cloud-config
  joinConfiguration:
    nodeRegistration:
      kubeletExtraArgs:
        cloud-provider: aws
        eviction-hard: memory.available<500Mi,nodefs.available<10%
        kube-reserved: cpu=500m,memory=2Gi,ephemeral-storage=1Gi
        node-labels: role.node.kubernetes.io/worker=true
        protect-kernel-defaults: "true"
        system-reserved: cpu=500m,memory=1Gi,ephemeral-storage=1Gi
      name: '{{ ds.meta_data.local_hostname }}'
  preKubeadmCommands:
  - sudo systemctl restart systemd-sysctl

I have not set disable: true in my refreshPreferences in the AWSMachinePool spec.

$ kubectl explain awsmachinepool.spec.refreshPreferences.disable
KIND:     AWSMachinePool
VERSION:  infrastructure.cluster.x-k8s.io/v1beta2

FIELD:    disable <boolean>

DESCRIPTION:
     Disable, if true, disables instance refresh from triggering when new launch
     templates are detected. This is useful in scenarios where ASG nodes are
     externally managed.
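
For reference, the effective value can also be read straight from the object (empty output means the field is unset; add -n <namespace> if needed):

$ kubectl get awsmachinepool worker-private-efficient -o jsonpath='{.spec.refreshPreferences}'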

This is the current state of the runcmd in the AWS launch template, version 1566.

runcmd:
  - "sudo systemctl restart systemd-sysctl"
  - kubeadm join --config /run/kubeadm/kubeadm-join-config.yaml  && echo success > /run/cluster-api/bootstrap-success.complete

I apply a change to add a command to the KubeadmConfig, such as this.

  preKubeadmCommands:
  - sudo systemctl restart systemd-sysctl
  - echo "hello world"

I see that the LaunchTemplate has a new version, and I wait for 10 minutes.

I notice that there is no active instance refresh started for my ASG in the instance refresh tab, and my instances are still using the old LaunchTemplate version.
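
For anyone reproducing this, the same check can be done from the CLI (the ASG and launch template names below are placeholders):

$ aws autoscaling describe-instance-refreshes --auto-scaling-group-name <asg-name>
$ aws ec2 describe-launch-template-versions --launch-template-name <launch-template-name> --versions '$Latest'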

Environment:

  • Cluster-api-provider-aws version: v2.0.2
  • Kubernetes version: (use kubectl version):
$ kubectl version --short
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.25.1
Kustomize Version: v4.5.7
Server Version: v1.24.7

wmgroot · Feb 15 '23 18:02

This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · Feb 15 '23 18:02

We have experienced the same behavior using CAPA v1.5.2.

fiunchinho · Feb 20 '23 15:02

In the current code, this seems intentional. There are several comments, even in the machine pool controller, explicitly saying that changed user data should lead to a new launch template version but not to an instance refresh. I confirmed that it currently works that way.

https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/2354 explicitly turned off instance refresh when only the user data changed. That fix ensures the bootstrap token gets updated by automatically creating a new launch template version.

If we change the controller to trigger instance refresh, we would most likely see that nodes get rolled over every DefaultTokenTTL = 15 * time.Minute (for some reason it's 10 minutes in my case 🤷) because whenever the bootstrap token gets refreshed, the user data changes to include the new token value:

-   path: /run/kubeadm/kubeadm-join-config.yaml
    owner: root:root
    permissions: '0640'
    content: |
      ---
      apiVersion: kubeadm.k8s.io/v1beta3
      discovery:
        bootstrapToken:
          apiServerEndpoint: blabla-apiserver.example.com:6443
          caCertHashes:
          - sha256:45db4ee54dc2ea64cd30b4066fb4ccd17e5d871ac2cffc43932f984f3e76b22a
          token: 1234u7.abcd4lsoz1rujklo    # <======================= changes frequently

# [...]

What we could do is diff the user data without the token value and trigger an instance refresh only if there is an effective change.
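
For illustration, here is a minimal sketch of that comparison in Go. The function names, the hashing, and the token regex are mine (illustrative only), not CAPA's actual implementation:

package userdata

import (
	"crypto/sha256"
	"encoding/hex"
	"regexp"
)

// tokenRe matches the kubeadm bootstrap token line inside the rendered
// kubeadm-join-config.yaml (tokens have the form [a-z0-9]{6}.[a-z0-9]{16}).
var tokenRe = regexp.MustCompile(`token: [a-z0-9]{6}\.[a-z0-9]{16}`)

// hashIgnoringToken hashes the user data with the bootstrap token masked,
// so that token rotation alone does not change the hash.
func hashIgnoringToken(userData []byte) string {
	masked := tokenRe.ReplaceAll(userData, []byte("token: <redacted>"))
	sum := sha256.Sum256(masked)
	return hex.EncodeToString(sum[:])
}

// needsInstanceRefresh reports whether the user data changed in a way that
// should roll the machine pool nodes, i.e. anything other than the token.
func needsInstanceRefresh(oldUserData, newUserData []byte) bool {
	return hashIgnoringToken(oldUserData) != hashIgnoringToken(newUserData)
}

With a comparison like this, token-only rotations hash identically, while any other user data change would still trigger an instance refresh.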

AndiDog · Apr 13 '23 11:04

Since we agreed in chat to try the above suggestion, I gave it a shot. A draft implementation is ready, and I will open a PR once it's tested in a real scenario.

AndiDog · Apr 24 '23 18:04

Sounds good.

Skarlso · Apr 25 '23 07:04

My fix is working, but any change to the KubeadmConfig (such as adding something to spec.files) still only gets reconciled once the bootstrap token needs to be refreshed. From what I can see, only then does func (r *KubeadmConfigReconciler) rotateMachinePoolBootstrapToken update the user data secret, which in turn leads to machine pool nodes getting rolled out (on a relevant change, i.e. not only the bootstrap token). I will invest some more time in this, since users expect that any changes (or rollbacks in the heat of an incident!) get applied immediately, not only once the bootstrap token gets refreshed (default: 15/2 = 7.5 minutes). We may need a patch for CAPI. And even if the secret gets updated immediately, I'm not sure whether the AWSMachinePool controller reconciliation gets triggered by the secret change – so there might be another delay to fix once the first one is resolved.
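
To illustrate that last point, here is a hypothetical mapping function that a controller could register (together with a Watches call on Secrets and handler.EnqueueRequestsFromMapFunc) so that a bootstrap data secret change re-queues the owning machine pool. The label key and names are assumptions, and the signature assumes controller-runtime >= v0.15; this is a sketch, not what CAPA or CAPI actually do:

package controllers

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// secretToMachinePool maps a changed bootstrap data Secret back to the machine pool
// object that should be reconciled. The label key used for the lookup is hypothetical;
// a real controller would register this via handler.EnqueueRequestsFromMapFunc next to
// its primary AWSMachinePool watch.
func secretToMachinePool(_ context.Context, obj client.Object) []reconcile.Request {
	name, ok := obj.GetLabels()["cluster.x-k8s.io/machine-pool-name"] // hypothetical label key
	if !ok {
		return nil
	}
	return []reconcile.Request{{
		NamespacedName: client.ObjectKey{Namespace: obj.GetNamespace(), Name: name},
	}}
}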

AndiDog · Apr 25 '23 14:04

I wonder what the backstory around this is and why it's working the way it does. I'm pretty sure there are some valid reasons somewhere in there... :D

Skarlso · Apr 27 '23 08:04

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Jan 19 '24 07:01

/remove-lifecycle stale

A PR to fix this will be merged soon.

AndiDog · Jan 19 '24 08:01

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Apr 18 '24 09:04

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · May 18 '24 09:05

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot · Jun 17 '24 09:06

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · Jun 17 '24 09:06

/reopen /remove-lifecycle rotten

AndiDog · Sep 04 '24 07:09

@AndiDog: Reopened this issue.

In response to this:

/reopen /remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · Sep 04 '24 07:09

This was actually fixed via https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/4619.

/close

AndiDog · Sep 04 '24 07:09

@AndiDog: Closing this issue.

In response to this:

This was actually fixed via https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/4619.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · Sep 04 '24 07:09