AKS
[BUG] Random autoscale error
Describe the bug
The cluster autoscaler stops working at random times and goes into an Initializing state. I fix it by setting the node pool scale method to manual and then back to autoscale.

To Reproduce
The error occurs at random; I can't reproduce it on demand.

Run command 'kubectl get configmap -n kube-system cluster-autoscaler-status -o yaml':
apiVersion: v1
data:
  status: |-
    Cluster-autoscaler status at 2024-06-26 05:17:03.13712722 +0000 UTC:
    Initializing
kind: ConfigMap
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/last-updated: 2024-06-26 05:17:03.13712722 +0000 UTC
  creationTimestamp: "2024-06-26T05:17:03Z"
  name: cluster-autoscaler-status
  namespace: kube-system
  resourceVersion: "****"
  uid: ****
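Until this is resolved, a small watch loop like the one below can flag when the status falls back to Initializing without dumping the whole ConfigMap each time (just a sketch against the cluster above; the 60-second interval and the plain echo "alert" are arbitrary placeholders):

```bash
#!/usr/bin/env bash
# Poll the cluster-autoscaler status ConfigMap and warn when it reports "Initializing".
# Sketch only: the polling interval and the echo-based alert are placeholders.
while true; do
  status=$(kubectl get configmap -n kube-system cluster-autoscaler-status \
    -o jsonpath='{.data.status}')
  if echo "$status" | grep -q '^Initializing'; then
    echo "$(date -u) cluster-autoscaler is stuck in Initializing"
  fi
  sleep 60
done
```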
When I change the scale method to manual and back to autoscale, it works fine:
apiVersion: v1
data:
  status: |+
    Cluster-autoscaler status at 2024-06-26 05:47:27.226243208 +0000 UTC:
    Cluster-wide:
      Health: Healthy (ready=2 unready=0 (resourceUnready=0) notStarted=0 longNotStarted=0 registered=2 longUnregistered=0)
        LastProbeTime: 2024-06-26 05:47:26.512785502 +0000 UTC m=+602.759926070
        LastTransitionTime: 2024-06-26 05:37:34.758946819 +0000 UTC m=+11.006087187
      ScaleUp: NoActivity (ready=2 registered=2)
        LastProbeTime: 2024-06-26 05:47:26.512785502 +0000 UTC m=+602.759926070
        LastTransitionTime: 2024-06-26 05:37:34.758946819 +0000 UTC m=+11.006087187
      ScaleDown: CandidatesPresent (candidates=1)
        LastProbeTime: 2024-06-26 05:47:26.512785502 +0000 UTC m=+602.759926070
        LastTransitionTime: 2024-06-26 05:42:35.843690766 +0000 UTC m=+312.090831134
    NodeGroups:
      Name: aks-poold4sv3-****-vmss
      Health: Healthy (ready=1 unready=0 (resourceUnready=0) notStarted=0 longNotStarted=0 registered=1 longUnregistered=0 cloudProviderTarget=1 (minSize=1, maxSize=3))
        LastProbeTime: 2024-06-26 05:47:26.512785502 +0000 UTC m=+602.759926070
        LastTransitionTime: 2024-06-26 05:37:34.758946819 +0000 UTC m=+11.006087187
      ScaleUp: NoActivity (ready=1 cloudProviderTarget=1)
        LastProbeTime: 2024-06-26 05:47:26.512785502 +0000 UTC m=+602.759926070
        LastTransitionTime: 2024-06-26 05:37:34.758946819 +0000 UTC m=+11.006087187
      ScaleDown: NoCandidates (candidates=0)
        LastProbeTime: 2024-06-26 05:47:26.512785502 +0000 UTC m=+602.759926070
        LastTransitionTime: 2024-06-26 05:37:34.758946819 +0000 UTC m=+11.006087187
      Name: aks-poold8lsv5-****-vmss
      Health: Healthy (ready=1 unready=0 (resourceUnready=0) notStarted=0 longNotStarted=0 registered=1 longUnregistered=0 cloudProviderTarget=1 (minSize=0, maxSize=10))
        LastProbeTime: 2024-06-26 05:47:26.512785502 +0000 UTC m=+602.759926070
        LastTransitionTime: 2024-06-26 05:37:34.758946819 +0000 UTC m=+11.006087187
      ScaleUp: NoActivity (ready=1 cloudProviderTarget=1)
        LastProbeTime: 2024-06-26 05:47:26.512785502 +0000 UTC m=+602.759926070
        LastTransitionTime: 2024-06-26 05:37:34.758946819 +0000 UTC m=+11.006087187
      ScaleDown: CandidatesPresent (candidates=1)
        LastProbeTime: 2024-06-26 05:47:26.512785502 +0000 UTC m=+602.759926070
        LastTransitionTime: 2024-06-26 05:42:35.843690766 +0000 UTC m=+312.090831134
kind: ConfigMap
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/last-updated: 2024-06-26 05:47:27.226243208 +0000 UTC
  creationTimestamp: "2024-06-26T05:37:22Z"
  name: cluster-autoscaler-status
  namespace: kube-system
  resourceVersion: "***"
  uid: ****
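For reference, the manual/autoscale toggle I use as a workaround can also be scripted with the Azure CLI (a sketch only; the resource group, cluster, and node pool names are placeholders, and the min/max counts match the first pool above):

```bash
# Disable the cluster autoscaler on the affected node pool (scale method becomes manual)...
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name poold4sv3 \
  --disable-cluster-autoscaler

# ...then re-enable it with the original limits (minSize=1, maxSize=3 for this pool).
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name poold4sv3 \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 3
```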
Environment (please complete the following information):
- Kubernetes version 1.29.4
Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure
Can you please file a support ticket the next time this happens and update it here?
This issue has been automatically marked as stale because it has not had any activity for 30 days. It will be closed if no further activity occurs within 7 days of this comment. @kevinkrp93
This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. @barkep, feel free to comment again within the next 7 days to reopen it, or open a new issue after that time if you still have a question, issue, or suggestion.