
GCP: Cluster Autoscaler only discovers first MIG when multiple MIGs share prefix in different zones


Description: We've identified an issue with Cluster Autoscaler (CAS) where it fails to discover and manage all MIGs when multiple MIGs share the same name prefix but exist in different zones.

Version:

  • Cluster Autoscaler: v1.27.5
  • Chart: cluster-autoscaler-9.21.1
  • Cloud provider: GCP (GKE)

Current Behavior:

  • When using auto-discovery with a prefix pattern, CAS only discovers and manages the first MIG it finds.
  • Additional MIGs with the same prefix in other zones are marked with "no node group config".
  • Once a zone is cached for a MIG prefix, CAS never discovers similar MIGs in other zones.

Expected Behavior: CAS should discover and manage all MIGs that match the specified prefix pattern, regardless of their zone.

Reproduction Steps:

  • Have a GKE cluster with node pools spanning multiple zones; each node pool creates two MIGs with the same prefix in different zones.
  • Configure CAS with auto-discovery as shown below:

      containers:
      - command:
        - ./cluster-autoscaler
        - --cloud-provider=gce
        - --namespace=kube-system
        - --node-group-auto-discovery=mig:namePrefix=gke-gke-cluster-oa5a-enpla9up21-spot-,min=2,max=6
        - --balance-similar-node-groups=true
        - --expander=priority
        - --logtostderr=true
        - --max-node-provision-time=5m
        - --min-replica-count=0
        - --scale-down-delay-after-add=15m
        - --scale-down-delay-after-delete=5m
        - --scale-down-delay-after-failure=3m
        - --scale-down-enabled=true
        - --scale-down-unneeded-time=5m
        - --scale-down-utilization-threshold=0.7
        - --scan-interval=1m
        - --skip-nodes-with-local-storage=false
        - --skip-nodes-with-system-pods=true
        - --stderrthreshold=info
        - --v=4

MIGs with the same prefix across two availability zones:

[Screenshot: GCP console listing both MIGs]
Logs:
I0127 05:19:47.609819       1 autoscaling_gce_client.go:504] found managed instance group gke-gke-cluster-oa5a-enpla9up21-spot--8a413476-grp matching regexp ^gke-gke-cluster-oa5a-enpla9up21-spot.+
I0127 05:19:47.761732       1 mig_info_provider.go:185] Regenerating MIG instances cache for production-platform-402407/us-east4-b/gke-gke-cluster-oa5a-enpla9up21-spot--8a413476-grp
// Only processes nodes from one MIG, ignores others
I0127 05:19:47.834646       1 pre_filtering_processor.go:67] Skipping gke-gke-cluster-oa5a-enpla9up21-spot--8a413476-g3de - node group min size reached (current: 2, min: 2)
I0127 05:19:47.834663       1 pre_filtering_processor.go:57] Node gke-gke-cluster-oa5a-enpla9up21-spot--ba82cb9b-6v9m should not be processed by cluster autoscaler (no node group config)  // this node belongs to a different MIG in the same node pool
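
To make the failure mode concrete, here is a minimal, runnable Go sketch. It is not the actual cluster-autoscaler code; migRef, discoverFirstMatch, and the second zone/MIG name (us-east4-c and the -ba82cb9b-grp suffix, inferred from the unmanaged node's name) are illustrative assumptions. It models a discovery pass that stops after the first prefix match, which reproduces the "no node group config" symptom for the sibling zone:

// Illustrative sketch only -- NOT cluster-autoscaler source code.
package main

import (
	"fmt"
	"strings"
)

// migRef is a hypothetical identifier: project, zone, and MIG name.
type migRef struct {
	project, zone, name string
}

// discoverFirstMatch mimics a buggy discovery pass that stops at the first
// MIG matching the prefix instead of scanning every zone for matches.
func discoverFirstMatch(all []migRef, prefix string) map[migRef]bool {
	cache := make(map[migRef]bool)
	for _, m := range all {
		if strings.HasPrefix(m.name, prefix) {
			cache[m] = true
			return cache // bug: first match wins, other zones are never scanned
		}
	}
	return cache
}

func main() {
	// First MIG taken from the logs above; the second zone and MIG name
	// are assumed for illustration, derived from the unmanaged node's prefix.
	all := []migRef{
		{"production-platform-402407", "us-east4-b", "gke-gke-cluster-oa5a-enpla9up21-spot--8a413476-grp"},
		{"production-platform-402407", "us-east4-c", "gke-gke-cluster-oa5a-enpla9up21-spot--ba82cb9b-grp"},
	}
	managed := discoverFirstMatch(all, "gke-gke-cluster-oa5a-enpla9up21-spot-")
	for _, m := range all {
		if !managed[m] {
			// The reported symptom: the sibling MIG's nodes have no node group.
			fmt.Printf("%s/%s: no node group config\n", m.zone, m.name)
		}
	}
}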

Debraj-git avatar Jan 27 '25 11:01 Debraj-git

/label kind-bug

rpsadarangani avatar Jan 27 '25 15:01 rpsadarangani

@rpsadarangani: The label(s) /label kind-bug cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor, ci-short, ci-extended, ci-full. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/label kind-bug

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jan 27 '25 15:01 k8s-ci-robot

/area provider-gcp

rpsadarangani avatar Jan 27 '25 15:01 rpsadarangani

@rpsadarangani: The label(s) area/provider-gcp cannot be applied, because the repository doesn't have them.

In response to this:

/area provider-gcp

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jan 27 '25 15:01 k8s-ci-robot

/area cluster-autoscaler

adrianmoisey avatar Jan 27 '25 16:01 adrianmoisey

/area provider/gcp

Shubham82 avatar Feb 18 '25 05:02 Shubham82

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 19 '25 06:05 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jun 18 '25 07:06 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jul 18 '25 08:07 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 18 '25 08:07 k8s-ci-robot

Encountered the same issue. However, going through the code, I figured out that the autoscaler requires the --regional flag to work properly here: without it, the GCE provider only looks for MIGs in a single zone (i.e., the first one it resolves).

Once the --regional flag is set, a prefix of my-pool-us-central1 covers both my-pool-us-central1-a and my-pool-us-central1-b.
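
For reference, a minimal sketch of the fix against the reporter's config above (same chart and flags; --regional=true is the only addition):

      containers:
      - command:
        - ./cluster-autoscaler
        - --cloud-provider=gce
        - --regional=true    # treat the cluster as regional so MIG discovery spans every zone in the region
        - --node-group-auto-discovery=mig:namePrefix=gke-gke-cluster-oa5a-enpla9up21-spot-,min=2,max=6
        # ...all remaining flags unchanged from the config above

If deploying via the Helm chart mentioned above, the flag can typically be passed through the chart's extraArgs values rather than by editing the container command directly.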

noe-charmet avatar Aug 13 '25 03:08 noe-charmet