autoscaler
GCP: Cluster Autoscaler only discovers first MIG when multiple MIGs share prefix in different zones
Description: We've identified an issue with Cluster Autoscaler (CAS) where it fails to discover and manage all MIGs when multiple MIGs share the same name prefix but exist in different zones.
Version: Cluster Autoscaler: v1.27.5 Chart version: cluster-autoscaler-9.21.1 Cloud Provider: GCP (GKE)
Current Behavior:
When using auto-discovery with a prefix pattern, CAS only discovers and manages the first MIG it finds. Additional MIGs with the same prefix in other zones are marked with "no node group config". Once a zone is cached for a MIG prefix, CAS never discovers similar MIGs in other zones.
Expected Behavior: CAS should discover and manage all MIGs that match the specified prefix pattern, regardless of their zone.
Reproduction Steps: Have a GKE cluster with nodepools spanning multiple zones. Each nodepool creates 2 MIGs with the same prefix in different zones.
Configured CAS with auto-discovery:
containers:
- command:
- ./cluster-autoscaler
- --cloud-provider=gce
- --namespace=kube-system
- --node-group-auto-discovery=mig:namePrefix=gke-gke-cluster-oa5a-enpla9up21-spot-,min=2,max=6
- --balance-similar-node-groups=true
- --expander=priority
- --logtostderr=true
- --max-node-provision-time=5m
- --min-replica-count=0
- --scale-down-delay-after-add=15m
- --scale-down-delay-after-delete=5m
- --scale-down-delay-after-failure=3m
- --scale-down-enabled=true
- --scale-down-unneeded-time=5m
- --scale-down-utilization-threshold=0.7
- --scan-interval=1m
- --skip-nodes-with-local-storage=false
- --skip-nodes-with-system-pods=true
- --stderrthreshold=info
- --v=4
MIGs with the same prefix across 2 AZs
Logs:
I0127 05:19:47.609819 1 autoscaling_gce_client.go:504] found managed instance group gke-gke-cluster-oa5a-enpla9up21-spot--8a413476-grp matching regexp ^gke-gke-cluster-oa5a-enpla9up21-spot.+
I0127 05:19:47.761732 1 mig_info_provider.go:185] Regenerating MIG instances cache for production-platform-402407/us-east4-b/gke-gke-cluster-oa5a-enpla9up21-spot--8a413476-grp
// Only processes nodes from one MIG, ignores others
I0127 05:19:47.834646 1 pre_filtering_processor.go:67] Skipping gke-gke-cluster-oa5a-enpla9up21-spot--8a413476-g3de - node group min size reached (current: 2, min: 2)
I0127 05:19:47.834663 1 pre_filtering_processor.go:57] Node gke-gke-cluster-oa5a-enpla9up21-spot--ba82cb9b-6v9m should not be processed by cluster autoscaler (no node group config) //this node belongs to a different MIG in the same nodepool
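For illustration, the regexp CAS derives from the namePrefix (visible in the log lines above) matches the MIG names from both zones, so both should be discovered. The second MIG name below is inferred from the node name in the last log line and is illustrative, not taken verbatim from the logs:

```python
import re

# Regexp CAS builds from --node-group-auto-discovery=mig:namePrefix=...
# (copied from the "found managed instance group ... matching regexp" log line)
pattern = re.compile(r"^gke-gke-cluster-oa5a-enpla9up21-spot.+")

migs = [
    "gke-gke-cluster-oa5a-enpla9up21-spot--8a413476-grp",  # us-east4-b, discovered
    "gke-gke-cluster-oa5a-enpla9up21-spot--ba82cb9b-grp",  # other zone (name inferred), skipped
]

results = [bool(pattern.match(m)) for m in migs]
print(results)  # both names match the prefix regexp
```

Both names match, which is why the "no node group config" skip looks like a discovery bug rather than a pattern mismatch.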
/label kind-bug
@rpsadarangani: The label(s) /label kind-bug cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor, ci-short, ci-extended, ci-full. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?
In response to this:
/label kind-bug
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/area provider-gcp
@rpsadarangani: The label(s) area/provider-gcp cannot be applied, because the repository doesn't have them.
In response to this:
/area provider-gcp
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/area cluster-autoscaler
/area provider/gcp
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Encountered the same issue. However, I figured out by going through the code that the app requires the --regional flag to work properly. Otherwise, the GCE provider will only look for MIGs in a single zone (i.e., the first one it finds).
Once the --regional flag is set, a prefix of my-pool-us-central1 covers both my-pool-us-central1-a and my-pool-us-central1-b.
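As a sketch of the workaround applied to the manifest from the report above (same flags as before, with --regional added; the prefix and min/max values are just carried over from the original config):

```yaml
containers:
- command:
  - ./cluster-autoscaler
  - --cloud-provider=gce
  - --regional=true   # treat the cluster as regional so MIGs in all zones are discovered
  - --namespace=kube-system
  - --node-group-auto-discovery=mig:namePrefix=gke-gke-cluster-oa5a-enpla9up21-spot-,min=2,max=6
```

With this set, the GCE provider enumerates MIGs across every zone of the region instead of caching only the first zone it encounters.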