cluster-api-provider-aws

Regression - unable to create worker pool without specifying subnet filters

Open oz123 opened this issue 2 years ago • 4 comments

/kind bug

What steps did you take and what happened:

I used the following template to create a cluster:

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSCluster
metadata:
  name: test-capi-oz
  namespace: default
spec:
  region: us-east-2
  network:
    vpc:
      cidrBlock: 10.50.0.0/16
    subnets:
    - availabilityZone: us-east-2a
      cidrBlock: 10.50.0.0/20
      isPublic: true
      tags:
        test-capi-oz: us-east-2a
....
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: test-capi-oz-mp-0
  namespace: default
spec:
  clusterName: test-capi-oz
  failureDomains:
    - "us-east-2a"
    - "us-east-2b"
    - "us-east-2c"
  replicas: 3
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfig
          name: test-capi-oz-mp-0
      clusterName: test-capi-oz
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AWSMachinePool
        name: test-capi-oz-mp-0
      version: v1.24.0
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSMachinePool
metadata:
  name: test-capi-oz-mp-0
  namespace: default
spec:
  availabilityZones:
  - us-east-2
  awsLaunchTemplate:
    iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
    instanceType: t3.large
    sshKeyName: oznt
  maxSize: 4
  minSize: 3

The master node starts, but no worker nodes are launched. The logs show the following error repeatedly:

E1214 21:34:52.883240       1 controller.go:317] controller/awsmachinepool "msg"="Reconciler error" "error"="failed to create AWSMachinePool: getting subnets for ASG: getting subnets for spec azs: getting subnets for availability zone us-east-2: no subnets found for supplied availability zone" "name"="test-capi-oz-mp-0" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSMachinePool"

What did you expect to happen:

I expected 3 worker nodes to launch across the specified failure domains. Instead, no worker nodes are started at all.

Anything else you would like to add:

I believe the core issue is that I didn't specify subnets. I used the default template generated by following the instructions here:

export EXP_MACHINE_POOL=true
clusterctl init --infrastructure aws
clusterctl generate cluster my-cluster --kubernetes-version v1.24.0 --flavor machinepool > my-cluster.yaml

I believe the issue was introduced by the removal of the code that read subnets from the cluster spec:

(screenshot of the removed code block; the same snippet is quoted further below)

Taken from pkg/cloud/services/autoscaling/autoscalinggroup.go in https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/3255/files.

As a workaround, I added the following to my cluster and template specs:

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSCluster
metadata:
  name: test-capi-oz
  namespace: default
spec:
  region: us-east-2
  network:
    vpc:
      cidrBlock: 10.50.0.0/16
    subnets:
    - availabilityZone: us-east-2a
      cidrBlock: 10.50.0.0/20
      isPublic: true
      tags:
        test-capi-oz: us-east-2a
    - availabilityZone: us-east-2a
      cidrBlock: 10.50.16.0/20
      tags:
        test-capi-oz: us-east-2a
    - availabilityZone: us-east-2b
      cidrBlock: 10.50.32.0/20
      isPublic: true
      tags:
        test-capi-oz: us-east-2b
    - availabilityZone: us-east-2b
      cidrBlock: 10.50.48.0/20
      tags:
        test-capi-oz: us-east-2b
    - availabilityZone: us-east-2c
      cidrBlock: 10.50.64.0/20
      isPublic: true
      tags:
        test-capi-oz: us-east-2c
    - availabilityZone: us-east-2c
      cidrBlock: 10.50.80.0/20
      tags:
        test-capi-oz: us-east-2c
  sshKeyName: oznt

and

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSMachinePool
metadata:
  name: test-capi-oz-mp-0
  namespace: default
spec:
  availabilityZones:
  - us-east-2
  awsLaunchTemplate:
    iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
    instanceType: t3.large
    sshKeyName: oznt
  maxSize: 4
  minSize: 1
  subnets:
  - id: "subnet-05697af9a2aed1d9e"
  #- filters: []
  - filters:
    - name: "test-capi-oz"
      values:
       - "us-east-2a"
       - "us-east-2b"
       - "us-east-2c"

Figuring this out required piecing things together from the code and from the docs: https://cluster-api-aws.sigs.k8s.io/topics/failure-domains/control-planes.html?highlight=cidrBlo#using-failuredomain-in-network-object-of-awsmachine
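
As a side note, it helped me to double-check which subnets such a tag filter would actually match before applying the manifests. Here is a minimal sketch using aws-sdk-go-v2 (this is only a debugging aid, not CAPA's own filter handling, and it assumes the filters above are meant to select subnets tagged test-capi-oz=<az>):

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithRegion("us-east-2"))
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	// List subnets carrying the tag test-capi-oz with one of the expected AZ values.
	out, err := client.DescribeSubnets(context.TODO(), &ec2.DescribeSubnetsInput{
		Filters: []ec2types.Filter{
			{Name: aws.String("tag:test-capi-oz"), Values: []string{"us-east-2a", "us-east-2b", "us-east-2c"}},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, subnet := range out.Subnets {
		fmt.Printf("%s\t%s\n", aws.ToString(subnet.SubnetId), aws.ToString(subnet.AvailabilityZone))
	}
}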

I see a few ways to fix this issue:

First, better documentation explaining that one must add subnets to both the cluster spec and the AWSMachinePool; an explicit example like my templates above would help people following the same path. Second, adding subnets to the default template. Finally, consider restoring the following block:

// Fall back to the cluster's subnets when the spec selected none.
if len(subnetIDs) == 0 {
	for _, subnet := range scope.InfraCluster.Subnets() {
		subnetIDs = append(subnetIDs, subnet.ID)
	}
}

Before calling

subnetIDs, err := s.SubnetIDs(scope)

(Alternatively, this could be added at the top of s.SubnetIDs.)
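
For illustration, here is a self-contained sketch of the fallback behaviour I have in mind (the types and the function name are stand-ins for this example, not CAPA's actual API):

package main

import "fmt"

// Subnet is a stand-in for CAPA's subnet spec type.
type Subnet struct{ ID string }

// resolveSubnetIDs returns the subnets selected by the machine pool spec if
// there are any, and otherwise falls back to the subnets of the owning cluster.
func resolveSubnetIDs(specSubnetIDs []string, clusterSubnets []Subnet) []string {
	if len(specSubnetIDs) > 0 {
		return specSubnetIDs
	}
	ids := make([]string, 0, len(clusterSubnets))
	for _, subnet := range clusterSubnets {
		ids = append(ids, subnet.ID)
	}
	return ids
}

func main() {
	clusterSubnets := []Subnet{{ID: "subnet-aaa"}, {ID: "subnet-bbb"}}
	// Nothing selected on the AWSMachinePool spec: fall back to the cluster's subnets.
	fmt.Println(resolveSubnetIDs(nil, clusterSubnets)) // [subnet-aaa subnet-bbb]
	// Subnets explicitly selected on the spec take precedence.
	fmt.Println(resolveSubnetIDs([]string{"subnet-ccc"}, clusterSubnets)) // [subnet-ccc]
}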

I would be happy to contribute a PR with the above suggestions.

Environment:

  • Cluster-api-provider-aws version: registry.k8s.io/cluster-api-aws/cluster-api-aws-controller:v1.5.2
  • Kubernetes version: (use kubectl version):
 $ k version -o yaml
clientVersion:
  buildDate: "2022-11-25T08:23:01Z"
  compiler: gc
  gitCommit: 434bfd82814af038ad94d62ebe59b133fcb50506
  gitTreeState: archive
  gitVersion: v1.25.3
  goVersion: go1.19.2
  major: "1"
  minor: "25"
  platform: linux/amd64
kustomizeVersion: v4.5.7
serverVersion:
  buildDate: "2022-10-25T19:35:11Z"
  compiler: gc
  gitCommit: 434bfd82814af038ad94d62ebe59b133fcb50506
  gitTreeState: clean
  gitVersion: v1.25.3
  goVersion: go1.19.2
  major: "1"
  minor: "25"
  platform: linux/amd64
  • OS (e.g. from /etc/os-release): gentoo

oz123 avatar Dec 15 '22 12:12 oz123

/triage accepted

Skarlso avatar Dec 16 '22 21:12 Skarlso

@oz123 please go ahead with a PR for this if you are interested.

Ankitasw avatar Dec 22 '22 13:12 Ankitasw

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot avatar Jan 19 '24 23:01 k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 18 '24 23:04 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar May 18 '24 23:05 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jun 18 '24 00:06 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

(the /close not-planned comment above, quoted verbatim)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jun 18 '24 00:06 k8s-ci-robot