
Cluster creation fails with an error, security group and subnet for an instance belong to different networks

Open pydctw opened this issue 3 years ago • 8 comments

/kind bug

What steps did you take and what happened: Creating a cluster using a ClusterClass fails, and the log shows that instance creation failed with the error: failed to run instance: InvalidParameter: Security group sg-0b2785eae128cccad and subnet subnet-8b13d7d6 belong to different networks

E0408 17:14:02.788518       1 awsmachine_controller.go:497]  "msg"="unable to create instance" "error"="failed to create AWSMachine instance: failed to run instance: InvalidParameter: Security group sg-0b2785eae128cccad and subnet subnet-8b13d7d6 belong to different networks.\n\tstatus code: 400, request id: 3b28054f-c5e8-439c-bac3-0dda24431a27" 

AWSCluster

spec:
  network:
    subnets:
    - availabilityZone: us-west-2a
      cidrBlock: 10.0.0.0/24
      id: subnet-0176a425f63781f71
      isPublic: false
      routeTableId: rtb-08275750c99fb2f3a
      tags:
        Name: cluster-ew1b45-subnet-private-us-west-2a
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/internal-elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: private
    - availabilityZone: us-west-2a
      cidrBlock: 10.0.1.0/24
      id: subnet-0464f24a3d364523f
      isPublic: true
      natGatewayId: nat-0d185215367997610
      routeTableId: rtb-051ced5fd65ae6600
      tags:
        Name: cluster-ew1b45-subnet-public-us-west-2a
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: public
    - availabilityZone: us-west-2b
      cidrBlock: 10.0.2.0/24
      id: subnet-07023ae2d872062bf
      isPublic: false
      routeTableId: rtb-001434b0c17f5b0f4
      tags:
        Name: cluster-ew1b45-subnet-private-us-west-2b
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/internal-elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: private
    - availabilityZone: us-west-2b
      cidrBlock: 10.0.3.0/24
      id: subnet-00b92bd396eef0bf2
      isPublic: true
      natGatewayId: nat-0baa238a24de3b142
      routeTableId: rtb-0e3643d5f8b441ed9
      tags:
        Name: cluster-ew1b45-subnet-public-us-west-2b
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: public

AWSMachine

spec:
  ami: {}
  cloudInit:
    secureSecretsBackend: secrets-manager
  iamInstanceProfile: control-plane.cluster-api-provider-aws.sigs.k8s.io
  instanceID: i-0ed41df7645f74b06
  instanceType: t3.large
  providerID: aws:///us-west-2b/i-0ed41df7645f74b06
  sshKeyName: cluster-api-provider-aws-sigs-k8s-io

While sg-0b2785eae128cccad belongs to the CAPA-created VPC, subnet-8b13d7d6 belongs to the default VPC in the region. Note that subnet-8b13d7d6 is not referenced in the AWSCluster or AWSMachine spec.
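
For anyone verifying this, the mismatch can be confirmed by looking up which VPC each resource belongs to. Below is a minimal, hypothetical Go sketch (not part of CAPA) using the AWS SDK for Go v1; the resource IDs and region are the ones from the log and cluster above.

// Minimal sketch to confirm that the security group and subnet from the error
// message live in different VPCs. Assumes default AWS credentials; not CAPA code.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
	svc := ec2.New(sess)

	sgOut, err := svc.DescribeSecurityGroups(&ec2.DescribeSecurityGroupsInput{
		GroupIds: []*string{aws.String("sg-0b2785eae128cccad")},
	})
	if err != nil {
		panic(err)
	}

	subnetOut, err := svc.DescribeSubnets(&ec2.DescribeSubnetsInput{
		SubnetIds: []*string{aws.String("subnet-8b13d7d6")},
	})
	if err != nil {
		panic(err)
	}

	// In the failing case these two VPC IDs differ: the security group is in the
	// CAPA-created VPC, the subnet is in the region's default VPC.
	fmt.Println("security group VPC:", aws.StringValue(sgOut.SecurityGroups[0].VpcId))
	fmt.Println("subnet VPC:        ", aws.StringValue(subnetOut.Subnets[0].VpcId))
}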

What did you expect to happen: Cluster creation is successful.

Anything else you would like to add: The same issue was reported by a coworker using a different ClusterClass. They are using CAPA v1.2.0, while I am using the main branch.

Also, this issue doesn't happen every time. I've created clusters multiple times and saw the issue only a few times.

Environment:

  • Cluster-api-provider-aws version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

pydctw avatar Apr 08 '22 21:04 pydctw

/triage accepted /priority critical-urgent /milestone v1.5.1

sedefsavas avatar Apr 08 '22 22:04 sedefsavas

@sedefsavas: The provided milestone is not valid for this repository. Milestones in this repository: [Backlog, V1.5.1, v0.6.10, v0.7.4, v1.5.0, v1.x, v2.x]

Use /milestone clear to clear the milestone.

In response to this:

/triage accepted /priority critical-urgent /milestone v1.5.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 08 '22 22:04 k8s-ci-robot

AFAIK this issue has never been observed in e2e tests without ClusterClass, so it may be triggered by, or related to, the inner workings of ClusterClass.

If so, I will reduce the priority accordingly.

sedefsavas avatar Apr 08 '22 23:04 sedefsavas

This is the first time I've seen the error, so I agree that it is ClusterClass related.

pydctw avatar Apr 11 '22 15:04 pydctw

This was such a fascinating and difficult issue to debug.

Observations

  • The issue happens randomly. Instance creation can fail for the 1st, 2nd, or 3rd control plane machine.
  • Cluster creation is successful most of the time and fails with this issue only occasionally.
  • A cluster that failed an e2e test due to a timeout while waiting for a control plane eventually created an instance, and the cluster became ready.

Debugging

For a failed instance creation, below is the input sent to the AWS API.

{
    "instancesSet": {
      "items": [
        {
          "imageId": "ami-093e132cf8ec45d77",
          "minCount": 1,
          "maxCount": 1,
          "keyName": "cluster-api-provider-aws-sigs-k8s-io"
        }
      ]
    },
    "groupSet": {
      "items": [
        {
          "groupId": "sg-07c3eb751181ac0ab"
        },
        {
          "groupId": "sg-05683bb88ffba846b"
        },
        {
          "groupId": "sg-08f3c5c87413f9212"
        }
      ]
    },
    "userData": "<sensitiveDataRemoved>",
    "instanceType": "t3.large",
    "blockDeviceMapping": {},
    "monitoring": {
      "enabled": false
    },
    "disableApiTermination": false,
    "disableApiStop": false,
    "clientToken": "96DAC283-22A0-4195-A496-78DAA918244B",
    "iamInstanceProfile": {
      "name": "control-plane.cluster-api-provider-aws.sigs.k8s.io"
    },
    "tagSpecificationSet": {
      "items": [
        {
          "resourceType": "instance",
          "tags": [
            {
              "key": "MachineName",
              "value": "functional-test-multi-az-clusterclass-n3nuim/cluster-qmul89-v45rg-8xm24"
            },
            {
              "key": "Name",
              "value": "cluster-qmul89-control-plane-n9994-2lrrt"
            },
            {
              "key": "kubernetes.io/cluster/cluster-qmul89",
              "value": "owned"
            },
            {
              "key": "sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89",
              "value": "owned"
            },
            {
              "key": "sigs.k8s.io/cluster-api-provider-aws/role",
              "value": "control-plane"
            }
          ]
        }
      ]
    }
  }

Compare it with the input for a successful case:

{
    "instancesSet": {
      "items": [
        {
          "imageId": "ami-093e132cf8ec45d77",
          "minCount": 1,
          "maxCount": 1,
          "keyName": "cluster-api-provider-aws-sigs-k8s-io"
        }
      ]
    },
    "groupSet": {
      "items": [
        {
          "groupId": "sg-07c3eb751181ac0ab"
        },
        {
          "groupId": "sg-05683bb88ffba846b"
        },
        {
          "groupId": "sg-08f3c5c87413f9212"
        }
      ]
    },
    "userData": "<sensitiveDataRemoved>",
    "instanceType": "t3.large",
    "blockDeviceMapping": {},
    "monitoring": {
      "enabled": false
    },
    "subnetId": "subnet-04069978047301fce",
    "disableApiTermination": false,
    "disableApiStop": false,
    "clientToken": "0DD45959-4F4F-442C-9C8A-24D6B49239DA",
    "iamInstanceProfile": {
      "name": "control-plane.cluster-api-provider-aws.sigs.k8s.io"
    },
    "tagSpecificationSet": {
      "items": [
        {
          "resourceType": "instance",
          "tags": [
            {
              "key": "MachineName",
              "value": "functional-test-multi-az-clusterclass-n3nuim/cluster-qmul89-v45rg-8xm24"
            },
            {
              "key": "Name",
              "value": "cluster-qmul89-control-plane-n9994-2lrrt"
            },
            {
              "key": "kubernetes.io/cluster/cluster-qmul89",
              "value": "owned"
            },
            {
              "key": "sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89",
              "value": "owned"
            },
            {
              "key": "sigs.k8s.io/cluster-api-provider-aws/role",
              "value": "control-plane"
            }
          ]
        }
      ]
    }
  }

The difference is that the failed case doesn't include subnetId, which makes AWS pick a default subnet for the instance, in this case a subnet in the default VPC.
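
That matches EC2's documented behavior: when RunInstances is called without a subnetId, the instance is launched into a default subnet of the account's default VPC, while the explicitly listed security groups still belong to the CAPA-created VPC, which produces the "belong to different networks" error. A hypothetical Go sketch of the call (AWS SDK for Go v1, values copied from the request dumps above; this is not CAPA's code):

package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// runControlPlaneInstance is a hypothetical helper showing why the missing
// subnetId matters; the IDs are the ones from the request dumps above.
func runControlPlaneInstance(svc *ec2.EC2, subnetID string) error {
	input := &ec2.RunInstancesInput{
		ImageId:      aws.String("ami-093e132cf8ec45d77"),
		InstanceType: aws.String("t3.large"),
		MinCount:     aws.Int64(1),
		MaxCount:     aws.Int64(1),
		KeyName:      aws.String("cluster-api-provider-aws-sigs-k8s-io"),
		SecurityGroupIds: []*string{
			aws.String("sg-07c3eb751181ac0ab"),
			aws.String("sg-05683bb88ffba846b"),
			aws.String("sg-08f3c5c87413f9212"),
		},
	}
	if subnetID != "" {
		// Successful case: the subnet from the AWSCluster spec is passed explicitly.
		input.SubnetId = aws.String(subnetID)
	}
	// Failed case: subnetID is "" so SubnetId is omitted. EC2 then launches the
	// instance into a default subnet of the default VPC, which is in a different
	// network than the security groups above -> "belong to different networks".
	_, err := svc.RunInstances(input)
	return err
}

func main() {
	// Warning: actually running this launches a real EC2 instance.
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
	_ = runControlPlaneInstance(ec2.New(sess), "subnet-04069978047301fce")
}

In the failed request above, the reconciler reached this call with an empty subnet ID, so subnetId was omitted from the request entirely.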

Root Cause Analysis

This happens because of an already known issue: capi-controller-manager continuously patches the AWSCluster object when using ClusterClass (#6320).

The AWSCluster subnet spec oscillates between two states with ClusterClass:

  • After CAPA patches:
  network:
    ...
    subnets:
    - availabilityZone: us-west-1a
      cidrBlock: 10.0.0.0/24
      id: subnet-04069978047301fce
      isPublic: false
      routeTableId: rtb-06e5b16760a136a9b
      tags:
        Name: cluster-qmul89-subnet-private-us-west-1a
        kubernetes.io/cluster/cluster-qmul89: shared
        kubernetes.io/role/internal-elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
        sigs.k8s.io/cluster-api-provider-aws/role: private
    - availabilityZone: us-west-1a
      cidrBlock: 10.0.1.0/24
      id: subnet-057e208911a7100a9
      isPublic: true
      natGatewayId: nat-02b99bb47ed11bab0
      routeTableId: rtb-0c1181c7a47238747
      tags:
        Name: cluster-qmul89-subnet-public-us-west-1a
        kubernetes.io/cluster/cluster-qmul89: shared
        kubernetes.io/role/elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
        sigs.k8s.io/cluster-api-provider-aws/role: public
    - availabilityZone: us-west-1c
      cidrBlock: 10.0.2.0/24
      id: subnet-0d987044191d6131a
      isPublic: false
      routeTableId: rtb-0c19e5639177973ae
      tags:
        Name: cluster-qmul89-subnet-private-us-west-1c
        kubernetes.io/cluster/cluster-qmul89: shared
        kubernetes.io/role/internal-elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
        sigs.k8s.io/cluster-api-provider-aws/role: private
    - availabilityZone: us-west-1c
      cidrBlock: 10.0.3.0/24
      id: subnet-006c42a116e38379a
      isPublic: true
      natGatewayId: nat-018176214822b0de8
      routeTableId: rtb-03e0196d18896750b
      tags:
        Name: cluster-qmul89-subnet-public-us-west-1c
        kubernetes.io/cluster/cluster-qmul89: shared
        kubernetes.io/role/elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
        sigs.k8s.io/cluster-api-provider-aws/role: public
  • After CAPI patches:
  network:
    subnets:
    - availabilityZone: us-west-1a
      cidrBlock: 10.0.0.0/24
    - availabilityZone: us-west-1a
      cidrBlock: 10.0.1.0/24
    - availabilityZone: us-west-1c
      cidrBlock: 10.0.2.0/24
    - availabilityZone: us-west-1c
      cidrBlock: 10.0.3.0/24

Instance creation fails when the AWSCluster spec's subnets are in the second state, where the subnets are listed but have no IDs, so the subnet ID is empty here: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/pkg/cloud/services/ec2/instances.go#L340
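
To make the failure mode concrete, here is a simplified, hypothetical sketch of the subnet selection (the type and function names are made up for illustration, not CAPA's actual code): right after CAPI rewrites the spec, the subnet entries still exist but their IDs are empty, so the first matching subnet that gets picked carries an empty ID, and that empty ID is what flows into the RunInstances request.

// Simplified, hypothetical illustration of the failure mode; not CAPA's actual
// subnet-selection code.
package main

import "fmt"

// subnetSpec mirrors the relevant fields of CAPA's SubnetSpec.
type subnetSpec struct {
	ID               string
	CidrBlock        string
	AvailabilityZone string
	IsPublic         bool
}

// pickPrivateSubnet is a hypothetical stand-in for CAPA's subnet lookup: it
// returns the first private subnet's ID without checking whether the ID is set.
func pickPrivateSubnet(subnets []subnetSpec) string {
	for _, s := range subnets {
		if !s.IsPublic {
			return s.ID // empty string if CAPI has just rewritten the spec
		}
	}
	return ""
}

func main() {
	// State of AWSCluster.spec.network.subnets right after the CAPI patch:
	// the entries are present but carry no IDs.
	afterCAPIPatch := []subnetSpec{
		{CidrBlock: "10.0.0.0/24", AvailabilityZone: "us-west-1a"},
		{CidrBlock: "10.0.1.0/24", AvailabilityZone: "us-west-1a", IsPublic: true},
	}

	id := pickPrivateSubnet(afterCAPIPatch)
	fmt.Printf("selected subnet ID: %q\n", id) // "" -> RunInstances is sent without subnetId
}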

Fixes

While the long-term solution is waiting for the fix of https://github.com/kubernetes-sigs/cluster-api/issues/6320, we can improve CAPA's subnet-finding logic, which currently assumes subnets always have non-empty IDs (which has been the case until now).
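
For example (a sketch only, building on the hypothetical pickPrivateSubnet helper above, not the actual change): the instance-creation path could fail fast when the selected subnet has no ID, so the AWSMachine reconcile requeues instead of sending a RunInstances request that falls back to the default VPC.

// Hypothetical defensive check, continuing the sketch above; not the actual fix.
func subnetIDForInstance(subnets []subnetSpec) (string, error) {
	id := pickPrivateSubnet(subnets)
	if id == "" {
		// The AWSCluster spec is currently in the "CAPI just patched it" state;
		// returning an error makes the AWSMachine reconcile requeue instead of
		// launching an instance into the default VPC.
		return "", fmt.Errorf("no subnet with a non-empty ID available yet, requeueing")
	}
	return id, nil
}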

pydctw avatar Apr 13 '22 22:04 pydctw

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 12 '22 23:07 k8s-triage-robot

/remove-lifecycle stale

pydctw avatar Jul 12 '22 23:07 pydctw

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 10 '22 23:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Nov 10 '22 00:11 k8s-triage-robot

This should have been fixed with SSA support in CAPA.

pydctw avatar Nov 10 '22 00:11 pydctw