
Cluster creation fails with an error, security group and subnet for an instance belong to different networks

Open pydctw opened this issue 3 years ago • 8 comments

/kind bug

What steps did you take and what happened: Creating a cluster using a ClusterClass fails, and the log shows that instance creation failed with the error: failed to run instance: InvalidParameter: Security group sg-0b2785eae128cccad and subnet subnet-8b13d7d6 belong to different networks

E0408 17:14:02.788518       1 awsmachine_controller.go:497]  "msg"="unable to create instance" "error"="failed to create AWSMachine instance: failed to run instance: InvalidParameter: Security group sg-0b2785eae128cccad and subnet subnet-8b13d7d6 belong to different networks.\n\tstatus code: 400, request id: 3b28054f-c5e8-439c-bac3-0dda24431a27" 

AWSCluster

spec:
  network:
    subnets:
    - availabilityZone: us-west-2a
      cidrBlock: 10.0.0.0/24
      id: subnet-0176a425f63781f71
      isPublic: false
      routeTableId: rtb-08275750c99fb2f3a
      tags:
        Name: cluster-ew1b45-subnet-private-us-west-2a
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/internal-elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: private
    - availabilityZone: us-west-2a
      cidrBlock: 10.0.1.0/24
      id: subnet-0464f24a3d364523f
      isPublic: true
      natGatewayId: nat-0d185215367997610
      routeTableId: rtb-051ced5fd65ae6600
      tags:
        Name: cluster-ew1b45-subnet-public-us-west-2a
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: public
    - availabilityZone: us-west-2b
      cidrBlock: 10.0.2.0/24
      id: subnet-07023ae2d872062bf
      isPublic: false
      routeTableId: rtb-001434b0c17f5b0f4
      tags:
        Name: cluster-ew1b45-subnet-private-us-west-2b
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/internal-elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: private
    - availabilityZone: us-west-2b
      cidrBlock: 10.0.3.0/24
      id: subnet-00b92bd396eef0bf2
      isPublic: true
      natGatewayId: nat-0baa238a24de3b142
      routeTableId: rtb-0e3643d5f8b441ed9
      tags:
        Name: cluster-ew1b45-subnet-public-us-west-2b
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: public

AWSMachine

spec:
  ami: {}
  cloudInit:
    secureSecretsBackend: secrets-manager
  iamInstanceProfile: control-plane.cluster-api-provider-aws.sigs.k8s.io
  instanceID: i-0ed41df7645f74b06
  instanceType: t3.large
  providerID: aws:///us-west-2b/i-0ed41df7645f74b06
  sshKeyName: cluster-api-provider-aws-sigs-k8s-io

While sg-0b2785eae128cccad belongs to the CAPA-created VPC, subnet-8b13d7d6 belongs to the default VPC in the region. Note that subnet-8b13d7d6 is not referenced in the AWSCluster or AWSMachine spec.
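
For anyone verifying this, the mismatch can be confirmed by looking up which VPC each resource belongs to. Below is a minimal, hypothetical Go sketch (not part of CAPA) using the AWS SDK for Go v1; the resource IDs and region are the ones from the log and cluster above.

// Minimal sketch to confirm that the security group and subnet from the error
// message live in different VPCs. Assumes default AWS credentials; not CAPA code.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
	svc := ec2.New(sess)

	sgOut, err := svc.DescribeSecurityGroups(&ec2.DescribeSecurityGroupsInput{
		GroupIds: []*string{aws.String("sg-0b2785eae128cccad")},
	})
	if err != nil {
		panic(err)
	}

	subnetOut, err := svc.DescribeSubnets(&ec2.DescribeSubnetsInput{
		SubnetIds: []*string{aws.String("subnet-8b13d7d6")},
	})
	if err != nil {
		panic(err)
	}

	// In the failing case these two VPC IDs differ: the security group is in the
	// CAPA-created VPC, the subnet is in the region's default VPC.
	fmt.Println("security group VPC:", aws.StringValue(sgOut.SecurityGroups[0].VpcId))
	fmt.Println("subnet VPC:        ", aws.StringValue(subnetOut.Subnets[0].VpcId))
}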

What did you expect to happen: Cluster creation is successful.

Anything else you would like to add: The same issue was reported by a coworker using a different ClusterClass. They are using CAPA v1.2.0, while I am using the main branch.

Also, this issue doesn't happen every time. I've created clusters multiple times and saw the issue only a few times.

Environment:

  • Cluster-api-provider-aws version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

pydctw avatar Apr 08 '22 21:04 pydctw

/triage accepted /priority critical-urgent /milestone v1.5.1

sedefsavas avatar Apr 08 '22 22:04 sedefsavas

@sedefsavas: The provided milestone is not valid for this repository. Milestones in this repository: [Backlog, V1.5.1, v0.6.10, v0.7.4, v1.5.0, v1.x, v2.x]

Use /milestone clear to clear the milestone.

In response to this:

/triage accepted /priority critical-urgent /milestone v1.5.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 08 '22 22:04 k8s-ci-robot

AFAIK this issue has never been observed in e2e tests without ClusterClass, so it may be triggered by, or related to, the inner workings of ClusterClass.

If so, I will reduce the priority accordingly.

sedefsavas avatar Apr 08 '22 23:04 sedefsavas

This is the first time I've seen the error, so I agree that it is ClusterClass related.

pydctw avatar Apr 11 '22 15:04 pydctw

This was such a fascinating and difficult issue to debug.

Observations

  • The issue happens randomly. Instance creation can fail for the 1st, 2nd, or 3rd control plane machine.
  • Cluster creation is successful most of the time and fails with this issue only occasionally.
  • A cluster that failed an e2e test due to a timeout while waiting for a control plane eventually created an instance, and the cluster became ready.

Debugging

For a failed instance creation, below is the input sent to the AWS API.

{
    "instancesSet": {
      "items": [
        {
          "imageId": "ami-093e132cf8ec45d77",
          "minCount": 1,
          "maxCount": 1,
          "keyName": "cluster-api-provider-aws-sigs-k8s-io"
        }
      ]
    },
    "groupSet": {
      "items": [
        {
          "groupId": "sg-07c3eb751181ac0ab"
        },
        {
          "groupId": "sg-05683bb88ffba846b"
        },
        {
          "groupId": "sg-08f3c5c87413f9212"
        }
      ]
    },
    "userData": "<sensitiveDataRemoved>",
    "instanceType": "t3.large",
    "blockDeviceMapping": {},
    "monitoring": {
      "enabled": false
    },
    "disableApiTermination": false,
    "disableApiStop": false,
    "clientToken": "96DAC283-22A0-4195-A496-78DAA918244B",
    "iamInstanceProfile": {
      "name": "control-plane.cluster-api-provider-aws.sigs.k8s.io"
    },
    "tagSpecificationSet": {
      "items": [
        {
          "resourceType": "instance",
          "tags": [
            {
              "key": "MachineName",
              "value": "functional-test-multi-az-clusterclass-n3nuim/cluster-qmul89-v45rg-8xm24"
            },
            {
              "key": "Name",
              "value": "cluster-qmul89-control-plane-n9994-2lrrt"
            },
            {
              "key": "kubernetes.io/cluster/cluster-qmul89",
              "value": "owned"
            },
            {
              "key": "sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89",
              "value": "owned"
            },
            {
              "key": "sigs.k8s.io/cluster-api-provider-aws/role",
              "value": "control-plane"
            }
          ]
        }
      ]
    }
  }

Compare it with the input for a successful case:

{
    "instancesSet": {
      "items": [
        {
          "imageId": "ami-093e132cf8ec45d77",
          "minCount": 1,
          "maxCount": 1,
          "keyName": "cluster-api-provider-aws-sigs-k8s-io"
        }
      ]
    },
    "groupSet": {
      "items": [
        {
          "groupId": "sg-07c3eb751181ac0ab"
        },
        {
          "groupId": "sg-05683bb88ffba846b"
        },
        {
          "groupId": "sg-08f3c5c87413f9212"
        }
      ]
    },
    "userData": "<sensitiveDataRemoved>",
    "instanceType": "t3.large",
    "blockDeviceMapping": {},
    "monitoring": {
      "enabled": false
    },
    "subnetId": "subnet-04069978047301fce",
    "disableApiTermination": false,
    "disableApiStop": false,
    "clientToken": "0DD45959-4F4F-442C-9C8A-24D6B49239DA",
    "iamInstanceProfile": {
      "name": "control-plane.cluster-api-provider-aws.sigs.k8s.io"
    },
    "tagSpecificationSet": {
      "items": [
        {
          "resourceType": "instance",
          "tags": [
            {
              "key": "MachineName",
              "value": "functional-test-multi-az-clusterclass-n3nuim/cluster-qmul89-v45rg-8xm24"
            },
            {
              "key": "Name",
              "value": "cluster-qmul89-control-plane-n9994-2lrrt"
            },
            {
              "key": "kubernetes.io/cluster/cluster-qmul89",
              "value": "owned"
            },
            {
              "key": "sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89",
              "value": "owned"
            },
            {
              "key": "sigs.k8s.io/cluster-api-provider-aws/role",
              "value": "control-plane"
            }
          ]
        }
      ]
    }
  }

The difference is that the failed case doesn't include subnetId, which makes AWS pick a default subnet for the instance, in this case a subnet in the default VPC.
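
That matches EC2's documented behavior: when RunInstances is called without a subnetId, the instance is launched into a default subnet of the account's default VPC, while the explicitly listed security groups still belong to the CAPA-created VPC, which produces the "belong to different networks" error. A hypothetical Go sketch of the call (AWS SDK for Go v1, values copied from the request dumps above; this is not CAPA's code):

package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// runControlPlaneInstance is a hypothetical helper showing why the missing
// subnetId matters; the IDs are the ones from the request dumps above.
func runControlPlaneInstance(svc *ec2.EC2, subnetID string) error {
	input := &ec2.RunInstancesInput{
		ImageId:      aws.String("ami-093e132cf8ec45d77"),
		InstanceType: aws.String("t3.large"),
		MinCount:     aws.Int64(1),
		MaxCount:     aws.Int64(1),
		KeyName:      aws.String("cluster-api-provider-aws-sigs-k8s-io"),
		SecurityGroupIds: []*string{
			aws.String("sg-07c3eb751181ac0ab"),
			aws.String("sg-05683bb88ffba846b"),
			aws.String("sg-08f3c5c87413f9212"),
		},
	}
	if subnetID != "" {
		// Successful case: the subnet from the AWSCluster spec is passed explicitly.
		input.SubnetId = aws.String(subnetID)
	}
	// Failed case: subnetID is "" so SubnetId is omitted. EC2 then launches the
	// instance into a default subnet of the default VPC, which is in a different
	// network than the security groups above -> "belong to different networks".
	_, err := svc.RunInstances(input)
	return err
}

func main() {
	// Warning: actually running this launches a real EC2 instance.
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
	_ = runControlPlaneInstance(ec2.New(sess), "subnet-04069978047301fce")
}

In the failed request above, the reconciler reached this call with an empty subnet ID, so subnetId was omitted from the request entirely.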

Root Cause Analysis

This happens because of an already known issue: capi-controller-manager continuously patches the AWSCluster object when using ClusterClass (#6320).

The AWSCluster subnet spec oscillates between two states with ClusterClass:

  • After CAPA patches:
  network:
    ...
    subnets:
    - availabilityZone: us-west-1a
      cidrBlock: 10.0.0.0/24
      id: subnet-04069978047301fce
      isPublic: false
      routeTableId: rtb-06e5b16760a136a9b
      tags:
        Name: cluster-qmul89-subnet-private-us-west-1a
        kubernetes.io/cluster/cluster-qmul89: shared
        kubernetes.io/role/internal-elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
        sigs.k8s.io/cluster-api-provider-aws/role: private
    - availabilityZone: us-west-1a
      cidrBlock: 10.0.1.0/24
      id: subnet-057e208911a7100a9
      isPublic: true
      natGatewayId: nat-02b99bb47ed11bab0
      routeTableId: rtb-0c1181c7a47238747
      tags:
        Name: cluster-qmul89-subnet-public-us-west-1a
        kubernetes.io/cluster/cluster-qmul89: shared
        kubernetes.io/role/elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
        sigs.k8s.io/cluster-api-provider-aws/role: public
    - availabilityZone: us-west-1c
      cidrBlock: 10.0.2.0/24
      id: subnet-0d987044191d6131a
      isPublic: false
      routeTableId: rtb-0c19e5639177973ae
      tags:
        Name: cluster-qmul89-subnet-private-us-west-1c
        kubernetes.io/cluster/cluster-qmul89: shared
        kubernetes.io/role/internal-elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
        sigs.k8s.io/cluster-api-provider-aws/role: private
    - availabilityZone: us-west-1c
      cidrBlock: 10.0.3.0/24
      id: subnet-006c42a116e38379a
      isPublic: true
      natGatewayId: nat-018176214822b0de8
      routeTableId: rtb-03e0196d18896750b
      tags:
        Name: cluster-qmul89-subnet-public-us-west-1c
        kubernetes.io/cluster/cluster-qmul89: shared
        kubernetes.io/role/elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
        sigs.k8s.io/cluster-api-provider-aws/role: public
  • After CAPI patches:
  network:
    subnets:
    - availabilityZone: us-west-1a
      cidrBlock: 10.0.0.0/24
    - availabilityZone: us-west-1a
      cidrBlock: 10.0.1.0/24
    - availabilityZone: us-west-1c
      cidrBlock: 10.0.2.0/24
    - availabilityZone: us-west-1c
      cidrBlock: 10.0.3.0/24

Instance creation fails when the AWSCluster spec's subnets are in the second state, where the subnets are listed but have no IDs, so the subnet ID is empty here: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/pkg/cloud/services/ec2/instances.go#L340
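
To make the failure mode concrete, here is a simplified, hypothetical sketch of the subnet selection (the type and function names are made up for illustration, not CAPA's actual code): right after CAPI rewrites the spec, the subnet entries still exist but their IDs are empty, so the first matching subnet that gets picked carries an empty ID, and that empty ID is what flows into the RunInstances request.

// Simplified, hypothetical illustration of the failure mode; not CAPA's actual
// subnet-selection code.
package main

import "fmt"

// subnetSpec mirrors the relevant fields of CAPA's SubnetSpec.
type subnetSpec struct {
	ID               string
	CidrBlock        string
	AvailabilityZone string
	IsPublic         bool
}

// pickPrivateSubnet is a hypothetical stand-in for CAPA's subnet lookup: it
// returns the first private subnet's ID without checking whether the ID is set.
func pickPrivateSubnet(subnets []subnetSpec) string {
	for _, s := range subnets {
		if !s.IsPublic {
			return s.ID // empty string if CAPI has just rewritten the spec
		}
	}
	return ""
}

func main() {
	// State of AWSCluster.spec.network.subnets right after the CAPI patch:
	// the entries are present but carry no IDs.
	afterCAPIPatch := []subnetSpec{
		{CidrBlock: "10.0.0.0/24", AvailabilityZone: "us-west-1a"},
		{CidrBlock: "10.0.1.0/24", AvailabilityZone: "us-west-1a", IsPublic: true},
	}

	id := pickPrivateSubnet(afterCAPIPatch)
	fmt.Printf("selected subnet ID: %q\n", id) // "" -> RunInstances is sent without subnetId
}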

Fixes

While the long-term solution is waiting for the fix of https://github.com/kubernetes-sigs/cluster-api/issues/6320, we can improve CAPA's subnet-finding logic, which currently assumes subnets always have non-empty IDs (which has been the case until now).
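
For example (a sketch only, building on the hypothetical pickPrivateSubnet helper above, not the actual change): the instance-creation path could fail fast when the selected subnet has no ID, so the AWSMachine reconcile requeues instead of sending a RunInstances request that falls back to the default VPC.

// Hypothetical defensive check, continuing the sketch above; not the actual fix.
func subnetIDForInstance(subnets []subnetSpec) (string, error) {
	id := pickPrivateSubnet(subnets)
	if id == "" {
		// The AWSCluster spec is currently in the "CAPI just patched it" state;
		// returning an error makes the AWSMachine reconcile requeue instead of
		// launching an instance into the default VPC.
		return "", fmt.Errorf("no subnet with a non-empty ID available yet, requeueing")
	}
	return id, nil
}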

pydctw avatar Apr 13 '22 22:04 pydctw

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 12 '22 23:07 k8s-triage-robot

/remove-lifecycle stale

pydctw avatar Jul 12 '22 23:07 pydctw

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 10 '22 23:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Nov 10 '22 00:11 k8s-triage-robot

This should have been fixed with SSA support in CAPA.

pydctw avatar Nov 10 '22 00:11 pydctw