cluster-api-provider-aws
Cluster creation fails with an error, security group and subnet for an instance belong to different networks
/kind bug
What steps did you take and what happened:
Creating a cluster using a ClusterClass fails, and the log shows that instance creation failed with the error `failed to run instance: InvalidParameter: Security group sg-0b2785eae128cccad and subnet subnet-8b13d7d6 belong to different networks`:

```
E0408 17:14:02.788518 1 awsmachine_controller.go:497] "msg"="unable to create instance" "error"="failed to create AWSMachine instance: failed to run instance: InvalidParameter: Security group sg-0b2785eae128cccad and subnet subnet-8b13d7d6 belong to different networks.\n\tstatus code: 400, request id: 3b28054f-c5e8-439c-bac3-0dda24431a27"
```
AWSCluster:

```yaml
spec:
  network:
    subnets:
    - availabilityZone: us-west-2a
      cidrBlock: 10.0.0.0/24
      id: subnet-0176a425f63781f71
      isPublic: false
      routeTableId: rtb-08275750c99fb2f3a
      tags:
        Name: cluster-ew1b45-subnet-private-us-west-2a
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/internal-elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: private
    - availabilityZone: us-west-2a
      cidrBlock: 10.0.1.0/24
      id: subnet-0464f24a3d364523f
      isPublic: true
      natGatewayId: nat-0d185215367997610
      routeTableId: rtb-051ced5fd65ae6600
      tags:
        Name: cluster-ew1b45-subnet-public-us-west-2a
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: public
    - availabilityZone: us-west-2b
      cidrBlock: 10.0.2.0/24
      id: subnet-07023ae2d872062bf
      isPublic: false
      routeTableId: rtb-001434b0c17f5b0f4
      tags:
        Name: cluster-ew1b45-subnet-private-us-west-2b
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/internal-elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: private
    - availabilityZone: us-west-2b
      cidrBlock: 10.0.3.0/24
      id: subnet-00b92bd396eef0bf2
      isPublic: true
      natGatewayId: nat-0baa238a24de3b142
      routeTableId: rtb-0e3643d5f8b441ed9
      tags:
        Name: cluster-ew1b45-subnet-public-us-west-2b
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: public
```
AWSMachine:

```yaml
spec:
  ami: {}
  cloudInit:
    secureSecretsBackend: secrets-manager
  iamInstanceProfile: control-plane.cluster-api-provider-aws.sigs.k8s.io
  instanceID: i-0ed41df7645f74b06
  instanceType: t3.large
  providerID: aws:///us-west-2b/i-0ed41df7645f74b06
  sshKeyName: cluster-api-provider-aws-sigs-k8s-io
```
While sg-0b2785eae128cccad belongs to a CAPA-created VPC, subnet-8b13d7d6 belongs to the default VPC in the region. Note that subnet-8b13d7d6 is not referenced in the AWSCluster or AWSMachine spec.
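The mismatch can be confirmed directly against the EC2 API by comparing the VPC of the security group with the VPC of the subnet. A minimal sketch using aws-sdk-go (not part of CAPA; the IDs are the ones from the error above):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))

	// VPC of the security group from the error message.
	sgOut, err := svc.DescribeSecurityGroups(&ec2.DescribeSecurityGroupsInput{
		GroupIds: []*string{aws.String("sg-0b2785eae128cccad")},
	})
	if err != nil {
		log.Fatal(err)
	}

	// VPC of the subnet EC2 actually used for the instance.
	snOut, err := svc.DescribeSubnets(&ec2.DescribeSubnetsInput{
		SubnetIds: []*string{aws.String("subnet-8b13d7d6")},
	})
	if err != nil {
		log.Fatal(err)
	}

	// The two VPC IDs differ: the security group lives in the
	// CAPA-created VPC, the subnet in the region's default VPC.
	fmt.Println("security group VPC:", aws.StringValue(sgOut.SecurityGroups[0].VpcId))
	fmt.Println("subnet VPC:", aws.StringValue(snOut.Subnets[0].VpcId))
}
```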
What did you expect to happen: Cluster creation succeeds.
Anything else you would like to add: The same issue was reported by a coworker using a different ClusterClass. While he is using CAPA v1.2.0, I am using the main branch.
Also, this issue doesn't happen every time. I've created clusters multiple times and saw the issue only a few times.
Environment:
- Cluster-api-provider-aws version:
- Kubernetes version: (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):
/triage accepted
/priority critical-urgent
/milestone v1.5.1
@sedefsavas: The provided milestone is not valid for this repository. Milestones in this repository: [Backlog, V1.5.1, v0.6.10, v0.7.4, v1.5.0, v1.x, v2.x]
Use /milestone clear to clear the milestone.
In response to this:
/triage accepted
/priority critical-urgent
/milestone v1.5.1
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
AFAIK this issue has never been observed with e2e tests without ClusterClass, so it may be triggered by or related to the inner workings of ClusterClass.
If so, I will reduce the priority accordingly.
This is the first time I've seen the error, and hence I agree that it is ClusterClass-related.
This was such a fascinating and difficult issue to debug.
Observations
- The issue happens randomly: instance creation can fail at the 1st, 2nd, or 3rd control plane machine creation.
- Cluster creation succeeds most of the time and fails with this issue only occasionally.
- A cluster that failed an e2e test due to a timeout while waiting for a control plane eventually created an instance, and the cluster became ready.
Debugging
For a failed instance creation, below is the input sent to the AWS API:
```json
{
  "instancesSet": {
    "items": [
      {
        "imageId": "ami-093e132cf8ec45d77",
        "minCount": 1,
        "maxCount": 1,
        "keyName": "cluster-api-provider-aws-sigs-k8s-io"
      }
    ]
  },
  "groupSet": {
    "items": [
      {
        "groupId": "sg-07c3eb751181ac0ab"
      },
      {
        "groupId": "sg-05683bb88ffba846b"
      },
      {
        "groupId": "sg-08f3c5c87413f9212"
      }
    ]
  },
  "userData": "<sensitiveDataRemoved>",
  "instanceType": "t3.large",
  "blockDeviceMapping": {},
  "monitoring": {
    "enabled": false
  },
  "disableApiTermination": false,
  "disableApiStop": false,
  "clientToken": "96DAC283-22A0-4195-A496-78DAA918244B",
  "iamInstanceProfile": {
    "name": "control-plane.cluster-api-provider-aws.sigs.k8s.io"
  },
  "tagSpecificationSet": {
    "items": [
      {
        "resourceType": "instance",
        "tags": [
          {
            "key": "MachineName",
            "value": "functional-test-multi-az-clusterclass-n3nuim/cluster-qmul89-v45rg-8xm24"
          },
          {
            "key": "Name",
            "value": "cluster-qmul89-control-plane-n9994-2lrrt"
          },
          {
            "key": "kubernetes.io/cluster/cluster-qmul89",
            "value": "owned"
          },
          {
            "key": "sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89",
            "value": "owned"
          },
          {
            "key": "sigs.k8s.io/cluster-api-provider-aws/role",
            "value": "control-plane"
          }
        ]
      }
    ]
  }
}
```
Compare it with the input for a successful case:
```json
{
  "instancesSet": {
    "items": [
      {
        "imageId": "ami-093e132cf8ec45d77",
        "minCount": 1,
        "maxCount": 1,
        "keyName": "cluster-api-provider-aws-sigs-k8s-io"
      }
    ]
  },
  "groupSet": {
    "items": [
      {
        "groupId": "sg-07c3eb751181ac0ab"
      },
      {
        "groupId": "sg-05683bb88ffba846b"
      },
      {
        "groupId": "sg-08f3c5c87413f9212"
      }
    ]
  },
  "userData": "<sensitiveDataRemoved>",
  "instanceType": "t3.large",
  "blockDeviceMapping": {},
  "monitoring": {
    "enabled": false
  },
  "subnetId": "subnet-04069978047301fce",
  "disableApiTermination": false,
  "disableApiStop": false,
  "clientToken": "0DD45959-4F4F-442C-9C8A-24D6B49239DA",
  "iamInstanceProfile": {
    "name": "control-plane.cluster-api-provider-aws.sigs.k8s.io"
  },
  "tagSpecificationSet": {
    "items": [
      {
        "resourceType": "instance",
        "tags": [
          {
            "key": "MachineName",
            "value": "functional-test-multi-az-clusterclass-n3nuim/cluster-qmul89-v45rg-8xm24"
          },
          {
            "key": "Name",
            "value": "cluster-qmul89-control-plane-n9994-2lrrt"
          },
          {
            "key": "kubernetes.io/cluster/cluster-qmul89",
            "value": "owned"
          },
          {
            "key": "sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89",
            "value": "owned"
          },
          {
            "key": "sigs.k8s.io/cluster-api-provider-aws/role",
            "value": "control-plane"
          }
        ]
      }
    ]
  }
}
```
The difference is that the failed case doesn't have `subnetId`, which makes AWS pick a subnet on its own, in this case a subnet in the default VPC.
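To make the mechanism concrete, here is a hedged aws-sdk-go sketch of the RunInstances call above (values taken from the request logs, not CAPA's actual code). With `SubnetId` unset, EC2 falls back to a default subnet in the default VPC, while the security group IDs still point at the CAPA-created VPC:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))

	input := &ec2.RunInstancesInput{
		ImageId:      aws.String("ami-093e132cf8ec45d77"),
		InstanceType: aws.String("t3.large"),
		MinCount:     aws.Int64(1),
		MaxCount:     aws.Int64(1),
		KeyName:      aws.String("cluster-api-provider-aws-sigs-k8s-io"),
		SecurityGroupIds: []*string{
			aws.String("sg-07c3eb751181ac0ab"),
			aws.String("sg-05683bb88ffba846b"),
			aws.String("sg-08f3c5c87413f9212"),
		},
		// Present in the successful request, missing in the failed one.
		// Without it, EC2 chooses a default subnet in the default VPC,
		// which is a different network than the security groups' VPC:
		// SubnetId: aws.String("subnet-04069978047301fce"),
	}
	if _, err := svc.RunInstances(input); err != nil {
		// e.g. InvalidParameter: Security group ... and subnet ...
		// belong to different networks
		log.Fatal(err)
	}
}
```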
Root Cause Analysis
This happens because of an already known issue: capi-controller-manager continuously patches the AWSCluster object when using ClusterClass (#6320).
The AWSCluster subnet spec oscillates between two states with ClusterClass:
- After CAPA patched:

```yaml
network:
  ...
  subnets:
  - availabilityZone: us-west-1a
    cidrBlock: 10.0.0.0/24
    id: subnet-04069978047301fce
    isPublic: false
    routeTableId: rtb-06e5b16760a136a9b
    tags:
      Name: cluster-qmul89-subnet-private-us-west-1a
      kubernetes.io/cluster/cluster-qmul89: shared
      kubernetes.io/role/internal-elb: "1"
      sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
      sigs.k8s.io/cluster-api-provider-aws/role: private
  - availabilityZone: us-west-1a
    cidrBlock: 10.0.1.0/24
    id: subnet-057e208911a7100a9
    isPublic: true
    natGatewayId: nat-02b99bb47ed11bab0
    routeTableId: rtb-0c1181c7a47238747
    tags:
      Name: cluster-qmul89-subnet-public-us-west-1a
      kubernetes.io/cluster/cluster-qmul89: shared
      kubernetes.io/role/elb: "1"
      sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
      sigs.k8s.io/cluster-api-provider-aws/role: public
  - availabilityZone: us-west-1c
    cidrBlock: 10.0.2.0/24
    id: subnet-0d987044191d6131a
    isPublic: false
    routeTableId: rtb-0c19e5639177973ae
    tags:
      Name: cluster-qmul89-subnet-private-us-west-1c
      kubernetes.io/cluster/cluster-qmul89: shared
      kubernetes.io/role/internal-elb: "1"
      sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
      sigs.k8s.io/cluster-api-provider-aws/role: private
  - availabilityZone: us-west-1c
    cidrBlock: 10.0.3.0/24
    id: subnet-006c42a116e38379a
    isPublic: true
    natGatewayId: nat-018176214822b0de8
    routeTableId: rtb-03e0196d18896750b
    tags:
      Name: cluster-qmul89-subnet-public-us-west-1c
      kubernetes.io/cluster/cluster-qmul89: shared
      kubernetes.io/role/elb: "1"
      sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
      sigs.k8s.io/cluster-api-provider-aws/role: public
```
- After CAPI patched:

```yaml
network:
  subnets:
  - availabilityZone: us-west-1a
    cidrBlock: 10.0.0.0/24
  - availabilityZone: us-west-1a
    cidrBlock: 10.0.1.0/24
  - availabilityZone: us-west-1c
    cidrBlock: 10.0.2.0/24
  - availabilityZone: us-west-1c
    cidrBlock: 10.0.3.0/24
```
Instance creation fails when the AWSCluster spec's subnets are in the second state, where subnets exist but have no IDs. As a result, the subnet ID is empty here: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/pkg/cloud/services/ec2/instances.go#L340
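To illustrate why the non-empty subnet list masks the problem, here is a simplified sketch of this failure mode (hypothetical types and function names, not the actual code linked above):

```go
package main

import "fmt"

// SubnetSpec is a pared-down stand-in for CAPA's subnet spec type.
type SubnetSpec struct {
	ID               string
	AvailabilityZone string
}

// pickSubnet mimics selection logic that filters by availability zone
// and returns the first match's ID. In the CAPI-patched state the list
// is non-empty, so no "no subnets available" error fires, but the
// returned ID is "" and the RunInstances input ends up without subnetId.
func pickSubnet(subnets []SubnetSpec, az string) (string, error) {
	for _, sn := range subnets {
		if sn.AvailabilityZone == az {
			return sn.ID, nil
		}
	}
	return "", fmt.Errorf("no subnets available in availability zone %q", az)
}

func main() {
	// Subnet list as it looks right after CAPI patches the spec.
	patched := []SubnetSpec{
		{AvailabilityZone: "us-west-1a"}, // cidrBlock only, no ID
		{AvailabilityZone: "us-west-1c"},
	}
	id, err := pickSubnet(patched, "us-west-1a")
	fmt.Printf("id=%q err=%v\n", id, err) // id="" err=<nil>
}
```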
Fixes
While the long-term solution is to wait for a fix for https://github.com/kubernetes-sigs/cluster-api/issues/6320, we can improve CAPA's subnet-finding logic, which currently assumes subnets have non-empty IDs (which has been the case until now). A sketch of such a guard follows.
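A minimal sketch of the guard, reusing the hypothetical `SubnetSpec` type from the sketch above (this is an assumption about the shape of the fix, not the merged change):

```go
// pickSubnetChecked adds the missing guard: a matching subnet whose ID
// has not been reconciled yet is treated as an error, so the controller
// requeues instead of sending an empty subnetId to EC2 and letting it
// fall back to a default-VPC subnet.
func pickSubnetChecked(subnets []SubnetSpec, az string) (string, error) {
	for _, sn := range subnets {
		if sn.AvailabilityZone != az {
			continue
		}
		if sn.ID == "" {
			// Spec is mid-oscillation: CAPI has stripped the IDs and
			// CAPA has not patched them back yet.
			return "", fmt.Errorf("subnet in %q has no ID yet, requeuing", az)
		}
		return sn.ID, nil
	}
	return "", fmt.Errorf("no subnets available in availability zone %q", az)
}
```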
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
This should have been fixed with Server-Side Apply (SSA) support in CAPA.