terraform-aws-eks
terraform-aws-eks copied to clipboard
Cluster Security Group Change Issue
Description
It appears I've hit this bug: https://github.com/terraform-aws-modules/terraform-aws-eks/issues/2469
But specifically I am already over the v18/v19 module upgrades. The only change we've made is that our VPC has added a CIDR block, which seems to be breaking the module with:
Plan:
# module.cluster.module.eks.aws_security_group_rule.cluster["vpc"] must be replaced
+/- resource "aws_security_group_rule" "cluster" {
~ cidr_blocks = [ # forces replacement
# (2 unchanged elements hidden)
"10.60.7.0/24",
+ "10.40.0.0/16",
]
~ id = "sgrule-422031136" -> (known after apply)
- prefix_list_ids = [] -> null
+ security_group_rule_id = (known after apply)
+ source_security_group_id = (known after apply)
# (7 unchanged attributes hidden)
}
Apply:
module.cluster.module.eks.aws_security_group_rule.cluster["vpc"]: Creating...
β·
β Error: [WARN] A duplicate Security Group rule was found on (sg-07c6df08897a15c50). This may be
β a side effect of a now-fixed Terraform issue causing two security groups with
β identical attributes but different source_security_group_ids to overwrite each
β other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
β information and instructions for recovery. Error: InvalidPermission.Duplicate: the specified rule "peer: 10.35.192.0/18, TCP, from port: 443, to port: 443, ALLOW" already exists
β status code: 400, request id: 299a0a30-3cd1-47b9-9192-26f9f0cfda35
β
β with module.cluster.module.eks.aws_security_group_rule.cluster["vpc"],
β on .terraform/modules/cluster.eks/main.tf line 295, in resource "aws_security_group_rule" "cluster":
β 295: resource "aws_security_group_rule" "cluster" {
β
So I am somewhat concerned this is not an issue specific to the module upgrades but a bug/limitation in the module.
If your request is for a new feature, please use the Feature request template.
- [x] β I have searched the open/closed issues and my issue is not listed.
Versions
-
Module version [Required]: 20.1.1
-
Terraform version: v1.3.6
-
Provider version(s): v5.35.0
Reproduction Code
Steps to reproduce the behavior:
Modify the incoming VPC to add another CIDR block:
cluster_security_group_additional_rules = merge(
{
# Default Private Control Plane Access
vpc = {
description = "Allow VPC Network Access"
protocol = "tcp"
from_port = 443
to_port = 443
cidr_blocks = data.aws_vpc.vpc.cidr_block_associations[*].cidr_block
type = "ingress"
},
private_network_access = {
description = "Allow Private Network Access"
protocol = "tcp"
from_port = 443
to_port = 443
cidr_blocks = ["10.0.0.0/8"]
type = "ingress"
}
}, var.cluster_security_group_additional_rules)
Is the issue potentially that we're using a computed value as the input to this variable in the module?
Minimal re-pro would probably be to set this cluster_security_group_additional_rules and then modify it or provide a computed value that changes/forces modification.
Expected behavior
Terraform updates the security group rules
Actual behavior
Terraform errors out on an existing cluster
Additional context
I've been seeing this error a lot. All the other issues that I've seen with this are closed without solution. Theres the workaround of changing the IP ranges to force a change but thats not pratical. Another hacky way I found is to create a variable and append it to the key of the security group rules. For instance, in your case would be something like:
variable "security_group_update_trigger" {
description = "Trigger updates without getting duplicate security group rule error"
type = bool
default = false
}
cluster_security_group_additional_rules = merge(
{
# Default Private Control Plane Access
"vpc_${var.security_group_update_trigger}" = {
description = "Allow VPC Network Access"
protocol = "tcp"
from_port = 443
to_port = 443
cidr_blocks = data.aws_vpc.vpc.cidr_block_associations[*].cidr_block
type = "ingress"
},
"private_network_access_${var.security_group_update_trigger}" = {
description = "Allow Private Network Access"
protocol = "tcp"
from_port = 443
to_port = 443
cidr_blocks = ["10.0.0.0/8"]
type = "ingress"
}
}, var.cluster_security_group_additional_rules)
But if you have a huge list of CIDRs like in my case, its bad since you dont have the easy tracking of changes when adding/removing a single cidr, because it will show as adding/removing all.
I hope we can have a proper fix for this issue.
I don't see how these relate to the module - this looks like you could easily reproduce this on just simple security group rule resources.
I guess I had opened this because it was not clear to me why the aws_security_group_rule was triggering a replace as opposed to just modifying it. It seemed like it was a side effect of however its been implemented in the module.
its being replaced due to changes that it has identified here:
~ cidr_blocks = [ # forces replacement
# (2 unchanged elements hidden)
"10.60.7.0/24",
+ "10.40.0.0/16",
]
And I suspect the duplicate rules is being thrown because both 10.60.7.0/24 and 10.40.0.0/16 are already captured within the rule
private_network_access = {
description = "Allow Private Network Access"
protocol = "tcp"
from_port = 443
to_port = 443
cidr_blocks = ["10.0.0.0/8"]
type = "ingress"
}
I can confirm the second isn't an issue, we had multiple overlapping rules already in place within the rule that do overlap with 10.0.0.0/8 and should have failed if that was an EC2 SG limitation.
I was also able to manually add 10.40.0.0/16 within the EC2 UI without error, its just Terraform failing because it is attempting to create_before_destroy the aws_security_group_rule and the rule it attempts to create is already on the SG.
To get around this issue, since I already had that overlapping 10.0.0.0/8 rule defined in Terraform I actually just deleted all the other more specific rules Terraform was trying to append manually, removed them from state, then re-applied Terraform successfully.
I'd suspected something about how the variables are getting passed into the aws_security_group_rule resource was triggering the create_before_destroy: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/main.tf#L296-L299
But wasn't sure how to investigate further. Also wonder if this even newer resource might avoid this as well, especially since within the EC2 console, each CIDR block we're adding has its own unique security group rule ID already: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/vpc_security_group_ingress_rule https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/vpc_security_group_egress_rule
the newer individual rules (cannot pass a list of CIDRs, etc.) are recommended but, that comes at the cost of disruption to users. we have started to use those where possible, but we have to identify the right time to introduce those since it will be disruptive for users
This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days
This issue was automatically closed because of stale in 10 days
I'm going to lock this issue because it has been closed for 30 days β³. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.