terraform-provider-aws icon indicating copy to clipboard operation
terraform-provider-aws copied to clipboard

Terraform sending incorrect parameters to AWS API when revoking security group rules

Open ghost opened this issue 6 years ago • 8 comments

This issue was originally opened by @bstrohmeyer as hashicorp/terraform#21437. It was migrated here as a result of the provider split. The original body of the issue is below.


Terraform Version

Terraform v0.12.0
* provider.aws: version = "~> 2.12"

Terraform Configuration Files

module "sg_common" {
  source      = "./modules/security-group"
  name        = "sg_common"
  description = "Common access security group"
  vpc         = "${data.terraform_remote_state.vpc.outputs.vpc}"

  rules = [
    {
      type        = "ingress"
      from_port   = -1
      to_port     = -1
      protocol    = "icmp"
      description = "Allow ICMP"
      cidr        = ["0.0.0.0/0"]
    },
    {
      type        = "ingress"
      from_port   = 22
      to_port     = 22
      protocol    = "tcp"
      description = "SSH from Bastion"
      source_sg   = module.sg_sc-bastion.id
    },
    {
      type        = "ingress"
      from_port   = 22
      to_port     = 22
      protocol    = "tcp"
      description = "SSH from Gitlab Runners"
      cidr        = ["10.10.10.0/24"]
    },
    {
      type        = "egress"
      from_port   = 0
      to_port     = 0
      protocol    = "-1"
      description = "Outbound Traffic"
      cidr        = ["0.0.0.0/0"]
    },
  ]
}

subnets module:

resource "aws_subnet" "public" {
  vpc_id            = "${var.vpc["vpc_id"]}"
  cidr_block        = "${var.public.subnet_cidr}"
  availability_zone = "${var.subnet_az}"

  tags = {
    Name = "${var.public.subnet_name}"
  }
}

resource "aws_route_table" "public_route_table" {
  vpc_id = "${var.vpc["vpc_id"]}"

  tags = {
    Name = "public-${var.subnet_az}"
  }
}

resource "aws_route_table" "private_route_table" {
  vpc_id = "${var.vpc["vpc_id"]}"

  tags = {
    Name = "private-${var.subnet_az}"
  }
}

resource "aws_route" "igw_route" {
  route_table_id         = "${aws_route_table.public_route_table.id}"
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = "${var.vpc["igw_id"]}"
}

resource "aws_route_table_association" "rt_assn" {
  subnet_id      = "${aws_subnet.public.id}"
  route_table_id = "${aws_route_table.public_route_table.id}"
}

resource "aws_eip" "nat" {
  vpc      = true
}

resource "aws_nat_gateway" "natgw" {
  allocation_id = "${aws_eip.nat.id}"
  subnet_id     = "${aws_subnet.public.id}"

  tags = {
    Name = "natgw-${var.subnet_az}"
  }
}

resource "aws_route" "nat_route" {
  route_table_id         = "${aws_route_table.private_route_table.id}"
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id             = "${aws_nat_gateway.natgw.id}"
}

resource "aws_subnet" "private" {
  count             = length(var.private)
  vpc_id            = var.vpc["vpc_id"]
  cidr_block        = lookup(var.private[count.index], "subnet_cidr")
  availability_zone = var.subnet_az

  tags = {
    Name = lookup(var.private[count.index], "subnet_name")
  }
}

resource "aws_route_table_association" "rt_assn_private" {
  count          = length(var.private)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private_route_table.id
}

Debug Output

Crash Output

Expected Behavior

Added new rule to security group:

    {
      type        = "ingress"
      from_port   = 22
      to_port     = 22
      protocol    = "tcp"
      description = "SSH from Gitlab Runners"
      cidr        = ["10.10.10.0/24"]
    },

Expected Terraform to add this rule to the existing security group.

Actual Behavior

This rule was added in the middle of the list, which then due to the design of the module would require terraform to delete the Egress rule as the list order was upset. This is expected and is reflected in the terraform plan:

 # module.sg_common.aws_security_group_rule.source_cidr[2] must be replaced
-/+ resource "aws_security_group_rule" "source_cidr" {
      ~ cidr_blocks              = [ # forces replacement
          - "0.0.0.0/0",
          + "10.10.10.0/24",
        ]
      ~ description              = "Outbound Traffic" -> "SSH from Gitlab Runners"
      ~ from_port                = 0 -> 22 # forces replacement
      ~ id                       = "sgrule-3520689800" -> (known after apply)
      - ipv6_cidr_blocks         = [] -> null
      - prefix_list_ids          = [] -> null
      ~ protocol                 = "-1" -> "tcp" # forces replacement
        security_group_id        = "sg-09976fde26e3db3ea"
        self                     = false
      + source_security_group_id = (known after apply)
      ~ to_port                  = 0 -> 22 # forces replacement
      ~ type                     = "egress" -> "ingress" # forces replacement
    }

  # module.sg_common.aws_security_group_rule.source_cidr[3] will be created
  + resource "aws_security_group_rule" "source_cidr" {
      + cidr_blocks              = [
          + "0.0.0.0/0",
        ]
      + description              = "Outbound Traffic"
      + from_port                = 0
      + id                       = (known after apply)
      + protocol                 = "-1"
      + security_group_id        = "sg-09976fde26e3db3ea"
      + self                     = false
      + source_security_group_id = (known after apply)
      + to_port                  = 0
      + type                     = "egress"
    }

There was an error on apply, however:

Error: Error revoking security group sg-09976fde26e3db3ea rules: InvalidPermission.NotFound: The specified rule does not exist in this security group.
	status code: 400, request id: b0bba002-bbd7-4118-b47e-7822328109fa


Error: [WARN] A duplicate Security Group rule was found on (sg-09976fde26e3db3ea). This may be
a side effect of a now-fixed Terraform issue causing two security groups with
identical attributes but different source_security_group_ids to overwrite each
other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
information and instructions for recovery. Error message: the specified rule "peer: 0.0.0.0/0, ALL, ALLOW" already exists

  on .terraform/modules/sg_common/modules/security-group/main.tf line 12, in resource "aws_security_group_rule" "source_cidr":
  12: resource "aws_security_group_rule" "source_cidr" {

Upon investigation, it appears that terraform is sending the wrong parameters in the RevokeSecurityGroupEgress API call to AWS which results in the first error. From cloudtrail:

"eventSource": "ec2.amazonaws.com",
"eventName": "RevokeSecurityGroupEgress",
"awsRegion": "ap-northeast-1",
"sourceIPAddress": "[removed]",
"userAgent": "aws-sdk-go/1.19.36 (go1.12.5; linux; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.12.0",
"errorCode": "Client.InvalidPermission.NotFound",
"errorMessage": "The specified rule does not exist in this security group.",
 "requestParameters": {
        "groupId": "sg-09976fde26e3db3ea",
        "ipPermissions": {
            "items": [
                {
                    "ipProtocol": "tcp",
                    "fromPort": 0,
                    "toPort": 0,
                    "groups": {},
                    "ipRanges": {
                        "items": [
                            {
                                "cidrIp": "0.0.0.0/0",
                                "description": "Outbound Traffic"
                            }
                        ]
                    },
                    "ipv6Ranges": {},
                    "prefixListIds": {}
                }
            ]
        }
    },

"ipProtocol": "tcp", is the new value of that security group rule as seen in the plan above, not the existing one. Terraform should be sending "ipProtocol": "-1", in this request to revoke the existing rule. This then results in AWS being unable to match this egress rule to the existing one and it cannot delete it, further resulting in a duplicate security group rule error when it tries to recreate the egress rule.

Steps to Reproduce

Add a security group rule to an existing security group that causes an egress rule to be deleted.

Additional Context

Terraform is running via docker via gitlab runner.

References

ghost avatar May 24 '19 18:05 ghost

This is still an issue as of provider version 2.16.0

ghost avatar Jun 25 '19 18:06 ghost

It's still an issue with Terraform 0.12.20 and aws provider 2.50.0

michcio1234 avatar Feb 24 '20 12:02 michcio1234

The issue seems to be related to terraform 0.12. The same code successfully applies with terraform 0.11 but produces an error with terraform 0.12.

Below is my minimal example:

resource "aws_security_group_rule" "sg_rule" {
  security_group_id = "<security-group-id>"

  type        = "egress"
  from_port   = 0
  to_port     = 0
  protocol    = "udp"

  cidr_blocks = [ "0.0.0.0/0" ]
}

When I change protocol (e.g. from udp to tcp) terraform 0.11 (v0.11.14) successfully applies the change, however terraform 0.12 (v0.12.20) fails with InvalidPermission.NotFound: The specified rule does not exist in this security group.

In both cases I use AWS provider v2.32.

dshkyra avatar Mar 07 '20 14:03 dshkyra

I'm seeing a very strange behaviour with Terraform v0.12.24 + provider.aws v2.56.0.

I ran $ terraform apply, didn't change anything, ran it again to verify that no changes would take place, but I was hit by the security group forcing replacement of my EC2 servers. It is the only security group present in my code, and it has only two ingress rules and one egress rule.

What I couldn't understand is that inside the EC2 creation, Terraform can already recognise the existing security group ID, however it insists on replacing it.

 ~ vpc_security_group_ids       = [
          - "sg-081b9d9f9e40d0b4d",
        ] -> (known after apply)

It's worth mentioning that I have a root module, and three other modules

root | - Networking | - Compute | - Storage

The security group variable is:

  1. output from Networking module,
  2. defined as an empty variable in the variables.tf of the Compute module, and
  3. read by the root module as output of the Networking module and passed to the Compute module when calling it.

Here is security group code:

# SG
resource "aws_security_group" "tf_sg" {
  name = "tf_sg"
  description = "All in one Security Group for SSH, HTTP, etc.."
  vpc_id = "${aws_vpc.tf_vpc.id}"

  # 1. SSH
  ingress {
    from_port = 22
    protocol = "tcp"
    to_port = 22
    cidr_blocks = ["${var.sourceip}"]
  }

  # 2. HTTP
  ingress {
    from_port = 80
    protocol = "tcp"
    to_port = 80
    cidr_blocks = ["${var.everywhere}"]
  }

  # all out permitted, -1 is all protocols including: tcp, upd, icmp, ...
  egress {
    from_port = 0
    protocol = "-1"
    to_port = 0
    cidr_blocks = ["${var.everywhere}"]
  }

I am afraid this is very dangerous and invalidates many use cases involving EC2 and SG.

engie-tawfik avatar Apr 28 '20 00:04 engie-tawfik

I can confirm this is still an issue as of terraform v0.12.29 and provider.aws v3.0.0. We were using ingress rules instead of egress rules. Terraform was sending a delete command to AWS to delete a rule that didn't exist - a rule based on the new attributes rather than the old attributes like it should have been. I can recreate the problem by taking any existing security group rule and modifying the rule. Once it happens, I can run as many "terraform apply" as I want and the problem won't clear. It appears I can avoid the problem by changing the name of the security group rule when making the change - then terraform views the operation as a delete and a create, not a replace. The problem can also be avoided by first doing a delete, applying it, and then creating and applying.

PCjrBasic avatar Aug 12 '20 20:08 PCjrBasic

After further testing, it appears this doesn't impact all replacements as I previously though. It appears to only impact rules where the protocol is changed. Changes to the port, cidr, or description don't seem to impact it. When I change the protocol though (tried tcp -> all and tcp -> udp), I seem to be able to reproduce it 100% of the time.

PCjrBasic avatar Aug 19 '20 20:08 PCjrBasic

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 30 days it will automatically be closed. Maintainers can also remove the stale label.

If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thank you!

github-actions[bot] avatar Aug 10 '22 17:08 github-actions[bot]

Bumping for visibility, any progress on this?

ed-silva-eb avatar Aug 10 '22 17:08 ed-silva-eb

This functionality has been released in v4.29.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

github-actions[bot] avatar Sep 02 '22 13:09 github-actions[bot]

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions[bot] avatar Oct 03 '22 02:10 github-actions[bot]