terraform-google-kubernetes-engine icon indicating copy to clipboard operation
terraform-google-kubernetes-engine copied to clipboard

Firewall rules for Autopilot clusters are ineffective

Open dhess opened this issue 2 years ago • 13 comments

TL;DR

The add_cluster_firewall_rules and related firewall options do not appear to have any effect for Autopilot clusters. I've configured all of them, with higher priority than my default firewall rules in a shared VPC, but the hit counters show 0 hits. Meanwhile, the lower priority generic rules in my firewall are showing hits and are permitting the cluster to pass traffic.

Expected behavior

The higher priority firewall rules added by the module should be in effect, passing traffic, and showing hits.

Observed behavior

The rules are ignored.

Terraform Configuration

add_cluster_firewall_rules = true
firewall_priority = 900
...

Terraform Version

1.1.8

Additional information

I'm not yet very experienced with GKE and firewall rules, but I have not yet figured out how to make firewall tags work with GKE Autopilot clusters in my own configs. It was only after I removed them that my firewall rules became effective for GKE Autopilot clusters. I noticed that this module is applying tags to the generated firewall rules (where the tag is equal to the GKE Autopilot cluster's name), so perhaps that's what's going on?

dhess avatar Apr 18 '22 21:04 dhess

This might indeed need further configuration. Most likely we are not setting the network tags appropriately for the node pools that autopilot auto-provisions.

morgante avatar Apr 18 '22 21:04 morgante

This might indeed need further configuration. Most likely we are not setting the network tags appropriately for the node pools that autopilot auto-provisions.

Maybe. The firewall rules that are automatically created by GKE do include a tag, but it's not just the name of the cluster — there's a machine-generated suffix, as well. I could not figure out how to generate that tag in my own config.

dhess avatar Apr 18 '22 21:04 dhess

Maybe. The firewall rules that are automatically created by GKE do include a tag, but it's not just the name of the cluster — there's a machine-generated suffix, as well. I could not figure out how to generate that tag in my own config.

Yeah, we currently rely on the network tag matching the cluster name.

For autopilot, that association might not hold true. Unfortunately it looks like we're blocked on adding support to the provider: https://github.com/hashicorp/terraform-provider-google/issues/11051

morgante avatar Apr 18 '22 21:04 morgante

This is what I did to create a rule to target nodes in autopilot cluster after the node pool is created. List all the firewall rules, and filter those targeting a tag starting with gke-$CLUSTER_NAME to get the tag.

TAG=$(gcloud compute firewall-rules list --project $PROJECT --filter="name~gke-$CLUSTER_NAME-[0-9a-z]*-master" --format="value(targetTags[0])") Then gcloud compute firewall-rules create my-rule --target-tags $TAG ...

Its neither elegant nor robust. But afaik there's no way to interrogate GKE to get a node's network tags.

intotecho avatar Aug 04 '22 10:08 intotecho

I think the issue might be about how Autopilot modules set network tags for firewall rules:

target_tags   = [local.cluster_network_tag]

The cluster_network_tag variable is defined as:

cluster_network_tag                        = "gke-${var.name}"

Other (GKE-managed) firewall rules have a different tag: gke-${var.name}-<ID>-node

ferrarimarco avatar Feb 06 '23 10:02 ferrarimarco

Yes, as @ferrarimarco highlights, the target_tags for the FW rule are a tag that is different to the gke-managed tag that ends up on the nodes that we can't access other than seeing it turn up in FW rules for intra-cluster coms or LBs. The problem is this: the shadow firewall rules, webhook firewall rules etc - their target_tags are a tag that is never attached to the cluster. I have a PR open to attach the cluster_network_tag to autopilot clusters so that these firewall rules can function. I don't think it's got any eyes on it yet though.

edit to add: This has the highest impact for the intra-egress rule which is required if the vpc has a deny 0.0.0.0/0 Without this rule the control plane can't talk to respective other CPs and the cluster fails to even come up.

PR if people have thoughts: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/pull/1655

GorginZ avatar Jun 26 '23 00:06 GorginZ

Is the autopilot module be setting --autoprovisioning-network-tags instead of --tags ?

https://cloud.google.com/kubernetes-engine/docs/how-to/autopilot-network-tags#command-line_options_for_applying_network_tags

intotecho avatar Jun 26 '23 01:06 intotecho

The beta-autopilot-private-cluster and beta-autopilot-public-cluster modules set --autoprovisioning-network-tags. Which is the only option for autopilot.

I don't know the how or what of the tag we see from the gke auto created FW rules. Presumably it is on the nodes though because the gke provisioned firewall rules do work and also I note that service of type loadBalancer's respective provisioned firewalls for healthchecks have this same tag as a target_tag work - but this tag is inaccessible of course other than the "reflections" we see of it in the FW rules etc, because we can't get our hands on the nodes with autopilot clusters. But we can add auto provisioning network tags so we can add our "cluster_network_tag" to that to make sure additional FW rules are effective at least.

GorginZ avatar Jun 26 '23 01:06 GorginZ

I believe the issue here is that module creates its own local cluster_network_tag local in https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/493149d37aa13bbda1fc8dbcaec1b40f051c642e/modules/beta-autopilot-private-cluster/main.tf#L111 and uses it as a target for firewall rules eg. https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/29d925997ec0fd5ecbd6dd5792f220fc48a5725b/modules/beta-autopilot-private-cluster/firewall.tf#L36

However, when creating the cluster, the module takes user-provided variable network_tags and passes these to node_pool_auto_config in https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/493149d37aa13bbda1fc8dbcaec1b40f051c642e/modules/beta-autopilot-private-cluster/cluster.tf#L100-L107

I believe the firewall rules should just use var.network_tags too.

gustaff-weldon avatar Nov 21 '23 16:11 gustaff-weldon

@gustaff-weldon Hey thanks for commenting, haven't really got any eyes on this so appreciate the engagement!

hmm I don't think we can use var.network_tags firstly because they are an optional variable, so a user may have the additional FW vars toggled on, but not have any specific network additional network tags to add in var.network_tags, in which case we need to target something but we have nothing.

And the second thing I think is it would be a bit obscuring. This var is understood as the list of network tags to attach to the nodes, which people may have several of - like to allow reaching internal services or whatever, we don't want to create new FW rules targeting those tags, they already have a purpose presumably - for instance, the terraform-example-foundation the allow-google-apis tag is an example of a tag that would be on a bunch of VMs that are fairly "locked down" and you may chuck it on your cluster too, but we wouldn't want all the instances with that to also get a bunch of other FW rules now target them as well, it's an option for users to make sure any other required FW rules or tag-values they already know they'll need get on the nodes.

We already have the local cluster_network_tag - it has a meaningful name, it's meant to be a network tag, for cluster networking, that the intra-cluster networking FW rules already target, but it doesn't end up on the nodes. I think this is just a simple oversight/miss that impacts few people because it would only cause issues if someone had particularly hard-line FW rules in their network. I think we should just stick this tag on the nodes, and I am sure that's what was meant to happen but didn't, and it generally doesn't break stuff for people so fell through the cracks.

IDK if you've taken a look but lmk what you think https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/pull/1655/files#diff-a3a07be3819af6f1a65ed243a613e69b544b7274d248a4b4d7d4d127aec40aee

GorginZ avatar Nov 23 '23 01:11 GorginZ

@GorginZ thanks for getting back. What I suggested is only one of the possible solutions. I was thinking that you might raise an empty network tags as a concern :) And I understand that you do not want to create firewall rules targeting network_tags. In such scenario, what we need is just to make sure that when user asks for firewall rules local.cluster_network_tag gets appended to tags in node_pool.

Your solution would work, I would also consider keeping conditions on locals level eg.:

locals {
  firewall_network_tag = "gke-gke-${var.env_code}-autopilot"
  cluster_network_tags = (var.enable_private_endpoint || var.enable_private_nodes) ? concat([local.firewall_network_tag], var.network_tags) : var.network_tags
}

Firewall rules would target firewall_network_tag, node pool, would iterate over cluster_network_tags. This avoids duplicating condition and also, makes it, in my opinion, clear what tags will be used for firewall and what for cluster nodes.

gustaff-weldon avatar Nov 23 '23 08:11 gustaff-weldon

Thanks @gustaff-weldon. Conditions in locals is probably nicer. I don't mind how we do it so long as we get the cluster_network_tag on the cluster when it's needed. My PR hasn't got any eyes so I'm not sure there's any interest in resolving this sadly. It was closed for being "stale", but I may re-open it.

GorginZ avatar Dec 14 '23 06:12 GorginZ

Hey guys I dont know if this will be helpful for you or not but I see its been giving you a hassle for a while, such an oversight by Google. So I automated the process and shared the script, this is a docker container that goes out and fetches the current list of IP's on your cluster and then updates your firewall rule removing any ip's not in the cluster and making sure that the IP's are there for the current cluster nodes. It does not update the firewall rules unless there is a discrepancy.

It could probably be improved, I just needed something that worked so that our production cluster was not impacted every time autopilot updated our nodes.

https://github.com/neuronic-ai/google-scripts

EncryptShawn avatar Apr 25 '24 16:04 EncryptShawn