
`intra-egress` firewall rule depends on `google_container_cluster.primary`

tomasgareau opened this issue 3 years ago · 6 comments

Summary

The intra-cluster egress firewall rule appears to depend on google_container_cluster.primary. This means it will not be created until after the default node pool is created and deleted. If this node pool requires the firewall rule in order to report back to the control plane (as is the case in a private cluster in a VPC network with a default-deny-all rule), the default node pool can never get created.

Context

I'm deploying a safer-cluster in a VPC network that enforces a default deny-all rule. Exceptions are granted through the use of network tags.
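For readers unfamiliar with this pattern, a minimal sketch of such a VPC policy might look like the following. All names, priorities, and CIDR ranges here are illustrative, not taken from the actual deployment:

```hcl
# Hypothetical low-priority rule denying all egress by default.
resource "google_compute_firewall" "deny_all_egress" {
  name      = "deny-all-egress" # illustrative name
  network   = google_compute_network.vpc.name
  direction = "EGRESS"
  priority  = 65534

  deny {
    protocol = "all"
  }

  destination_ranges = ["0.0.0.0/0"]
}

# Exceptions are granted by higher-priority rules scoped to network tags.
resource "google_compute_firewall" "allow_tagged_egress" {
  name        = "allow-gke-egress" # illustrative name
  network     = google_compute_network.vpc.name
  direction   = "EGRESS"
  priority    = 1000
  target_tags = ["gke-my-cluster"] # e.g. the gke-${var.name} tag

  allow {
    protocol = "all"
  }

  destination_ranges = ["10.0.0.0/8"] # illustrative intra-cluster range
}
```

Because the deny rule sits at a lower priority, only instances carrying an allowed network tag can send egress traffic.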

The safer-cluster module includes an add_cluster_firewall_rules variable that adds firewall rules for this use-case (in particular, one allowing intra-cluster egress, which is required for node pools to communicate back to the control plane).

Through #1118 & #1123 I added the gke-${var.name} network tag to the default pool so that this firewall rule would apply to it.

Expected behaviour

I would now expect the following sequence of actions:

  • the intra-cluster egress firewall rule is created
  • the google_container_cluster.primary resource is created
  • the default node pool gets created with network tags marking it as a target of the previously created firewall rule (I'm running the branch from PR 1123 which adds the gke-${var.name} tag to the default pool)
  • the default node pool reports back to the control plane
  • the default node pool gets deleted
  • Terraform continues creating my custom node pool

Actual behaviour

  • the google_container_cluster.primary resource starts being created
  • the intra-cluster egress firewall rule is waiting on the google_container_cluster.primary resource to be created
  • the default node pool is spun up but there is no firewall rule allowing it to report back to the control plane
  • GKE waits for the default node pool to check in and eventually times out

Suspected root cause

fff0078 added TPU support to beta clusters. Part of this change included adding google_container_cluster.primary.tpu_ipv4_cidr_block to the destination_ranges of the gke-intra-egress firewall rule:

https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/fff007887483803101145be79e8f83c6dd288e0e/autogen/main/firewall.tf.tmpl#L37-L43

I suspect this creates a Terraform dependency on the google_container_cluster.primary resource. In my case, I have not enabled TPU (and have no need for it). I reverted fff0078 on my test branch and was able to bring up my safer-cluster instance successfully (i.e., the intra-cluster egress firewall rule was created before the default pool).
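To illustrate the mechanism (this is a simplified paraphrase of the linked template, with abridged and partly hypothetical variable names, not the module's exact code): in Terraform, referencing any attribute of a resource creates an implicit dependency on that resource, so the firewall rule cannot be created until `google_container_cluster.primary` finishes creating.

```hcl
# Simplified sketch of the pattern in firewall.tf.tmpl.
resource "google_compute_firewall" "intra_egress" {
  name      = "gke-${var.name}-intra-cluster-egress"
  network   = var.network
  direction = "EGRESS"

  allow {
    protocol = "all"
  }

  destination_ranges = concat(
    # Illustrative stand-ins for the master/subnet/pod ranges:
    [var.master_ipv4_cidr_block, var.subnet_ip],
    # This attribute reference is the implicit dependency -- Terraform
    # defers the rule until the cluster (and its default pool) exists:
    compact([google_container_cluster.primary.tpu_ipv4_cidr_block]),
  )
}
```

Even with TPU disabled, `tpu_ipv4_cidr_block` is only known after apply, so the dependency exists regardless of whether the range is ever populated.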

Potential fix

This could maybe be addressed by creating two firewall rules -- one that can be created before the cluster (without the TPU IPv4 block) and one that is created after the cluster (for the TPU IPv4 block). Don't know if this is zooming in too narrowly on my use-case, however (-:

tomasgareau · Jan 12 '22

Honestly, this seems like a bit of an edge case. Since you already have a way to grant the correct network tag, shouldn't your existing firewall rule be able to allow the traffic?

morgante · Jan 12 '22

My problem is that the firewall rule is never getting created -- there's a Terraform dependency causing it to wait for the google_container_cluster.primary resource. The google_container_cluster.primary resource, however, needs the firewall rule in order to finish creating, so my apply operation blocks until GKE times out.

This comment from the firewall.tf.tmpl seems to suggest that this is the intended use case for add_cluster_firewall_rules (creating a cluster in a VPC network with a default deny all rule). I'm not 100% sure if/where my use-case is different from expected.

That said -- I can certainly copy these firewall rules and create them myself (without the implicit dependency on the google_container_cluster.primary resource) but it feels a bit strange to be duplicating this functionality from the terraform-google-kubernetes-engine modules.

tomasgareau · Jan 12 '22

Hmm, the firewall rules were originally meant as "shadow" rules to enable logging. #741 has more context.

@rux616 Since you added the TPU support in #810, can you confirm this current setup is actually working for you? I'm actually unsure why it would.

This could maybe be addressed by creating two firewall rules -- one that can be created before the cluster (without the TPU IPv4 block) and one that is created after the cluster (for the TPU IPv4 block). Don't know if this is zooming in too narrowly on my use-case, however (-:

We could probably split into a rule for the cluster overall, then a rule specific to just TPUs.
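A sketch of that split, with illustrative names and variables (this is a proposal, not the merged implementation): the base rule carries no reference to the cluster resource, so Terraform can create it before the cluster, while the TPU rule keeps the attribute reference and is therefore created afterwards, which is harmless since TPU traffic only exists once the cluster is up.

```hcl
# Base intra-cluster egress rule: no cluster attribute referenced,
# so it can be created before google_container_cluster.primary.
resource "google_compute_firewall" "intra_egress" {
  name      = "gke-${var.name}-intra-cluster-egress"
  network   = var.network
  direction = "EGRESS"

  allow {
    protocol = "all"
  }

  destination_ranges = [var.master_ipv4_cidr_block, var.subnet_ip]
}

# TPU-specific rule: keeps the implicit dependency on the cluster,
# and is only created at all when TPU support is enabled.
resource "google_compute_firewall" "tpu_egress" {
  count = var.enable_tpu ? 1 : 0

  name      = "gke-${var.name}-tpu-egress"
  network   = var.network
  direction = "EGRESS"

  allow {
    protocol = "all"
  }

  destination_ranges = [google_container_cluster.primary.tpu_ipv4_cidr_block]
}
```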

morgante · Jan 12 '22

I think the intra-egress firewall rule was introduced in #470 (git blame pointed me to 16bdd6e6) -- not sure that it's necessarily related to the "shadow" rules.

If the rule split would work I'd be happy to take a crack at a PR tomorrow!

tomasgareau · Jan 12 '22

@rux616 Since you added the TPU support in #810, can you confirm this current setup is actually working for you? I'm actually unsure why it would.

I'm currently on holiday, but will dig into this when I get back next week. Just off the top of my head though, I believe we haven't had to touch the code on our end where the cluster is defined in a while as it has been working for us.

rux616 · Jan 13 '22

Since #1126 is merged (thanks to @tomasgareau!), this issue can be closed as fixed.

legal90 · Dec 21 '23