terraform-google-kubernetes-engine
`intra-egress` firewall rule depends on `google_container_cluster.primary`
Summary
The intra-cluster egress firewall rule appears to depend on google_container_cluster.primary. This means the rule will not be created until the cluster itself finishes creating -- which includes creating and then deleting the default node pool. If that default node pool requires the firewall rule in order to report back to the control plane (as is the case in a private cluster in a VPC network with a default-deny-all rule), the default node pool can never get created.
Context
I'm deploying a safer-cluster in a VPC network that enforces a default deny-all rule. Exceptions are granted through the use of network tags.
The safer-cluster module includes an add_cluster_firewall_rules variable that adds firewall rules for this use-case (in particular, one allowing intra-cluster egress, which is required for node pools to communicate back to the control plane).
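For reference, a minimal sketch of the module invocation I'm using (variable names and values are illustrative, not my exact configuration):

```hcl
module "gke" {
  source = "terraform-google-modules/kubernetes-engine/google//modules/safer-cluster"

  project_id        = var.project_id
  name              = var.name
  region            = var.region
  network           = var.network
  subnetwork        = var.subnetwork
  ip_range_pods     = var.ip_range_pods
  ip_range_services = var.ip_range_services

  # Creates firewall rules (including the intra-cluster egress rule) so
  # nodes can reach the control plane in a default-deny VPC.
  add_cluster_firewall_rules = true
}
```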
Through #1118 & #1123 I added the gke-${var.name} network tag to the default pool so that this firewall rule would apply to it.
Expected behaviour
I would now expect the following sequence of actions:
- the intra-cluster egress firewall rule is created
- the google_container_cluster.primary resource is created
- the default node pool gets created with network tags marking it as a target of the previously created firewall rule (I'm running the branch from PR #1123, which adds the gke-${var.name} tag to the default pool)
- the default node pool reports back to the control plane
- the default node pool gets deleted
- Terraform continues creating my custom node pool
Actual behaviour
- the google_container_cluster.primary resource starts being created
- the intra-cluster egress firewall rule is waiting on the google_container_cluster.primary resource to be created
- the default node pool is spun up, but there is no firewall rule allowing it to report back to the control plane
- GKE waits for the default node pool to check in and eventually times out
Suspected root cause
fff0078 added TPU support to beta clusters. Part of this change included adding google_container_cluster.primary.tpu_ipv4_cidr_block to the destination_ranges of the gke-intra-egress firewall rule:
https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/fff007887483803101145be79e8f83c6dd288e0e/autogen/main/firewall.tf.tmpl#L37-L43
I suspect this creates a Terraform dependency on the google_container_cluster.primary resource. In my case, I have not enabled TPU (and have no need for it). I reverted fff0078 on my test branch and was able to bring up my safer-cluster instance successfully (i.e., the intra-cluster egress firewall rule was created before the default pool).
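To illustrate the dependency, here is a heavily simplified sketch of what the generated rule looks like (local names are illustrative; see the linked firewall.tf.tmpl for the actual code). Because destination_ranges references an attribute of google_container_cluster.primary, Terraform defers creating the rule until the cluster exists:

```hcl
resource "google_compute_firewall" "intra_egress" {
  name      = "gke-${var.name}-intra-cluster-egress"
  network   = var.network
  direction = "EGRESS"

  # Referencing google_container_cluster.primary.tpu_ipv4_cidr_block here
  # creates an implicit dependency: Terraform will not create this rule
  # until the cluster resource has finished creating.
  destination_ranges = compact([
    local.cluster_subnet_cidr,       # illustrative local
    local.cluster_alias_ranges_cidr, # illustrative local
    google_container_cluster.primary.tpu_ipv4_cidr_block,
  ])

  allow {
    protocol = "all"
  }

  target_tags = ["gke-${var.name}"]
}
```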
Potential fix
This could maybe be addressed by creating two firewall rules -- one that can be created before the cluster (without the TPU IPv4 block) and one that is created after the cluster (for the TPU IPv4 block). Don't know if this is zooming in too narrowly on my use-case, however (-:
Honestly, this seems like a bit of an edge case. Since you already have a way to grant the correct network tag, shouldn't your existing firewall rule be able to allow the traffic?
My problem is that the firewall rule is never getting created -- there's a Terraform dependency causing it to wait for the google_container_cluster.primary resource. The google_container_cluster.primary resource, however, needs the firewall rule in order to finish creating, so my apply operation blocks until GKE times out.
This comment from the firewall.tf.tmpl seems to suggest that this is the intended use case for add_cluster_firewall_rules (creating a cluster in a VPC network with a default deny all rule). I'm not 100% sure if/where my use-case is different from expected.
That said -- I can certainly copy these firewall rules and create them myself (without the implicit dependency on the google_container_cluster.primary resource) but it feels a bit strange to be duplicating this functionality from the terraform-google-kubernetes-engine modules.
Hmm, the firewall rules were originally meant as "shadow" rules to enable logging. #741 has more context.
@rux616 Since you added the TPU support in #810, can you confirm this current setup is actually working for you? I'm actually unsure why it would.
This could maybe be addressed by creating two firewall rules -- one that can be created before the cluster (without the TPU IPv4 block) and one that is created after the cluster (for the TPU IPv4 block). Don't know if this is zooming in too narrowly on my use-case, however (-:
We could probably split into a rule for the cluster overall, then a rule specific to just TPUs.
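A rough sketch of what that split might look like (resource and local names are illustrative, not the module's actual code). The first rule has no reference to the cluster resource, so it can be created up front; the second depends on the cluster but is only needed when TPUs are enabled:

```hcl
# Rule 1: created before the cluster -- no reference to the cluster
# resource, so nodes can reach the control plane during cluster creation.
resource "google_compute_firewall" "intra_egress" {
  name      = "gke-${var.name}-intra-cluster-egress"
  network   = var.network
  direction = "EGRESS"

  destination_ranges = [
    local.cluster_subnet_cidr,       # illustrative local
    local.cluster_alias_ranges_cidr, # illustrative local
  ]

  allow {
    protocol = "all"
  }

  target_tags = ["gke-${var.name}"]
}

# Rule 2: TPU-only, created after the cluster since the TPU CIDR is only
# known once google_container_cluster.primary exists.
resource "google_compute_firewall" "tpu_egress" {
  count = var.enable_tpu ? 1 : 0

  name      = "gke-${var.name}-tpu-egress"
  network   = var.network
  direction = "EGRESS"

  destination_ranges = [google_container_cluster.primary.tpu_ipv4_cidr_block]

  allow {
    protocol = "all"
  }

  target_tags = ["gke-${var.name}"]
}
```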
I think the intra-egress firewall rule was introduced in #470 (git blame pointed me to 16bdd6e6) -- not sure that it's necessarily related to the "shadow" rules.
If the rule split would work I'd be happy to take a crack at a PR tomorrow!
@rux616 Since you added the TPU support in #810, can you confirm this current setup is actually working for you? I'm actually unsure why it would.
I'm currently on holiday, but will dig into this when I get back next week. Just off the top of my head though, I believe we haven't had to touch the code on our end where the cluster is defined in a while as it has been working for us.
Since #1126 is merged (thanks to @tomasgareau!), this issue can be closed as fixed.