terraform-google-kubernetes-engine
Windows node pools depend on Linux node pools
Hi,
We are getting an error when trying to create a GKE cluster with a Windows node pool. This is the error message:
module.gke.google_container_node_pool.pools["node-pool"]: Still creating... [6m30s elapsed]
module.gke.google_container_node_pool.pools["node-pool"]: Still creating... [6m40s elapsed]
module.gke.google_container_node_pool.pools["node-pool"]: Creation complete after 6m45s [id=...]
Error: error creating NodePool: googleapi: Error 400: WINDOWS_SAC and WINDOWS_LTSC image families require at least one other Linux node pool (e.g. COS_CONTAINERD, COS, UBUNTU) in the cluster., badRequest
on .terraform/modules/gke/modules/beta-private-cluster-update-variant/cluster.tf line 395, in resource "google_container_node_pool" "pools":
395: resource "google_container_node_pool" "pools" {
Despite this error, running the module a second time completes without any error.
From the output and the tests we performed, this appears to be a timing issue: the Linux node pool creation completed (module.gke.google_container_node_pool.pools["node-pool"]: Creation complete after 6m45s), but the Windows node pool creation started immediately afterwards, probably before the Linux node pool was actually ready in the cluster.
Side notes:
- We haven't tested using the default node pool instead of a new one, and we don't think it's a feasible solution, since any change to the default node pool forces the cluster to be recreated.
- If we create a Windows node pool resource that depends on the Linux node pool resource, Terraform also runs perfectly (tested with standalone resources only, without any dependency on the terraform-google-modules/kubernetes-engine module).
- The only solution we found, besides running the module twice, is to create a node pool resource outside the module with depends_on = [module.gke] (see the sketch after these notes).
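For reference, a rough sketch of that last workaround (assuming the module exposes name and location outputs; the settings just mirror the windows-nodepool entry from the configuration below and are illustrative only):

resource "google_container_node_pool" "windows" {
  name     = "windows-nodepool"
  project  = var.project_id
  location = module.gke.location
  cluster  = module.gke.name

  initial_node_count = 1

  autoscaling {
    min_node_count = 1
    max_node_count = 2
  }

  management {
    auto_repair  = true
    auto_upgrade = false
  }

  node_config {
    machine_type = "n1-standard-4"
    image_type   = "WINDOWS_LTSC"
    disk_size_gb = 50
    disk_type    = "pd-standard"
  }

  # Waiting for the whole module (including its Linux "node-pool") avoids
  # the WINDOWS_LTSC 400 error shown above.
  depends_on = [module.gke]
}

With this approach the windows-nodepool entry is removed from the module's node_pools list, so only the Linux pool is managed by the module.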
Is there any other possible solution we are missing?
Configuration used:
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster-update-variant"
version = "14.0.1"
project_id = var.project_id
name = var.cluster_name
regional = false
region = "us-central1"
zones = ["us-central1-a"]
remove_default_node_pool = true
grant_registry_access = true
release_channel = "STABLE"
database_encryption = [{
state = "ENCRYPTED"
key_name = var.kms_encrypt_etcd
}]
network = module.gcp-network.network_name
subnetwork = module.gcp-network.subnets_names[0]
ip_range_pods = var.ip_range_pods_name
ip_range_services = var.ip_range_services_name
create_service_account = true
enable_private_nodes = true
master_ipv4_cidr_block = "172.16.0.0/28"
add_cluster_firewall_rules = true
identity_namespace = null
node_metadata = "EXPOSE"
master_global_access_enabled = true
node_pools = [
{
name = "node-pool"
machine_type = "g1-small"
min_count = 1
max_count = 1
disk_size_gb = 20
disk_type = "pd-standard"
image_type = "COS"
auto_repair = true
auto_upgrade = true
preemptible = false
initial_node_count = 1
},
{
name = "windows-nodepool"
machine_type = "n1-standard-4"
min_count = 1
max_count = 2
disk_size_gb = 50
disk_type = "pd-standard"
image_type = "WINDOWS_LTSC"
auto_repair = true
auto_upgrade = false
preemptible = false
enable_integrity_monitoring = false
initial_node_count = 1
},
]
node_pools_oauth_scopes = {
all = [
"https://www.googleapis.com/auth/trace.append",
"https://www.googleapis.com/auth/service.management.readonly",
"https://www.googleapis.com/auth/monitoring",
"https://www.googleapis.com/auth/devstorage.read_only",
"https://www.googleapis.com/auth/servicecontrol",
"https://www.googleapis.com/auth/cloudkms",
"https://www.googleapis.com/auth/logging.write"
]
}
depends_on = [
data.google_compute_subnetwork.subnetwork
]
}
@Edelf
Is this consistently reproducible? Unfortunately I don't see an easy fix for this if it's a race condition.
@morgante It's not 100% consistent, but it happens frequently.
One possible fix would be to split node pools into two different resources: one for Linux node pools and one for Windows node pools, with the latter dependent on the former. This would be a breaking change, though.
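Roughly, that split could look something like the sketch below (illustrative only, not the module's current code: the locals are hypothetical, most pool arguments are elided, and the cluster reference stands in for whatever the module already uses):

locals {
  # Hypothetical locals (not in the module today): partition var.node_pools
  # by image family so the Windows resource can depend on the Linux one.
  linux_node_pools = {
    for np in var.node_pools : np["name"] => np
    if !can(regex("^WINDOWS", lookup(np, "image_type", "COS")))
  }
  windows_node_pools = {
    for np in var.node_pools : np["name"] => np
    if can(regex("^WINDOWS", lookup(np, "image_type", "COS")))
  }
}

resource "google_container_node_pool" "pools" {
  for_each = local.linux_node_pools

  # ...same arguments the module builds from each.value today...
  name    = each.value["name"]
  cluster = google_container_cluster.primary.name
}

resource "google_container_node_pool" "windows_pools" {
  for_each = local.windows_node_pools

  # ...same arguments as the Linux resource...
  name    = each.value["name"]
  cluster = google_container_cluster.primary.name

  # Creating Windows pools only after all Linux pools exist satisfies the
  # API requirement behind the 400 error reported above.
  depends_on = [google_container_node_pool.pools]
}

Existing Windows pools would move to a new resource address under this layout, which is one reason it would be a breaking change.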
For the record, we are hitting this exact problem.
We are still facing this problem. Is there any news on a timeline for fixing it?
It's still on our backlog, but we don't have a timeline.
If someone wants to pick it up in the meantime, I'd be happy to review a pull request.