terraform-google-kubernetes-engine
Windows node pools depend on Linux node pools
Hi,
We are getting an error when trying to create a GKE cluster with a Windows node pool. This is the error message:
module.gke.google_container_node_pool.pools["node-pool"]: Still creating... [6m30s elapsed]
module.gke.google_container_node_pool.pools["node-pool"]: Still creating... [6m40s elapsed]
module.gke.google_container_node_pool.pools["node-pool"]: Creation complete after 6m45s [id=...]
Error: error creating NodePool: googleapi: Error 400: WINDOWS_SAC and WINDOWS_LTSC image families require at least one other Linux node pool (e.g. COS_CONTAINERD, COS, UBUNTU) in the cluster., badRequest
on .terraform/modules/gke/modules/beta-private-cluster-update-variant/cluster.tf line 395, in resource "google_container_node_pool" "pools":
395: resource "google_container_node_pool" "pools" {
Despite this error, running the module a second time completes without any error.
From the output and the tests we performed, this appears to be a timing issue: the Linux node pool creation completed (module.gke.google_container_node_pool.pools["node-pool"]: Creation complete after 6m45s), but the Windows node pool creation started immediately afterwards, probably before the Linux node pool was actually ready in the cluster.
Side notes:
- We haven't tested using the default node pool instead of a new one, and we don't think it's a feasible solution, since any change to the default node pool forces the cluster to be recreated.
- If we create a Windows node pool resource that depends on the Linux node pool resource, Terraform also runs perfectly (tested with standalone resources only, without any dependency on the terraform-google-modules/kubernetes-engine module).
- The only solution we found, besides running the module twice, is to create a node pool resource outside the module with depends_on = [module.gke] (see the sketch after these notes).
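For reference, a rough sketch of that last workaround (assuming the module exposes name and location outputs; the settings just mirror the windows-nodepool entry from the configuration below and are illustrative only):

resource "google_container_node_pool" "windows" {
  name     = "windows-nodepool"
  project  = var.project_id
  location = module.gke.location
  cluster  = module.gke.name

  initial_node_count = 1

  autoscaling {
    min_node_count = 1
    max_node_count = 2
  }

  management {
    auto_repair  = true
    auto_upgrade = false
  }

  node_config {
    machine_type = "n1-standard-4"
    image_type   = "WINDOWS_LTSC"
    disk_size_gb = 50
    disk_type    = "pd-standard"
  }

  # Waiting for the whole module (including its Linux "node-pool") avoids
  # the WINDOWS_LTSC 400 error shown above.
  depends_on = [module.gke]
}

With this approach the windows-nodepool entry is removed from the module's node_pools list, so only the Linux pool is managed by the module.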
Is there any other possible solution we are missing?
Configuration used:
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster-update-variant"
version = "14.0.1"
project_id = var.project_id
name = var.cluster_name
regional = false
region = "us-central1"
zones = ["us-central1-a"]
remove_default_node_pool = true
grant_registry_access = true
release_channel = "STABLE"
database_encryption = [{
state = "ENCRYPTED"
key_name = var.kms_encrypt_etcd
}]
network = module.gcp-network.network_name
subnetwork = module.gcp-network.subnets_names[0]
ip_range_pods = var.ip_range_pods_name
ip_range_services = var.ip_range_services_name
create_service_account = true
enable_private_nodes = true
master_ipv4_cidr_block = "172.16.0.0/28"
add_cluster_firewall_rules = true
identity_namespace = null
node_metadata = "EXPOSE"
master_global_access_enabled = true
node_pools = [
{
name = "node-pool"
machine_type = "g1-small"
min_count = 1
max_count = 1
disk_size_gb = 20
disk_type = "pd-standard"
image_type = "COS"
auto_repair = true
auto_upgrade = true
preemptible = false
initial_node_count = 1
},
{
name = "windows-nodepool"
machine_type = "n1-standard-4"
min_count = 1
max_count = 2
disk_size_gb = 50
disk_type = "pd-standard"
image_type = "WINDOWS_LTSC"
auto_repair = true
auto_upgrade = false
preemptible = false
enable_integrity_monitoring = false
initial_node_count = 1
},
]
node_pools_oauth_scopes = {
all = [
"https://www.googleapis.com/auth/trace.append",
"https://www.googleapis.com/auth/service.management.readonly",
"https://www.googleapis.com/auth/monitoring",
"https://www.googleapis.com/auth/devstorage.read_only",
"https://www.googleapis.com/auth/servicecontrol",
"https://www.googleapis.com/auth/cloudkms",
"https://www.googleapis.com/auth/logging.write"
]
}
depends_on = [
data.google_compute_subnetwork.subnetwork
]
}
@Edelf
Is this consistently reproducible? Unfortunately I don't see an easy fix for this if it's a race condition.
@morgante It's not 100% consistent, but it happens frequently.
One possible fix would be to split node pools into two different resources: one for Linux node pools and one for Windows node pools, with the latter dependent on the former. This would be a breaking change, though.
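Roughly, that split could look something like the sketch below (illustrative only, not the module's current code: the locals are hypothetical, most pool arguments are elided, and the cluster reference stands in for whatever the module already uses):

locals {
  # Hypothetical locals (not in the module today): partition var.node_pools
  # by image family so the Windows resource can depend on the Linux one.
  linux_node_pools = {
    for np in var.node_pools : np["name"] => np
    if !can(regex("^WINDOWS", lookup(np, "image_type", "COS")))
  }
  windows_node_pools = {
    for np in var.node_pools : np["name"] => np
    if can(regex("^WINDOWS", lookup(np, "image_type", "COS")))
  }
}

resource "google_container_node_pool" "pools" {
  for_each = local.linux_node_pools

  # ...same arguments the module builds from each.value today...
  name    = each.value["name"]
  cluster = google_container_cluster.primary.name
}

resource "google_container_node_pool" "windows_pools" {
  for_each = local.windows_node_pools

  # ...same arguments as the Linux resource...
  name    = each.value["name"]
  cluster = google_container_cluster.primary.name

  # Creating Windows pools only after all Linux pools exist satisfies the
  # API requirement behind the 400 error reported above.
  depends_on = [google_container_node_pool.pools]
}

Existing Windows pools would move to a new resource address under this layout, which is one reason it would be a breaking change.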
For the record, we are hitting this exact problem.
We are still facing this problem. Is there any news on a timeline for fixing it?
It's still on our backlog, but we don't have a timeline.
If someone wants to pick it up in the meantime, I'd be happy to review a pull request.