terraform-google-kubernetes-engine

Error / issue applying kubelet config

Open wyardley opened this issue 1 year ago • 14 comments

TL;DR

See also #2013

I'm seeing a permadiff that may or may not be related to my having manually (outside of Terraform) enabled a kubelet config setting. I'm fairly confident that before that change I had neither a permadiff nor an error applying this state.

Expected behavior

The config should apply cleanly.

Observed behavior

  ~ resource "google_container_node_pool" "pools" {
        id                          = "projects/xxx/locations/us-central1/clusters/yyy/nodePools/primary"
        name                        = "primary"
        # (10 unchanged attributes hidden)

      ~ node_config {
            tags                        = [
                "gke-prod-cluster-01",
                "gke-prod-cluster-01-primary",
            ]
            # (17 unchanged attributes hidden)

          - kubelet_config {
              - cpu_cfs_quota  = false -> null
              - pod_pids_limit = 0 -> null
            }

            # (2 unchanged blocks hidden)
        }

The plan shows the diff above, and the apply then fails with:

module.gke.google_container_node_pool.pools["primary"]: Modifying... [id=projects/xxx/locations/us-central1/clusters/yyy/nodePools/primary]
╷
│ Error: googleapi: Error 400: At least one of ['node_version', 'image_type', 'updated_node_pool', 'locations', 'workload_metadata_config', 'upgrade_settings', 'kubelet_config', 'linux_node_config', 'tags', 'taints', 'labels', 'node_network_config', 'gcfs_config', 'gvnic', 'confidential_nodes', 'logging_config', 'fast_socket', 'resource_labels', 'accelerators', 'windows_node_config', 'machine_type', 'disk_type', 'disk_size_gb', 'storage_pools', 'containerd_config', 'resource_manager_tags', 'performance_monitoring_unit', 'queued_provisioning'] must be specified.
│ Details:
│ [
│   {
│     "@type": "type.googleapis.com/google.rpc.RequestInfo",
│     "requestId": "0xaf5070f5462ddf7d"
│   }
│ ]
│ , badRequest
│ 
│   with module.gke.google_container_node_pool.pools["primary"],
│   on .terraform/modules/gke/modules/private-cluster/cluster.tf line 491, in resource "google_container_node_pool" "pools":
│  491: resource "google_container_node_pool" "pools" {
│ 
╵

See further debug output below

Terraform Configuration

module "gke" {
  source                = "terraform-google-modules/kubernetes-engine/google//modules/private-cluster"
  version               = "31.1.0"
  project_id            = var.project
  name                  = "foo-cluster-01"
  service_account_name  = "foo-cluster-01"
  grant_registry_access = true
  kubernetes_version    = "1.29.6-gke.1326000"
  release_channel       = "UNSPECIFIED"
  region                = "us-central1"
  zones = [
    data.google_compute_zones.available.names[1],
    data.google_compute_zones.available.names[2],
  ]
  network = data.terraform_remote_state.network.outputs.network_name

  subnetwork = data.terraform_remote_state.network.outputs.subnets_names[0]
  ip_range_pods     = data.terraform_remote_state.network.outputs.subnets_secondary_ranges[0][0].range_name
  ip_range_services = data.terraform_remote_state.network.outputs.subnets_secondary_ranges[0][1].range_name

  horizontal_pod_autoscaling = true
  enable_private_nodes       = true

  master_authorized_networks = local.all_allowlist_ranges
  dns_cache                  = true

  remove_default_node_pool = true
  node_pools = [
    # Note: this is intentionally different from the actual default,
    # "default-pool"
    {
      name                      = "primary"
      machine_type              = var.instance_type
      total_min_count           = var.node_pool_total_min_count
      total_max_count           = var.node_pool_total_max_count
      local_ssd_count           = 0
      spot                      = false
      local_ssd_ephemeral_count = 0
      disk_size_gb              = 100
      disk_type                 = "pd-balanced"
      image_type                = "COS_CONTAINERD"
      enable_gcfs               = false
      enable_gvnic              = false
      logging_variant           = "DEFAULT"
      auto_upgrade              = false
      preemptible               = false
      # Note: this was an attempt to resolve the permadiff; fails without it too
      pod_pids_limit            = 0
    },
  ]

  node_pools_oauth_scopes = {
    # Note: use cloud platform only, and manage monitoring etc. permissions via
    # IAM
    all = [
      "https://www.googleapis.com/auth/cloud-platform",
    ]
  }
}

Terraform Version

OpenTofu v1.7.2
on darwin_arm64
+ provider registry.opentofu.org/hashicorp/external v2.3.3
+ provider registry.opentofu.org/hashicorp/google v5.37.0
+ provider registry.opentofu.org/hashicorp/kubernetes v2.31.0
+ provider registry.opentofu.org/hashicorp/null v3.2.2
+ provider registry.opentofu.org/hashicorp/random v3.6.2


Additional information

2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: PUT /v1/projects/xxx/locations/us-central1/clusters/yyyy/nodePools/primary?alt=json&prettyPrint=false HTTP/1.1
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: Host: container.googleapis.com
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: User-Agent: google-api-go-client/0.5 Terraform/1.7.2 (+https://www.terraform.io) Terraform-Plugin-SDK/2.33.0 terraform-provider-google/dev blueprints/terraform/terraform-google-kubernetes-engine:private-cluster/v31.1.0
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: Content-Length: 25
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: Content-Type: application/json
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: X-Goog-Api-Client: gl-go/1.21.11 gdcl/0.185.0
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: Accept-Encoding: gzip
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: {
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: "nodePoolId": "primary"
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: }
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: -----------------------------------------------------
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: 2024/07/26 11:49:18 [DEBUG] Google API Response Details:
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: ---[ RESPONSE ]--------------------------------------
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: HTTP/2.0 400 Bad Request
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: Cache-Control: private
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: Content-Type: application/json; charset=UTF-8
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: Date: Fri, 26 Jul 2024 18:49:18 GMT
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: Server: ESF
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: Vary: Origin
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: Vary: X-Origin
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: Vary: Referer
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: X-Content-Type-Options: nosniff
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: X-Frame-Options: SAMEORIGIN
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: X-Xss-Protection: 0
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: {
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "error": {
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "code": 400,
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "message": "At least one of ['node_version', 'image_type', 'updated_node_pool', 'locations', 'workload_metadata_config', 'upgrade_settings', 'kubelet_config', 'linux_node_config', 'tags', 'taints', 'labels', 'node_network_config', 'gcfs_config', 'gvnic', 'confidential_nodes', 'logging_config', 'fast_socket', 'resource_labels', 'accelerators', 'windows_node_config', 'machine_type', 'disk_type', 'disk_size_gb', 'storage_pools', 'containerd_config', 'resource_manager_tags', 'performance_monitoring_unit', 'queued_provisioning'] must be specified.",
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "errors": [
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: {
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "message": "At least one of ['node_version', 'image_type', 'updated_node_pool', 'locations', 'workload_metadata_config', 'upgrade_settings', 'kubelet_config', 'linux_node_config', 'tags', 'taints', 'labels', 'node_network_config', 'gcfs_config', 'gvnic', 'confidential_nodes', 'logging_config', 'fast_socket', 'resource_labels', 'accelerators', 'windows_node_config', 'machine_type', 'disk_type', 'disk_size_gb', 'storage_pools', 'containerd_config', 'resource_manager_tags', 'performance_monitoring_unit', 'queued_provisioning'] must be specified.",
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "domain": "global",
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "reason": "badRequest"
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: }
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: ],
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "status": "INVALID_ARGUMENT",
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "details": [
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: {
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "@type": "type.googleapis.com/google.rpc.RequestInfo",
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "requestId": "0xa4a8369efaf57da0"
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: }
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: ]
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: }
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: }
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: -----------------------------------------------------
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: 2024/07/26 11:49:18 [DEBUG] Retry Transport: Stopping retries, last request failed with non-retryable error: googleapi: got HTTP response code 400 with body: HTTP/2.0 400 Bad Request

wyardley avatar Jul 26 '24 18:07 wyardley

I also am encountering this issue. I have added (in node_pools) the following values, as per the documentation:

    cpu_cfs_quota      = false
    pod_pids_limit     = 0

However, on each plan, it is ignored and Terraform wants to revert back to the default values of null:

# module.gke.google_container_node_pool.pools["default"] will be updated in-place
~ resource "google_container_node_pool" "pools" {
      id                          = "projects/redacted/nodePools/default-5a32"
      name                        = "default-5a32"
      # (11 unchanged attributes hidden)

    ~ node_config {
          tags                        = [
              "redacted",
              "redacted-default",
              "default",
          ]
          # (20 unchanged attributes hidden)

        - kubelet_config {
            - cpu_cfs_quota        = false -> null
            - pod_pids_limit       = 0 -> null
              # (2 unchanged attributes hidden)
          }

          # (3 unchanged blocks hidden)
      }

      # (5 unchanged blocks hidden)
  }

Plan: 0 to add, 1 to change, 0 to destroy.

I'm using version 31.1.0 of the private-cluster-update-variant module.

Nickmman avatar Jul 27 '24 04:07 Nickmman

Same issue here when using the private cluster module.

module.gke.google_container_node_pool.pools["default-node-pool"] will be updated in-place
  ~ resource "google_container_node_pool" "pools" {
        id                          = "projects/xxx/locations/us-east4/clusters/yyy/nodePools/default-node-pool"
        name                        = "default-node-pool"
        # (11 unchanged attributes hidden)

      ~ node_config {
            tags                        = [
                "gke-staging",
                "gke-staging-default-node-pool",
                "default-node-pool",
            ]
            # (20 unchanged attributes hidden)

          - kubelet_config {
              - cpu_cfs_quota        = false -> null
              - pod_pids_limit       = 0 -> null
                # (2 unchanged attributes hidden)
            }

            # (2 unchanged blocks hidden)
        }

        # (5 unchanged blocks hidden)
    }

I've checked the code and node_config doesn't support kubelet_config as a dynamic block. I'm using version 31.1.0 of the private-cluster module.
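
For readers unfamiliar with the pattern, a dynamic kubelet_config block inside node_config might look roughly like the sketch below. This is a hypothetical illustration, not the module's actual source: the variable shape, the kubelet_keys local, and the cluster and location placeholders are all assumptions.

```hcl
# Hypothetical sketch only; not copied from the module.
variable "node_pools" {
  description = "Per-pool settings, loosely shaped like the module's node_pools input."
  type        = list(map(any))
  default = [
    {
      name           = "primary"
      cpu_cfs_quota  = false
      pod_pids_limit = 0
    },
  ]
}

locals {
  # Keys whose presence should cause a kubelet_config block to be emitted.
  kubelet_keys = ["cpu_manager_policy", "cpu_cfs_quota", "pod_pids_limit"]
}

resource "google_container_node_pool" "pools" {
  for_each = { for np in var.node_pools : np["name"] => np }

  name     = each.key
  cluster  = "example-cluster" # placeholder
  location = "us-central1"     # placeholder

  node_config {
    dynamic "kubelet_config" {
      # Emit the block only when the pool sets at least one kubelet key.
      for_each = length(setintersection(local.kubelet_keys, keys(each.value))) > 0 ? [1] : []

      content {
        cpu_manager_policy = lookup(each.value, "cpu_manager_policy", "")
        cpu_cfs_quota      = lookup(each.value, "cpu_cfs_quota", null)
        pod_pids_limit     = lookup(each.value, "pod_pids_limit", null)
      }
    }
  }
}
```

Without a block like this, values such as cpu_cfs_quota set in node_pools are never sent to the provider, which would be consistent with the plan wanting to remove kubelet_config.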

Edit: master works. I just replaced source with:

source = "git::https://github.com/terraform-google-modules/terraform-google-kubernetes-engine//modules/private-cluster?ref=master"

hernan82arg avatar Jul 30 '24 08:07 hernan82arg

Also happening for me:

Terraform v1.9.4
on darwin_amd64
+ provider registry.terraform.io/hashicorp/google v5.27.0
+ provider registry.terraform.io/hashicorp/kubernetes v2.31.0
+ provider registry.terraform.io/hashicorp/random v3.6.2
+ provider registry.terraform.io/hashicorp/tfe v0.54.0
+ provider registry.terraform.io/hashicorp/time v0.12.0

trenslow avatar Aug 05 '24 06:08 trenslow

Same here. After updating the Google Cloud provider from 5.30 to 5.42, I started seeing this error with module version 31.0. I updated the module to version 32, but it still failed. Adding the settings from Nickmman's earlier comment (cpu_cfs_quota = false and pod_pids_limit = 0 in node_pools) solved the issue.

rekiemfaxaf avatar Aug 23 '24 22:08 rekiemfaxaf

FWIW, for me, with v32.x, the permadiff eventually shifted to a diff of cpu_manager_policy, which was easier to solve by setting it to the valid, but undocumented, value of "". See this comment:

https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/issues/2013#issuecomment-2305452939
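
As a rough sketch of that workaround, assuming the module version in use passes a cpu_manager_policy key from node_pools through to kubelet_config (the local name and pool name here are placeholders):

```hcl
locals {
  # Pass this list to the module's node_pools input.
  node_pools = [
    {
      name = "primary" # placeholder pool name

      # Empty string: reported in this thread as accepted but undocumented.
      # Intended to match what the API returns so the plan stops showing a diff.
      cpu_manager_policy = ""
    },
  ]
}
```

Whether this clears the diff likely depends on the module and provider versions in use, so run a plan to confirm before applying.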

wyardley avatar Aug 23 '24 23:08 wyardley

Also ran into this with v33.02 of the private-cluster-update-variant module and TPG v5.44.0.

It fails to update the cluster:

Error: googleapi: Error 400: At least one of ['node_version', 'image_type', 'updated_node_pool', 'locations', 'workload_metadata_config', 'upgrade_settings', 'kubelet_config', 'linux_node_config', 'tags', 'taints', 'labels', 'node_network_config', 'gcfs_config', 'gvnic', 'confidential_nodes', 'logging_config', 'fast_socket', 'resource_labels', 'accelerators', 'windows_node_config', 'machine_type', 'disk_type', 'disk_size_gb', 'storage_pools', 'containerd_config', 'resource_manager_tags', 'performance_monitoring_unit', 'queued_provisioning', 'max_run_duration'] must be specified. 

Details:
[
  {
    "@type": "type.googleapis.com/google.rpc.RequestInfo",
    "requestId": "0x32be3a3a868d29d7"
  }
]
, badRequest

derhally avatar Sep 13 '24 21:09 derhally

I am also running into this issue. If anyone has a workaround I would appreciate it, as it's currently causing issues with our deployment.

  source  = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster"
  version = "~> 33.1"

  ~ resource "google_container_node_pool" "pools" {

        # (10 unchanged attributes hidden)

      ~ node_config {
            tags                        = [
            ]
            # (17 unchanged attributes hidden)

          + gcfs_config {
              + enabled = false
            }

          - kubelet_config {
              - cpu_cfs_quota                          = false -> null
              - insecure_kubelet_readonly_port_enabled = "TRUE" -> null
              - pod_pids_limit                         = 0 -> null
            }

            # (2 unchanged blocks hidden)
        }

        # (5 unchanged blocks hidden)
    }

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 6.6.0, < 7"
    }
  }
}

ghost avatar Oct 10 '24 13:10 ghost

Same issue here, permanent drift that fails on apply

      ~ node_config {
            tags                        = [
                "gke-gke-dr",
                "gke-gke-dr-t2d-16",
            ]
            # (20 unchanged attributes hidden)

          - kubelet_config {
              - cpu_cfs_quota                          = false -> null
              - insecure_kubelet_readonly_port_enabled = "FALSE" -> null
              - pod_pids_limit                         = 0 -> null
                # (2 unchanged attributes hidden)
            }

And the error it fails with:

│ Error: googleapi: Error 400: At least one of ['node_version', 'image_type', 'updated_node_pool', 'locations', 'workload_metadata_config', 'upgrade_settings', 'kubelet_config', 'linux_node_config', 'tags', 'taints', 'labels', 'node_network_config', 'gcfs_config', 'gvnic', 'confidential_nodes', 'logging_config', 'fast_socket', 'resource_labels', 'accelerators', 'windows_node_config', 'machine_type', 'disk_type', 'disk_size_gb', 'storage_pools', 'containerd_config', 'resource_manager_tags', 'performance_monitoring_unit', 'queued_provisioning', 'max_run_duration'] must be specified.
│ Details:
│ [
│   {
│     "@type": "type.googleapis.com/google.rpc.RequestInfo",
│     "requestId": "0x15856f1dc84fc347"
│   }
│ ]
│ , badRequest

LP0101 avatar Oct 10 '24 18:10 LP0101

@LP0101 that may be related to the issue described here and here (though in your case, it's false vs. true, so maybe not related to the new default, unless the API is now sometimes, in some places, returning both true / false all the time?).

Sounds like there may be a fix coming that will hopefully help with the apply failure. Separately, if / when #2082 ships, that should at least allow you to better match what's coming back from the API.

wyardley avatar Oct 10 '24 18:10 wyardley

Thanks for the links, @wyardley. Fingers crossed for a fix soon.

It all scans too: we have an older cluster that was imported into TF, and we don't observe the issue there, likely because the older node pools don't have the kubelet_config created on the API side.

LP0101 avatar Oct 10 '24 18:10 LP0101

Same for me on v33.0.3 of the private-cluster submodule and provider hashicorp/google v5.42.0.

RuiSMagalhaes avatar Oct 10 '24 19:10 RuiSMagalhaes

The following resolved the issue for me as a workaround at least:

❯ cat /tmp/kubelet-config
kubeletConfig:
  cpuManagerPolicy: ""

gcloud container node-pools update node-pool --cluster=gke-cluster --location=europe-west2 --project=my-project --system-config-from-file=/tmp/kubelet-config

ghost avatar Oct 10 '24 19:10 ghost

The latest 5.X and 6.X releases, 5.44.2 and 6.7.0, both include a fix for the recent kubeletConfig issues, see https://github.com/hashicorp/terraform-provider-google/issues/19792

6.8.0 should mitigate some of the cases where this error is returned in the future, after I merge https://github.com/GoogleCloudPlatform/magic-modules/pull/11978. It won't necessarily resolve the underlying issues and may just shift the error, probably to a permadiff, but it will at least stop masking them.
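
A minimal sketch of pinning the provider to a release that includes the fix, assuming the 6.x line (on 5.x, a constraint such as ">= 5.44.2, < 6" would be the analogue):

```hcl
terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
      # 6.7.0 is the first 6.x release named above as containing the kubeletConfig fix.
      version = ">= 6.7.0, < 7"
    }
  }
}
```

After changing the constraint, running terraform init -upgrade (or tofu init -upgrade) selects a provider version that satisfies it.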

rileykarson avatar Oct 14 '24 19:10 rileykarson

6.7.0 worked. Thanks

michel-numan avatar Oct 15 '24 08:10 michel-numan