terraform-provider-google

GKE AutoPilot Failure For Node Count

Open kylekurz opened this issue 3 years ago • 11 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

Terraform v1.1.0

Affected Resource(s)

google_container_cluster

Terraform Configuration Files

provider "google" {
  project = var.project_id
  region  = var.region
}

resource "google_container_cluster" "primary" {
  name             = "${var.project_id}-gke"
  location         = var.region
  enable_autopilot = true
}

Debug Output

Panic Output

Expected Behavior

GKE AutoPilot cluster should spin up correctly

Actual Behavior

Terraform throws the following error:

│ Error: googleapi: Error 400: Max pods constraint on node pools for Autopilot clusters should be 32., badRequest
│ 
│   with module.gke-cluster.google_container_cluster.primary,
│   on gke-cluster/main.tf line 10, in resource "google_container_cluster" "primary":
│   10: resource "google_container_cluster" "primary" {

Steps to Reproduce

  1. terraform apply

Important Factoids

Provider version 4.3.0 works as expected, but I couldn't see anything obvious when glancing at the diff. It seems likely to be related to max_pods_constraint, but to my untrained eye everything touching that looks like Azure or AWS code.

kylekurz avatar Dec 21 '21 19:12 kylekurz

Could you provide some debug logs? It's not exactly clear from my perspective what the issue is.

You can get these by setting the environment variable TF_LOG to debug. In particular it would be useful to see what we are using to call the api.

Is the configuration you provided complete?

ScottSuarez avatar Dec 21 '21 23:12 ScottSuarez

Logs are here: https://gist.github.com/kylekurz/45d872721ed58e2b6d4ff70f76b26e0c

The configuration provided above is all that is needed to trigger this, if you're on provider version 4.5.0. If I back the provider down to 4.3.0, it works as expected.

kylekurz avatar Dec 22 '21 14:12 kylekurz

I just tested 4.4.0 too, that has the same error case. So something in the upgrade from 4.3.0 -> 4.4.0 breaks this, it's not new in 4.5.0.

kylekurz avatar Dec 22 '21 14:12 kylekurz

Having the same issue. Downgrading to 4.3.0 works as a workaround.

sashokbg avatar Dec 23 '21 11:12 sashokbg

We are aware of the issue and there is a related pull request in the works

https://github.com/GoogleCloudPlatform/magic-modules/pull/5540

ScottSuarez avatar Dec 28 '21 18:12 ScottSuarez

Is there any workaround in the meantime while the PR is merged & released?

Kukunin avatar Jan 27 '22 20:01 Kukunin

> Is there any workaround in the meantime while the PR is merged & released?

The workaround is to pin the provider version so that the latest releases are not used.

    terraform {
      required_providers {
        google = {
          source  = "hashicorp/google"
          version = "4.3.0"
        }
      }
    }

grrywlsn avatar Jan 28 '22 09:01 grrywlsn

An alternative workaround is to set the ip_allocation_policy. Could even be empty like so:

resource "google_container_cluster" "primary" {
  name             = "${var.project_id}-gke"
  location         = var.region
  ip_allocation_policy {
  }
  enable_autopilot = true
}

c2thorn avatar Jan 28 '22 18:01 c2thorn

> An alternative workaround is to set the ip_allocation_policy. Could even be empty like so:
>
> resource "google_container_cluster" "primary" {
>   name             = "${var.project_id}-gke"
>   location         = var.region
>   ip_allocation_policy {
>   }
>   enable_autopilot = true
> }

Works with Pulumi too:

  ipAllocationPolicy: {},

Thanks!

mzavaletavargas avatar Mar 28 '22 05:03 mzavaletavargas

+1

loeffel-io avatar Jun 01 '22 11:06 loeffel-io

Even a minor version increase fails with the error posted above.

I just tried to build an autopilot cluster with providers v4.36.0.

terraform {
  required_version = "~> 1.2.9"
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 4.36.0"
    }
    google-beta = {
      source  = "hashicorp/google-beta"
      version = "~> 4.36.0"
    }
  }
}

Dialed it back to v4.3.0 and it works.

todd-dsm avatar Sep 19 '22 02:09 todd-dsm

Ran into this problem too as soon as I started testing with Autopilot. Surprised it hasn't been fixed after so long.

meteatamel avatar Oct 24 '22 12:10 meteatamel

Is this still not fixed?

arueth avatar Jan 19 '23 19:01 arueth

Adding this inside the google_container_cluster resource fixed it for our team

resource "google_container_cluster" "foo" {
  ...
  
  ip_allocation_policy {
    cluster_secondary_range_name  = "pod-range"
    services_secondary_range_name = "service-range"
  }
}

NFollett89 avatar Jan 25 '23 21:01 NFollett89

An even simpler workaround that sets the networks to their defaults:

  ip_allocation_policy {
    cluster_ipv4_cidr_block  = ""
    services_ipv4_cidr_block = ""
  }

jonaseck2 avatar Jan 31 '23 15:01 jonaseck2

I can confirm that on 4.56.0 you still need a workaround; I'm currently using an empty ip_allocation_policy block as suggested above.

kylekurz avatar Mar 13 '23 13:03 kylekurz

> I can confirm that on 4.56.0, you still need to use a workaround, currently using an empty ip_allocation_policy block as suggested above.

As of 4.59.0 the issue still persists, and an empty ip_allocation_policy block still fixes it.

siikanen avatar Mar 31 '23 14:03 siikanen

This is still an issue on v4.60.x; I am hitting it as well.

muthukumars avatar Apr 08 '23 04:04 muthukumars

To be more precise: 4.60.2.

muthukumars avatar Apr 08 '23 05:04 muthukumars

> Even simpler workaround to set networks to defaults:
>
>   ip_allocation_policy {
>     cluster_ipv4_cidr_block  = ""
>     services_ipv4_cidr_block = ""
>   }

Ran into this issue earlier; this workaround worked for me. Strange, but congratulations on finding it.

AeroNotix avatar Apr 19 '23 13:04 AeroNotix

Version 4.63 still needs the workaround. Why is it taking so long?

greenozon avatar May 02 '23 20:05 greenozon

Hey folks, a fix has just been committed for this issue. Thanks for your patience!!

The change will be included in the 4.72.0 provider release, barring any reverts or speed bumps.
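
For anyone who wants to drop the workarounds once that release is out, a minimal sketch could look like the following. It assumes the fix actually ships in 4.72.0 as announced (not verified here) and reuses the resource from the original report; the `>= 4.72.0` constraint is an illustrative choice, not an official recommendation.

```hcl
terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
      # Assumption: the fix landed in 4.72.0 as announced in this thread.
      version = ">= 4.72.0"
    }
  }
}

# Same minimal Autopilot cluster as in the original report,
# without the ip_allocation_policy workaround.
resource "google_container_cluster" "primary" {
  name             = "${var.project_id}-gke"
  location         = var.region
  enable_autopilot = true
}
```

If the error still appears on that version, keeping the empty ip_allocation_policy block in place should remain harmless.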

ScottSuarez avatar Jun 26 '23 19:06 ScottSuarez

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions[bot] avatar Jul 27 '23 02:07 github-actions[bot]