terraform-provider-google

GKE autopilot is always created with default service account II

Open tSte opened this issue 3 years ago • 20 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

This is a duplicate of https://github.com/hashicorp/terraform-provider-google/issues/8918 (see https://github.com/hashicorp/terraform-provider-google/issues/8918#issuecomment-869990917). Sorry for creating this, but I don't seem to have the rights to re-open the original issue, and there doesn't seem to be any activity there.

tSte avatar Jul 05 '21 08:07 tSte

@slevenick is there any update on this?

tSte avatar Aug 11 '21 12:08 tSte

I'm not sure how to proceed with this. This bug is due to a weird interaction between autopilot & the default service account field.

Basically, the API is not respecting the request that is sent with the service account. I'm not sure how gcloud is setting up the autopilot cluster with a non-default service account successfully. Can you capture the HTTP requests to see if that is happening in a single request, or if there is a later update to apply the service account?

slevenick avatar Aug 17 '21 17:08 slevenick

Hi, I ran into the same problem.

@slevenick Is there any update on this subject?

Best regards.

lrk avatar Sep 06 '21 08:09 lrk

Sorry for the late answer @slevenick, I was on vacation...

I executed this:

gcloud container --project "hmplayground" clusters create-auto "my-cluster" --region "europe-west3" --release-channel "regular" --network "projects/hmplayground/global/networks/my-vpc" --subnetwork "projects/hmplayground/regions/europe-west3/subnetworks/my-subnet" --cluster-secondary-range-name="my-pods" --services-secondary-range-name="my-services" --enable-master-authorized-networks --enable-private-nodes --master-ipv4-cidr="172.16.0.16/28" --service-account="[email protected]" --scopes="logging-write,monitoring,storage-ro" --log-http

This is the request:

==== request start ====
uri: https://container.googleapis.com/v1/projects/hmplayground/locations/europe-west3/clusters?alt=json
method: POST
== headers start ==
b'X-Goog-User-Project': b'hmplayground'
b'accept': b'application/json'
b'accept-encoding': b'gzip, deflate'
b'authorization': --- Token Redacted ---
b'content-length': b'926'
b'content-type': b'application/json'
b'user-agent': b'google-cloud-sdk gcloud/344.0.0 command/gcloud.container.clusters.create-auto invocation-id/9db76483e82c490f9d34ad2fdffeda72 environment/None environment-version/None interactive/True from-script/False python/3.9.7 term/xterm-256color (Linux 5.13.13)'
== headers end ==
== body start ==
{"cluster": {"autopilot": {"enabled": true}, "ipAllocationPolicy": {"clusterSecondaryRangeName": "my-pods", "createSubnetwork": false, "servicesSecondaryRangeName": "my-services", "useIpAliases": true}, "masterAuthorizedNetworksConfig": {"enabled": true}, "name": "my-cluster", "network": "projects/hmplayground/global/networks/my-vpc", "nodePools": [{"config": {"oauthScopes": ["https://www.googleapis.com/auth/devstorage.read_only", "https://www.googleapis.com/auth/logging.write", "https://www.googleapis.com/auth/monitoring"], "serviceAccount": "[email protected]"}, "initialNodeCount": 1, "name": "default-pool"}], "privateClusterConfig": {"enablePrivateNodes": true, "masterIpv4CidrBlock": "172.16.0.16/28"}, "releaseChannel": {"channel": "REGULAR"}, "subnetwork": "projects/hmplayground/regions/europe-west3/subnetworks/my-subnet"}, "parent": "projects/hmplayground/locations/europe-west3"}
== body end ==
==== request end ====
---- response start ----
status: 200
-- headers start --
-content-encoding: gzip
alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
cache-control: private
content-length: 446
content-type: application/json; charset=UTF-8
date: Tue, 14 Sep 2021 14:03:39 GMT
server: ESF
transfer-encoding: chunked
vary: Origin, X-Origin, Referer
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-xss-protection: 0
-- headers end --
-- body start --
{
  "name": "operation-1631628219731-15754d1b",
  "zone": "europe-west3",
  "operationType": "CREATE_CLUSTER",
  "status": "RUNNING",
  "selfLink": "https://container.googleapis.com/v1/projects/306799302406/locations/europe-west3/operations/operation-1631628219731-15754d1b",
  "targetLink": "https://container.googleapis.com/v1/projects/306799302406/locations/europe-west3/clusters/my-cluster",
  "startTime": "2021-09-14T14:03:39.731893675Z"
}

-- body end --
total round trip time (request+response): 4.417 secs
---- response end ----

tSte avatar Sep 14 '21 14:09 tSte

Hi, I ran into the same issue: not being able to assign a custom service account to an Autopilot GKE cluster with Terraform v1.0.1.

@slevenick Is there any update on this subject?

Regards, C.

cvega77 avatar Oct 25 '21 19:10 cvega77

Hi, any update on this bug? I need to create an Autopilot cluster with a custom service account. With a gcloud command it works. I understand that the API call made by Terraform is different from the one made by gcloud, is that right? With the latest Terraform version I still have this issue. Regards

ngarv avatar Nov 24 '21 09:11 ngarv

@nilsoulinou I created the GKE cluster via the gcloud CLI and then imported it into the Terraform configuration. This works.

@venkykuberan @slevenick is this still considered active?

tSte avatar Dec 01 '21 18:12 tSte

@tSte are you saying that you can't create a GKE cluster in Autopilot mode with a non-default service account directly with the Google provider, and that you have to create it with a gcloud command and then import it with Terraform?

If so, I think this issue is still active, because I expect this to be possible with Terraform without manual steps.

lrk avatar Dec 06 '21 10:12 lrk

@lrk you're right: all of our clusters are currently created via the gcloud CLI and then imported and managed via TF.
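
A minimal sketch of that gcloud-then-import workflow, with placeholder project, cluster, and service-account names (the gcloud and terraform import commands are shown as comments):

# Hypothetical names throughout; adjust to your project.
# 1. Create the Autopilot cluster with a custom service account via gcloud:
#      gcloud container clusters create-auto my-cluster --region europe-west3 \
#        --service-account gke-nodes@my-project.iam.gserviceaccount.com
# 2. Import it into the Terraform state:
#      terraform import google_container_cluster.autopilot \
#        projects/my-project/locations/europe-west3/clusters/my-cluster
resource "google_container_cluster" "autopilot" {
  name             = "my-cluster"
  location         = "europe-west3"
  enable_autopilot = true
  # The remaining arguments must mirror the imported cluster to avoid spurious diffs.
}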

tSte avatar Dec 06 '21 17:12 tSte

Are there any updates on the ability to use a non-default SA to provision an Autopilot GKE cluster?

sandy-0007 avatar Jan 27 '22 06:01 sandy-0007

The issue occurs because Terraform uses a deprecated field to set the service account, while the API no longer respects this field when the cluster type is Autopilot.

The following payload to the API will create the cluster successfully:

{
    "cluster": {
        "autopilot": {
            "enabled": true
        },
        "binaryAuthorization": {
            "enabled": false
        },
        "ipAllocationPolicy": {
            "clusterSecondaryRangeName": "cluster-1",
            "servicesSecondaryRangeName": "service-1",
            "useIpAliases": true
        },
        "legacyAbac": {
            "enabled": false
        },
        "maintenancePolicy": {
            "window": {}
        },
        "masterAuthorizedNetworksConfig": {
            "cidrBlocks": [
                {
                    "cidrBlock": "172.16.0.0/16"
                }
            ],
            "enabled": true
        },
        "name": "gke-cluster",
        "network": "projects/network-host-0372/global/networks/production",
        "networkConfig": {
            "datapathProvider": "ADVANCED_DATAPATH",
            "enableIntraNodeVisibility": true
        },
        "nodePools":[
         {
            "config":{
               "oauthScopes":[
                  "https://www.googleapis.com/auth/devstorage.read_only",
                  "https://www.googleapis.com/auth/logging.write",
                  "https://www.googleapis.com/auth/monitoring"
               ],
               "serviceAccount":"[email protected]"
            },
            "initialNodeCount":1,
            "name":"default-pool"
         }
        ],
        "privateClusterConfig": {
            "enablePrivateEndpoint": true,
            "enablePrivateNodes": true,
            "masterGlobalAccessConfig": {
                "enabled": true
            },
            "masterIpv4CidrBlock": "10.128.65.0/28"
        },
        "shieldedNodes": {
            "enabled": true
        },
        "subnetwork": "projects/network-host-0372/regions/europe-west3/subnetworks/node-1"
    }
}

However, Terraform generates the following payload:

{
 "cluster": {
  "autopilot": {
   "enabled": true
  },
  "binaryAuthorization": {
   "enabled": false
  },
  "ipAllocationPolicy": {
   "clusterSecondaryRangeName": "cluster-1",
   "servicesSecondaryRangeName": "service-1",
   "useIpAliases": true
  },
  "legacyAbac": {
   "enabled": false
  },
  "maintenancePolicy": {
   "window": {}
  },
  "masterAuthorizedNetworksConfig": {
   "cidrBlocks": [
    {
     "cidrBlock": "172.16.0.0/16"
    }
   ],
   "enabled": true
  },
  "name": "gke-cluster",
  "network": "projects/network-host-0372/global/networks/production",
  "networkConfig": {
   "datapathProvider": "ADVANCED_DATAPATH",
   "enableIntraNodeVisibility": true
  },
  "nodeConfig": {
   "oauthScopes": [
    "https://www.googleapis.com/auth/monitoring",
    "https://www.googleapis.com/auth/devstorage.read_only",
    "https://www.googleapis.com/auth/logging.write"
   ],
   "serviceAccount": "[email protected]"
  },
  "privateClusterConfig": {
   "enablePrivateEndpoint": true,
   "enablePrivateNodes": true,
   "masterGlobalAccessConfig": {
    "enabled": true
   },
   "masterIpv4CidrBlock": "10.128.65.0/28"
  },
  "shieldedNodes": {
   "enabled": true
  },
  "subnetwork": "projects/network-host-0372/regions/europe-west3/subnetworks/node-1"
 }
}

The difference between these two is that the latter uses the nodeConfig property, which is already deprecated, while the former uses nodePools[].config. Apparently Autopilot does not recognise the deprecated property, although this is not documented.

Perhaps the Terraform provider should move away from the deprecated property to avoid not only this issue but also any future ones @slevenick. There is already a TODO item here for that :)

cagataygurturk avatar Feb 14 '22 12:02 cagataygurturk

Thinking about this a little bit more, I believe the API should not simply ignore the field although it is deprecated. I have also created an issue https://issuetracker.google.com/issues/219237911. Impacted people may consider starring the issue.

cagataygurturk avatar Feb 14 '22 12:02 cagataygurturk

@slevenick: Updating assignment because I think this has gone inactive, please correct this if you're still working on it!

Perhaps the Terraform provider should move away from the deprecated property to avoid not only this issue but also any future ones @slevenick. There is already a TODO item here for that :)

The TODO in that file was for another tool that the MM generator used to be used for; Terraform's implementation is handwritten. https://github.com/hashicorp/terraform-provider-google/issues/7185 and https://github.com/hashicorp/terraform-provider-google/issues/4963 (roughly) track potential removal of the field. We haven't gone forward with it because of the projected impact (requiring users to rewrite configs, and recreating their clusters if they get it wrong) and the lack of signal from the API that they'll actually remove the field.

The API respecting the service account in one case and not the other is confusing and frustrating, as both of those messages should have created the same cluster; thanks for filing upstream. Luckily, I think there's a workaround in the provider today, as you should be able to create clusters with node_pools set. We're passing that message directly on to the API, and the transformation to config highlighted in https://github.com/hashicorp/terraform-provider-google/issues/4963#issuecomment-557268286 should make it possible to produce a working payload.
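
For reference, a minimal sketch of that node_pools-based workaround, with placeholder names; note that, as later comments in this thread point out, some provider versions reject node_pool in combination with enable_autopilot, so this may require a patched provider:

resource "google_container_cluster" "autopilot" {
  name             = "my-cluster"
  location         = "europe-west3"
  enable_autopilot = true

  # Declaring the pool here maps to nodePools[].config in the CreateCluster
  # request (the field the API respects), rather than the deprecated nodeConfig.
  node_pool {
    name               = "default-pool"
    initial_node_count = 1

    node_config {
      # Placeholder service account.
      service_account = "gke-nodes@my-project.iam.gserviceaccount.com"
      oauth_scopes    = ["https://www.googleapis.com/auth/cloud-platform"]
    }
  }
}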

rileykarson avatar Mar 11 '22 01:03 rileykarson

Hi all, the underlying API issue seems to be resolved according to this comment:

https://issuetracker.google.com/issues/219237911#comment3

If someone can confirm that on Terraform side this also fixed the issue, then this one can be closed.

cagataygurturk avatar Mar 17 '22 09:03 cagataygurturk

Hi all, I'm new to this community. It seems that the bug has been fixed. Could you tell me which Terraform release or Google provider version to use in order to test a custom SA for GKE Autopilot?

Regards

Nils

ngarv avatar Mar 24 '22 15:03 ngarv

Hi, I still have the default service account attached to the GKE Cluster with these versions:

Terraform v1.1.7 on linux_amd64

  • provider registry.terraform.io/hashicorp/google v4.15.0
  • provider registry.terraform.io/hashicorp/google-beta v4.15.0

with the following terraform block:

resource "google_container_cluster" "private" {
 name                     = "XXXXX"
 location                 = var.region

 network                  = google_compute_network.xxxx.id
 subnetwork               = google_compute_subnetwork.xxxx.id

 node_config {
   service_account = google_service_account.yyy.email
   oauth_scopes    = [
     "https://www.googleapis.com/auth/cloud-platform"
   ]
 }

 private_cluster_config {
   enable_private_endpoint = true
   enable_private_nodes    = true
   master_ipv4_cidr_block  = "XXX.XXX.XXX.XXX/28"
 }

 master_authorized_networks_config {
   cidr_blocks {
     cidr_block = "XXX.XXX.XXX.XXX/24"
     display_name = "xxxx" 
   }
   cidr_blocks {
     cidr_block = "XXX.XXX.XXX.XXX/16"
     display_name = "xxxx" 
   }
 }

 # Enable Autopilot for this cluster
 enable_autopilot = true

 vertical_pod_autoscaling {
   enabled = true
 }
 # Configuration of cluster IP allocation for VPC-native clusters
 ip_allocation_policy {
   cluster_ipv4_cidr_block  = "XXX.XXX.XXX.XXX/16"
   services_ipv4_cidr_block = "XXX.XXX.XXX.XXX/24"
 }

 # Configuration options for the Release channel feature, which provide more control over automatic upgrades of your GKE clusters.
 release_channel {
   channel = "REGULAR"
 }
}

Should I provide any additional information?

Nils

ngarv avatar Mar 29 '22 15:03 ngarv

If you feel the issue was not fixed, please drop a comment to https://issuetracker.google.com/issues/219237911#comment3

cagataygurturk avatar Mar 29 '22 16:03 cagataygurturk

I've recently run into this issue myself. Below are my findings.

Terraform v1.1.5 on darwin_amd64

  • provider registry.terraform.io/hashicorp/external v2.2.2
  • provider registry.terraform.io/hashicorp/google v4.22.0
  • provider registry.terraform.io/hashicorp/google-beta v4.22.0
  • provider registry.terraform.io/hashicorp/kubernetes v2.11.0
  • provider registry.terraform.io/hashicorp/null v3.1.1
  • provider registry.terraform.io/hashicorp/random v3.2.0

As in https://github.com/hashicorp/terraform-provider-google/issues/9505#issuecomment-1039029610, I noticed the payload being generated for a new Autopilot cluster was the following:

POST /v1beta1/projects/{project_id}/locations/us-west1/clusters?alt=json&prettyPrint=false HTTP/1.1
Host: container.googleapis.com
...

{
 "cluster": {
  "addonsConfig": {
   "horizontalPodAutoscaling": {
    "disabled": false
   },
   "httpLoadBalancing": {
    "disabled": false
   }
  },
  "autopilot": {
   "enabled": true
  },
  "binaryAuthorization": {
   "enabled": false
  },
  "ipAllocationPolicy": {
   "clusterSecondaryRangeName": "network-pods",
   "servicesSecondaryRangeName": "network-services",
   "useIpAliases": true
  },
  "legacyAbac": {
   "enabled": false
  },
  "locations": [
   "us-west1-a",
   "us-west1-b",
   "us-west1-c"
  ],
  "loggingService": "logging.googleapis.com/kubernetes",
  "maintenancePolicy": {
   "window": {
    "dailyMaintenanceWindow": {
     "startTime": "05:00"
    }
   }
  },
  "masterAuth": {
   "clientCertificateConfig": {}
  },
  "masterAuthorizedNetworksConfig": {},
  "monitoringService": "monitoring.googleapis.com/kubernetes",
  "name": "us-west1-dev-autopilot-test",
  "network": "projects/{project_id}/global/networks/anthos-network",
  "networkConfig": {
   "defaultSnatStatus": {
    "disabled": false
   },
   "enableIntraNodeVisibility": true
  },
  "nodeConfig": {
   "oauthScopes": [
    "https://www.googleapis.com/auth/devstorage.read_only",
    "https://www.googleapis.com/auth/logging.write",
    "https://www.googleapis.com/auth/monitoring",
    "https://www.googleapis.com/auth/service.management.readonly",
    "https://www.googleapis.com/auth/servicecontrol",
    "https://www.googleapis.com/auth/trace.append"
   ]
  },
  "notificationConfig": {
   "pubsub": {}
  },
  "releaseChannel": {
   "channel": "REGULAR"
  },
  "shieldedNodes": {
   "enabled": true
  },
  "subnetwork": "projects/{project_id}/regions/us-west1/subnetworks/anthos-subnet",
  "verticalPodAutoscaling": {
   "enabled": true
  }
 }
}

Looking at the documentation for creating an Autopilot cluster at [1], it lists the command to be used as

gcloud container clusters create-auto CLUSTER_NAME \
    --region REGION \
    --project=PROJECT_ID

So that means the Terraform provider is using the default cluster creation API [2], which doesn't list any flags to specify Autopilot, when it should be using [3] instead. I've verified that the following command will create an Autopilot cluster with the correct service account.

gcloud container --project {project_id} clusters create-auto autopilot-test \
--region=us-west1 \
--release-channel=regular \
--service-account=cluster-admin@{project_id}.iam.gserviceaccount.com \
--network=test-network \
--subnetwork=test-subnet \
--cluster-secondary-range-name=network-pods \
--services-secondary-range-name=network-services 

While I see that there is discussion of a deprecation at [4], it seems like a quicker solution may be to use the API specified in [3], which currently works.

[1] https://cloud.google.com/kubernetes-engine/docs/how-to/creating-an-autopilot-cluster#gcloud
[2] https://cloud.google.com/sdk/gcloud/reference/container/clusters/create
[3] https://cloud.google.com/sdk/gcloud/reference/container/clusters/create-auto
[4] https://issuetracker.google.com/issues/219237911?pli=1

deekthesqueak avatar May 30 '22 04:05 deekthesqueak

Any update on this?

X4mp avatar Jul 20 '22 08:07 X4mp

@rileykarson - happy to take this on ... I think I have the answer

mastersingh24 avatar Aug 05 '22 19:08 mastersingh24

Why don't we revert https://github.com/GoogleCloudPlatform/magic-modules/pull/4894 but apply the same workaround to node_pool instead? I have already tested this approach and it works. I know it is not optimal, but at least it makes things work. A PR is in progress, based on the excellent comment https://github.com/hashicorp/terraform-provider-google/issues/9505#issuecomment-1039029610:

node_pool {
  name               = "default-pool"
  initial_node_count = 1

  node_config {
    service_account = "[email protected]"
    oauth_scopes    = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }
}

modax avatar Sep 28 '22 15:09 modax

After struggling with this for some time today, I believe I've found the key: the request must define one node pool with the name "default-pool". Using any other name results in getting a different "default-pool" configured with Autopilot defaults.

Other findings:

  1. Setting autoscaling.autoprovisioning_node_pool_defaults.service_account is allowed by the API but seems to do nothing.
  2. The above method (defining a single node pool named "default-pool") is sufficient to affect the resulting cluster's settings for all three of autoscaling.autoprovisioning_node_pool_defaults.service_account, the deprecated node_config.service_account, and (unsurprisingly) node_pools[0].config.service_account.
  3. The GKE API rejects setting autoscaling_profile for Autopilot clusters only if it's not set to BALANCED.

This took me forever to find because I'm using the google-cloud-provided gke module that uses "default-node-pool" as the name of the singular default nodepool.

ghost avatar Oct 12 '22 18:10 ghost

I was able to terraform an Autopilot GKE cluster by removing the conflict between enable_autopilot and node_pools. Empirically, the resource_limits and autoscaling_profile subsettings do in fact conflict with enable_autopilot, so I pushed the conflict down to those.

This is the resulting patch: https://github.com/bukzor-sentryio/terraform-provider-google-beta/commit/pr-9505-autopilot-with-nodepools

While that "worked" the resulting diff behavior is entirely borked. I have currently two clusters terraformed, and terraform-plan wants to tear down the one that has the correct service account (because node_pools[0].metadata changed) and it believes the other cluster with the wrong service account needs no changes.

ghost avatar Oct 12 '22 21:10 ghost

Unfortunately, the Google issue was closed as Won't Fix. It seems the only way is to fix this on the Terraform side.

cagataygurturk avatar Oct 18 '22 09:10 cagataygurturk

Hey all, the correct solution here is to pass cluster_autoscaling.auto_provisioning_defaults.service_account, which is the Autopilot-friendly way to pass service accounts.

You don't have control over node pools in Autopilot (and there may not even be one at first), so passing the service account via node_pools no longer makes sense. The API does partially support passing it via the default pool for legacy reasons, as this group has discovered, but it's not a great approach and won't work nicely with Terraform.
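
A minimal sketch of that approach, with placeholder names, assuming a provider release that includes the fix discussed later in this thread:

resource "google_container_cluster" "autopilot" {
  name             = "my-cluster"
  location         = "europe-west3"
  enable_autopilot = true

  cluster_autoscaling {
    auto_provisioning_defaults {
      # Placeholder service account; applied to the nodes Autopilot provisions.
      service_account = "gke-nodes@my-project.iam.gserviceaccount.com"
      oauth_scopes    = ["https://www.googleapis.com/auth/cloud-platform"]
    }
  }
}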

I'll take the fix @rileykarson @mastersingh24

JeremyOT avatar Oct 22 '22 07:10 JeremyOT

@JeremyOT - yeah - I've had something in the works, but was waiting to see how things played out. Here's a draft PR: https://github.com/GoogleCloudPlatform/magic-modules/pull/6732

mastersingh24 avatar Oct 22 '22 10:10 mastersingh24

@mastersingh24 Ok cool - I did something similar but didn't add conflicts on the CA subfields, and defaulted CA.enabled=true when autopilot is enabled and no value is supplied. Both work, I have no real preference. Trying to get it out there before kubecon kicks off

GoogleCloudPlatform/magic-modules#6733

JeremyOT avatar Oct 22 '22 17:10 JeremyOT

Let's push yours through @JeremyOT, it looks good and I was going to add the defaults as well.

mastersingh24 avatar Oct 22 '22 19:10 mastersingh24

@JeremyOT, @mastersingh24 Fair enough, your approach looks better. If @JeremyOT's PR works, setting cluster_autoscaling.auto_provisioning_defaults.service_account makes much more sense than messing with node_pools, which I did in https://github.com/GoogleCloudPlatform/magic-modules/pull/6611. Not that I'm a contributor, but I have added a few comments on https://github.com/GoogleCloudPlatform/magic-modules/pull/6733.

modax avatar Oct 22 '22 20:10 modax

I don't think this is fixed. I've built the provider with #13024 and am trying to provision an Autopilot cluster. We'd previously deleted the default GCE SA from the project entirely, and get

Error: googleapi: Error 400: Service account "[email protected]" does not exist., badRequest

even when specifying a custom SA.

mgoodness avatar Nov 16 '22 22:11 mgoodness