Adding `google_compute_resource_policy` to existing instance fails with wrong service account

Open rhoriguchi opened this issue 1 year ago • 4 comments

Terraform Version

Terraform v1.7.2-dev
on darwin_arm64

Your version of Terraform is out of date! The latest version
is 1.7.3. You can update by downloading from https://www.terraform.io/downloads.html

Affected Resource(s)

  • google_compute_resource_policy
  • google_compute_instance

Terraform Configuration

...
    "google_compute_instance": {
      "vm_name": {
        ...
        "resource_policies": [
          "${google_compute_resource_policy.vm_name-scheduling-policy.id}"
        ],
        ...
      }
    },
    "google_compute_resource_policy": {
      "vm_name-scheduling-policy": {
        "//": {
          "metadata": {
            "path": "project/vm_name-service/vm_name-scheduling-policy",
            "uniqueId": "vm_name-scheduling-policy"
          }
        },
        "instance_schedule_policy": {
          "time_zone": "Europe/Zurich",
          "vm_start_schedule": {
            "schedule": "0 7 * * MON-FRI"
          },
          "vm_stop_schedule": {
            "schedule": "0 19 * * *"
          }
        },
        "name": "vm_name-scheduling-policy",
        "region": "europe-west6"
      }
    },
...

Debug Output

Error: Error adding resource policies: googleapi: Error 412: Compute Engine System service account service-<project_number>@compute-system.iam.gserviceaccount.com needs to have [compute.instances.start,compute.instances.stop] permissions applied in order to perform this operation., conditionNotMet

  with google_compute_instance.vm_name (vm_name-service/vm_name),
  on cdk.tf.json line 314, in resource.google_compute_instance.vm_name (vm_name-service/vm_name):
 314:       }

Expected Behavior

Add the resource policy to the existing compute instance.

Actual Behavior

Adding the resource policy to the existing compute instance fails. The operation uses the project's default compute service account (the one named in the error above), although the plan is executed locally with a custom service account. Everything else is executed with the custom service account (not on a compute instance). Why is the default compute service account used? I'm aware that adding a policy recreates the instance.

Steps to reproduce

  1. terraform apply

Important Factoids

No response

References

https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_resource_policy

rhoriguchi avatar Feb 13 '24 14:02 rhoriguchi

@rhoriguchi can you share the complete config with before and after the update as well as your debug log?

The error complains about permissions. Did you check whether that account has the permissions mentioned? Reading the error, it seems this service account is used to stop and restart the instance when certain config changes are applied.

googleapi: Error 412: Compute Engine System service account service-<project_number>@compute-system.iam.gserviceaccount.com needs to have [compute.instances.start,compute.instances.stop] permissions applied in order to perform this operation., conditionNotMet

The error may not be limited to the change mentioned in the subject. You may see the same error for other changes that require a machine reboot.

edwardmedia avatar Feb 13 '24 15:02 edwardmedia

Sure thing @edwardmedia, I've created a test deployment. I've tried several combinations of it in different projects. Same behavior when trying to attach it to an existing VM. The service account used for the deployment has Compute Instance Admin (v1) as mentioned in the docs. The service account mentioned in the error is not the one used for the deployment, but the compute engine default service account.

Log File

TypeScript
import { ComputeInstance } from '@cdktf/provider-google/lib/compute-instance';
import { ComputeResourcePolicy } from '@cdktf/provider-google/lib/compute-resource-policy';
import { Construct } from 'constructs';
import { GoogleProvider } from '@cdktf/provider-google/lib/provider';
import { TerraformStack } from 'cdktf';

export class TestDeployment extends TerraformStack {
  constructor(scope: Construct) {
    super(scope, 'test-deployment');

    new GoogleProvider(this, 'google', {
      credentials: process.env.GCP_SERVICE_ACCOUNT_CREDENTIALS,
      project: 'SOME-PROJECT',
    });

    const resourcePolicy = new ComputeResourcePolicy(this, 'test-policy', {
      name: 'test-policy',
      region: 'europe-west6',

      instanceSchedulePolicy: {
        timeZone: 'Europe/Zurich',

        vmStartSchedule: {
          schedule: '0 7 * * MON-FRI',
        },

        vmStopSchedule: {
          schedule: '0 19 * * *',
        },
      },
    });

    new ComputeInstance(this, 'test-vm', {
      name: 'test-vm',
      machineType: 'n2-standard-4',
      zone: 'europe-west6-a',

      bootDisk: {
        initializeParams: {
          image: 'debian-cloud/debian-11',
        },
      },

      networkInterface: [
        {
          network: 'default',
        },
      ],

      resourcePolicies: [resourcePolicy.id],
    });
  }
}
Terraform HCL
{
  "//": {
    "metadata": {
      "backend": "local",
      "stackName": "test-deployment",
      "version": "0.20.2"
    },
    "outputs": {
    }
  },
  "provider": {
    "google": [
      {
        "credentials": "XXXXXXXXXXXXX",
        "project": "SOME-PROJECT"
      }
    ]
  },
  "resource": {
    "google_compute_instance": {
      "test-vm": {
        "//": {
          "metadata": {
            "path": "test-deployment/test-vm",
            "uniqueId": "test-vm"
          }
        },
        "boot_disk": {
          "initialize_params": {
            "image": "debian-cloud/debian-11"
          }
        },
        "machine_type": "n2-standard-4",
        "name": "test-vm",
        "network_interface": [
          {
            "network": "default"
          }
        ],
        "resource_policies": [
          "${google_compute_resource_policy.test-policy.id}"
        ],
        "zone": "europe-west6-a"
      }
    },
    "google_compute_resource_policy": {
      "test-policy": {
        "//": {
          "metadata": {
            "path": "test-deployment/test-policy",
            "uniqueId": "test-policy"
          }
        },
        "instance_schedule_policy": {
          "time_zone": "Europe/Zurich",
          "vm_start_schedule": {
            "schedule": "0 7 * * MON-FRI"
          },
          "vm_stop_schedule": {
            "schedule": "0 19 * * *"
          }
        },
        "name": "test-policy",
        "region": "europe-west6"
      }
    }
  },
  "terraform": {
    "backend": {
      "local": {
        "path": "/PATH/terraform.test-deployment.tfstate"
      }
    },
    "required_providers": {
      "google": {
        "source": "google",
        "version": "5.13.0"
      }
    }
  }
}

rhoriguchi avatar Feb 14 '24 15:02 rhoriguchi

@rhoriguchi how many GCP projects are involved in your deployment? The account below is the Compute Engine Service Agent, which was created when you enabled the Compute Engine API on project 224845064652. Do the target resources reside in the same project?

service-224845064652@compute-system.iam.gserviceaccount.com

If not, what is the relationship between the projects? When working across projects, you need to set up the proper IAM bindings between them.

edwardmedia avatar Feb 14 '24 22:02 edwardmedia

We are using a service principal from another project that has Editor permissions on the project we are deploying to. Everything can be deployed with no issues.

However when adding a resource policy to an instance it tries to use the Compute Engine default service account (so the service account GCP creates by default) to restart the instance instead of the service principal we are using to deploy the resources.

So the only way to fix the issue currently would be granting this service account VM restart permissions, which we do not want for normal operations outside of the deployment. Why isn't the service principal used for the Terraform deployment also used when adding a resource policy to an instance?

rhoriguchi avatar Feb 19 '24 09:02 rhoriguchi

I would guess that what's happening here is that, under the hood, the Compute API is trying to use its default service account for the project (regardless of who the authenticated user is). You could likely confirm this by using gcloud to make the same POST request that Terraform is failing on; it should fail in the same way.

And if it doesn't, that would give more information about what is causing the failure.

melinath avatar Feb 29 '24 21:02 melinath

@rhoriguchi As I understand it, you want to add a resource policy to an existing compute instance without using the Compute Engine default service account.

Check that your ADC (Application Default Credentials) configuration is correct according to the Authentication section of https://registry.terraform.io/providers/hashicorp/google/latest/docs/guides/provider_reference, to make sure the intended service account is picked up rather than the default one. If no service account is declared there, the default is used.

Or specify it directly in the provider block like this:

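provider "google" {
  credentials = file("/path/to/your/keyfile.json")
  project     = "your-project-id"
  region      = "your-region"
  zone        = "your-zone"

  # Act as this service account for all API calls. The credentials above need
  # roles/iam.serviceAccountTokenCreator on it. The email is a placeholder.
  impersonate_service_account = "your-service-account@your-project-id.iam.gserviceaccount.com"
}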

ggtisc avatar Mar 01 '24 00:03 ggtisc

Following @melinath's suggestion to make the same POST request outside Terraform, I've tried to reproduce it and I'm getting exactly the same response using the API.

> curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token --impersonate-service-account=XXXXXXXXX@SOME-PROJECT.iam.gserviceaccount.com)" \
    -H 'Content-Type: application/json; charset=utf-8' \
    -d '{  "canIpForward": false,  "deletionProtection": false,  "disks": [   {    "autoDelete": true,    "boot": true,    "initializeParams": {     "sourceImage": "projects/debian-cloud/global/images/family/debian-11"    },    "mode": "READ_WRITE"   }  ],  "machineType": "projects/SOME-PROJECT/zones/europe-west6-a/machineTypes/n2-standard-4",  "metadata": {},  "name": "test-vm",  "networkInterfaces": [   {    "network": "projects/SOME-PROJECT/global/networks/default"   }  ],  "params": {},  "resourcePolicies": [   "projects/SOME-PROJECT/regions/europe-west6/resourcePolicies/test-policy"  ],  "scheduling": {   "automaticRestart": true  },  "tags": {} }' \
    'https://compute.googleapis.com/compute/v1/projects/SOME-PROJECT/zones/europe-west6-a/instances?alt=json&prettyPrint=false'

WARNING: This command is using service account impersonation. All API calls will be executed as [XXXXXXXXX@SOME-PROJECT.iam.gserviceaccount.com].
{
  "error": {
    "code": 412,
    "message": "Compute Engine System service account service-<project_number>@compute-system.iam.gserviceaccount.com needs to have [compute.instances.start,compute.instances.stop] permissions applied in order to perform this operation.",
    "errors": [
      {
        "message": "Compute Engine System service account service-<project_number>@compute-system.iam.gserviceaccount.com needs to have [compute.instances.start,compute.instances.stop] permissions applied in order to perform this operation.",
        "domain": "global",
        "reason": "conditionNotMet",
        "location": "If-Match",
        "locationType": "header"
      }
    ]
  }
}
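
The 412 message above names both the principal that lacks access and the permissions it needs. As a minimal sketch, assuming the message keeps the wording shown in this thread, the two can be pulled out with a regular expression. The helper name `parseConditionNotMet` and the project number `123456789012` are hypothetical and only serve the illustration.

```typescript
// Hypothetical helper: extracts the principal and the missing permissions
// from a Compute Engine 412 "conditionNotMet" message.
interface MissingPermissions {
  serviceAccount: string;
  permissions: string[];
}

function parseConditionNotMet(message: string): MissingPermissions | null {
  // Matches "... service account <email> needs to have [perm1,perm2] permissions ..."
  const match = message.match(
    /service account (\S+) needs to have \[([^\]]+)\] permissions/,
  );
  if (match === null) {
    return null;
  }
  return {
    serviceAccount: match[1],
    permissions: match[2].split(",").map((p) => p.trim()),
  };
}

// Sample message with a placeholder project number (assumption, not from the logs).
const sampleMessage =
  "Compute Engine System service account " +
  "service-123456789012@compute-system.iam.gserviceaccount.com " +
  "needs to have [compute.instances.start,compute.instances.stop] " +
  "permissions applied in order to perform this operation.";

console.log(parseConditionNotMet(sampleMessage));
```

If the principal returned here is not the account used for the deployment, the permission check is being made against a Google-managed account rather than against the caller.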

EDIT: Update anonymization of service account name to make it clearer

rhoriguchi avatar Mar 05 '24 11:03 rhoriguchi

@rhoriguchi it looks like in that example you're impersonating the compute service account. What I was speculating about was whether you still get an error message about the compute service account when you impersonate your custom deployment service account (as you did when using Terraform). Could you try that and report the results?

melinath avatar Mar 05 '24 16:03 melinath

@melinath sorry about that. While anonymizing the output I didn't keep the two accounts distinct. Please take a look at my previous comment, which I've updated: https://github.com/hashicorp/terraform-provider-google/issues/17260#issuecomment-1978571105

rhoriguchi avatar Mar 06 '24 10:03 rhoriguchi

Thanks! In that case, this doesn't seem to be a bug in the Terraform provider, just a thing about how the API works.

melinath avatar Mar 06 '24 16:03 melinath

Considering this a feature request for the service team to review. It seems that the provider is working as expected, but the configured service account is not used, which can cause unexpected behavior when working across multiple projects.

roaks3 avatar Mar 08 '24 15:03 roaks3

After spending 30 minutes on this same permissions issue, it became clear to me that the account from google_compute_default_service_account is not the SA we actually need. The one we need is the "Compute Engine Service Agent", which has the form service-<project_number>@compute-system.iam.gserviceaccount.com.

Ideally we need a data.google_compute_engine_service_agent source to get the right service account. Its name sounds so much like the "Compute Default Service Account" that it is likely to cause confusion. (Thanks Google).
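
Until such a data source exists, one possible workaround is to build the service agent address from the project number and grant it only the two permissions named in the error. The sketch below emits a Terraform JSON fragment in the same shape as the cdk.tf.json shown earlier. It assumes the google_project data source, google_project_iam_custom_role, and google_project_iam_member behave as documented; the function name and the resource names `target`, `schedule_start_stop`, and `agent_start_stop` are hypothetical, and nothing here has been verified against the reporter's project.

```typescript
// Builds a Terraform JSON fragment that grants the Compute Engine Service Agent
// a custom role containing only the start/stop permissions from the 412 error.
// All resource names and the role_id are hypothetical placeholders.
function serviceAgentGrant(projectId: string) {
  return {
    data: {
      google_project: {
        // Looked up only to obtain the numeric project number.
        target: { project_id: projectId },
      },
    },
    resource: {
      google_project_iam_custom_role: {
        schedule_start_stop: {
          project: projectId,
          role_id: "instanceScheduleStartStop",
          title: "Instance schedule start/stop",
          permissions: ["compute.instances.start", "compute.instances.stop"],
        },
      },
      google_project_iam_member: {
        agent_start_stop: {
          project: projectId,
          // Plain strings: the ${...} parts are Terraform interpolation,
          // not TypeScript template literals.
          role: "${google_project_iam_custom_role.schedule_start_stop.id}",
          member:
            "serviceAccount:service-${data.google_project.target.number}" +
            "@compute-system.iam.gserviceaccount.com",
        },
      },
    },
  };
}

console.log(JSON.stringify(serviceAgentGrant("SOME-PROJECT"), null, 2));
```

The instance that references the resource policy would probably also need a depends_on entry for the IAM member, so that the grant exists before the policy is attached.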

timwsuqld avatar Mar 11 '24 10:03 timwsuqld

I encountered the same problem as @timwsuqld and came to the same conclusion. Agreed that this is quite unclear.

hervedevos avatar Mar 19 '24 10:03 hervedevos

I also ran into the issue @timwsuqld describes while trying to launch a GCE instance by resizing the respective MIG. The MIG uses an instance template that applies a KMS CMEK from a different project for encrypting the boot disk.

The instance launch fails with a KMS permission-related error message, which is completely misleading: while Cloud Logging says that the principal <project_number>@cloudservices.gserviceaccount.com is missing KMS key permissions, granting those permissions to that principal changes nothing. Granting them to the current "Compute Default Service Account" (e.g. as obtained from the current data source) changes nothing either. The problem is resolved by granting the permissions to service-<project_number>@compute-system.iam.gserviceaccount.com.

All 3 types of default service accounts are described here: click. In a nutshell, there is:

  • A "Compute Engine default service account" (<project_number>-compute@developer.gserviceaccount.com) - this is the one currently extracted by the data source
  • A "Google APIs Service Agent" (<project_number>@cloudservices.gserviceaccount.com) - this is the one that is misleadingly reported by Cloud Logging
  • A "Compute Engine Service Agent" (service-<project_number>@compute-system.iam.gserviceaccount.com) - this is the correct one

Now, while it is perfectly possible to statically construct the service agent string, a shortcut (something like data.google_compute_default_service_agent) would, of course, be much nicer.
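
Constructing the strings statically could look like the sketch below. It derives all three principals from the project number using the standard Google Cloud address patterns; the function and field names are hypothetical, and the project number in the usage line is the one mentioned earlier in this thread.

```typescript
// The three Google-managed principals that are easy to confuse,
// keyed by hypothetical field names.
interface ComputePrincipals {
  computeDefaultServiceAccount: string;
  googleApisServiceAgent: string;
  computeEngineServiceAgent: string;
}

function computePrincipals(projectNumber: string): ComputePrincipals {
  // These addresses use the numeric project number, not the project ID.
  if (!/^\d+$/.test(projectNumber)) {
    throw new Error(`expected a numeric project number, got "${projectNumber}"`);
  }
  return {
    computeDefaultServiceAccount: `${projectNumber}-compute@developer.gserviceaccount.com`,
    googleApisServiceAgent: `${projectNumber}@cloudservices.gserviceaccount.com`,
    computeEngineServiceAgent: `service-${projectNumber}@compute-system.iam.gserviceaccount.com`,
  };
}

console.log(computePrincipals("224845064652").computeEngineServiceAgent);
```

Rejecting non-numeric input is a deliberate choice here, since passing the project ID instead of the number would produce an address that looks plausible but matches no account.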

Additional info: click

pspot2 avatar Jun 05 '24 15:06 pspot2