
container: fix missing updates for `google_container_cluster.node_config` subfields

wyardley opened this pull request 1 year ago • 11 comments

This resolves an issue where many subfields of node_config in a cluster (which controls the default node pool, "default-pool", when remove_default_node_pool is not set to true) don't support updates properly. It also allows three subfields, which had previously been set to force recreation of the default node pool (because updates were non-functional), to be updated in place.
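To make the scope concrete, here is the configuration shape involved (an illustrative sketch; the resource name and values are placeholders, the field names are real):

```hcl
resource "google_container_cluster" "example" {
  name               = "example-cluster"
  location           = "us-central1"
  initial_node_count = 1

  # Because remove_default_node_pool is left unset (i.e. not true), the
  # cluster keeps "default-pool", and this node_config block controls its
  # settings. Before this fix, changing many of these subfields either
  # failed to apply or forced recreation of the pool.
  node_config {
    machine_type = "e2-medium"
    disk_size_gb = 50
    labels = {
      env = "test"
    }
  }
}
```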

Some acceptance tests are added, and some existing tests have been adjusted to confirm that the behavior is the same between google_container_cluster.node_config and google_container_node_pool.node_config.

Fixes hashicorp/terraform-provider-google#19225 Fixes hashicorp/terraform-provider-google#18208 Fixes hashicorp/terraform-provider-google#19056

Possible fix for hashicorp/terraform-provider-google#17522

Follow-up to #11826, where the code used by the regular node pool update path was broken out into a standalone function.

Release Note Template for Downstream PRs (will be copied)

container: fixed missing in-place updates for some `google_container_cluster.node_config` subfields
container: added in-place update support for `labels`, `resource_manager_tags` and `workload_metadata_config` in `google_container_cluster.node_config` 

References

  • https://github.com/hashicorp/terraform-provider-google/issues/19225
  • #11272
  • #11717
  • #11826

b/361634104

wyardley avatar Oct 16 '24 05:10 wyardley

Hello! I am a robot. Tests will require approval from a repository maintainer to run.

@roaks3, a repository maintainer, has been assigned to review your changes. If you have not received review feedback within 2 business days, please leave a comment on this PR asking them to take a look.

You can help make sure that review is quick by doing a self-review and by running impacted tests locally.

github-actions[bot] avatar Oct 16 '24 05:10 github-actions[bot]

Hi there, I'm the Modular magician. I've detected the following information about your changes:

Diff report

Your PR generated some diffs in downstreams - here they are.

google provider: Diff ( 1 file changed, 5 insertions(+), 123 deletions(-))
google-beta provider: Diff ( 1 file changed, 5 insertions(+), 123 deletions(-))

modular-magician avatar Oct 17 '24 17:10 modular-magician

Tests analytics

Total tests: 212
Passed tests: 199
Skipped tests: 13
Affected tests: 0

Click here to see the affected service packages
  • container

🟢 All tests passed!

View the build log

modular-magician avatar Oct 17 '24 17:10 modular-magician

@roaks3 Sure!

  1. The node_config subfield of google_container_cluster controls the settings for the default node pool when remove_default_node_pool is not set to true. The linked issue describes how many of these fields previously didn't support updates when they should have, making it impossible to control those settings (which would be updatable on a standalone node pool). It's actually the exact same underlying resource. As suggested, I laid the foundation for this change in #11826 by moving the code that handles changes to node_config in the container_node_pool resource into a standalone function. #11272 also has some background / context. In theory, the update tests for these fields in the node pool resource should exercise the same behavior, and some existing tests (including TestAccContainerCluster_withNodeConfigKubeletConfigSettingsUpdates) should also cover this scenario, but I'm happy to add or update some existing tests (within reason) if that helps create more confidence.
  2. Have you read through the linked issues / PRs? They contain a lot of discussion that hopefully outlines both what the issue is and which other PRs have addressed parts of it. Some of the folks at Google who have been involved in previous discussions / PRs related to this are @rileykarson and @melinath, who probably have the most context here. In particular, see this comment. I've also now linked some related issues directly in the description under "References".
  3. I think the biggest thing is: there are obviously some test gaps here, most of which relate to things that never worked in this context (and maybe in some cases, aren't tested in the container_node_pool resource either for the use case of updating). Again, within reason, I am happy to add some additional test coverage if it helps improve confidence in this change.

So, tl;dr: this should make the behavior of the default node pool within the container_cluster resource more consistent with that of independently defined or additional node pools, and fix cases where an existing setting on default-pool could not be updated in place, resulting in failures or a permadiff.

Let me know if that helps in terms of what you were looking for.

A bit of broader context: the fact that google_container_cluster can define a node pool within the cluster resource (or have it defined outside of it) has been a persistent cause of headaches with this resource, and is one reason many people choose to delete the default pool and manage node pools separately.
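The workaround mentioned here, dropping the default pool and managing node pools as separate resources, looks roughly like this (an illustrative sketch with placeholder names and values):

```hcl
resource "google_container_cluster" "primary" {
  name     = "example-cluster"
  location = "us-central1"

  # Create, then immediately delete, the default pool; node pools are
  # then managed only via google_container_node_pool resources.
  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "primary_nodes" {
  name       = "primary-pool"
  cluster    = google_container_cluster.primary.name
  location   = google_container_cluster.primary.location
  node_count = 1

  node_config {
    machine_type = "e2-medium"
  }
}
```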

wyardley avatar Oct 17 '24 17:10 wyardley

@wyardley thanks for the response!

Yea, I think I understand the intent to update this config, and some of the nuance where this is shared between resources. But I am still wondering if you could give a brief description of the actual code changes here. (I could probably deduce the answer by reading all of the related references, but it would take quite a bit longer for me to wrap my head around everything for this review). For example, why are all of these lines being removed? Are they effectively already copied under the function you are replacing them with, or do I need to review those functions as well to determine what behavior has changed? In short, what could break?

Acknowledged on the testing. Once I know a bit more about what is changing, that will help me weigh in, but I'm tempted to say this is the type of change that would be covered well enough by existing tests. We may want to add some of these fields to existing tests, though, if there are any that aren't already included.

roaks3 avatar Oct 18 '24 21:10 roaks3

Yea, I think I understand the intent to update this config, and some of the nuance where this is shared between resources. But I am still wondering if you could give a brief description of the actual code changes here.

So the bug here is that not all of the fields that support in-place updates were actually being updated when defined for the default pool via node_config. Some of the code, such as for the fields I added in #11272, already had the update implemented and tested (because @melinath asked me to add tests). In those cases, the logic had been implemented separately in both places; in other cases, the code was simply missing from one place.

The plan therefore was to first extract out the node_config updates in container_node_pool to a separate function as one step, and then call that same function for node_config within container_cluster.

So the simple answer to your question is that the code being pulled out should be a functionally equivalent subset of what's now being called. Some of it should be identical or close to identical (aside from slightly different semantics around the field prefix) to the code in nodePoolNodeConfigUpdate() that's now being called - e.g., node_config.0.image_type and node_config.0.kubelet_config, which I fixed in #11697 after adding part of it for #11272, and gcfs_config, fixed in #11717. Other code (essentially, the code to handle the remaining updates) was simply missing here before - e.g., node_config.0.tags, node_config.0.resource_manager_tags, and the block with disk_size_gb, disk_type, etc. - and that's the part being fixed. But the actual thing being operated on is the same in both cases: a node pool.

For the parts that were already implemented in both places, you should be able to compare the code and see that those sections are pretty similar.

For example, compare: https://github.com/GoogleCloudPlatform/magic-modules/blob/c438c6eb725d6229ffdccdd171092e3f6be6ac7d/mmv1/third_party/terraform/services/container/resource_container_cluster.go.tmpl#L3846-L3892 to https://github.com/GoogleCloudPlatform/magic-modules/blob/c438c6eb725d6229ffdccdd171092e3f6be6ac7d/mmv1/third_party/terraform/services/container/node_config.go.tmpl#L2415-L2448
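For intuition, the shape of the refactor (one shared update helper called from both resources, with only the field prefix differing) can be sketched like this. This is a hypothetical simplification for illustration, not the provider's actual code:

```go
package main

import "fmt"

// nodeConfigUpdate is a hypothetical, heavily simplified stand-in for the
// provider's nodePoolNodeConfigUpdate(): one shared function that applies
// in-place updates for changed node_config subfields. The real function
// takes the resource data, a field prefix, and API clients.
func nodeConfigUpdate(prefix string, changed map[string]string) []string {
	var applied []string
	for field, value := range changed {
		// Only the prefix differs between the two callers:
		// "node_config.0." when called for the cluster's default pool,
		// and the node pool resource's own path otherwise.
		applied = append(applied, fmt.Sprintf("%s%s=%s", prefix, field, value))
	}
	return applied
}

func main() {
	// Both resources call the same function, so an update that works for
	// a standalone node pool should work for the default pool too.
	fmt.Println(nodeConfigUpdate("node_config.0.", map[string]string{"disk_size_gb": "100"}))
	fmt.Println(nodeConfigUpdate("node_pool.0.node_config.0.", map[string]string{"disk_size_gb": "100"}))
}
```

The point is simply that both call sites share one implementation, so the two code paths can no longer drift apart.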

I checked, and when the default node pool is defined via node_pool {} within container_cluster, updates already function as they should.

linux_node_config and the disk size etc. ones may need to be double-checked (I can play with it a little), though I think this code should handle it correctly too - in my understanding, if updating it for externally defined nodepools works, it should also work for the default pool. And, to whatever extent these things are tested under container_node_pool, my understanding is that those tests should more or less cover their use here.

However, some of these fields may not have tests in either place that cover the in-place update path.

wyardley avatar Oct 18 '24 22:10 wyardley

Drive-by comment: @wyardley will existing tests catch this error? or is this fixing bugs that were not caught by the existing tests?

melinath avatar Oct 18 '24 22:10 melinath

@melinath: There were presumably no (update) tests for the things that were broken before, at least not directly within container_cluster.node_config, since they’d be failing, and I think when we discussed, there was a preference to not add (skipped) failing tests.

~I’m not sure how comprehensive the tests under container_node_pool.node_config are - I’d guess there are missing ones there as well.~ But in any case where they do exist, presumably testing one should test the other once they’re using the same code.

~I ran through the tests that do have update tests, but agree there are probably gaps.~

Edit: See the summary below of what I found (and added) test-wise.

wyardley avatar Oct 18 '24 23:10 wyardley

Ok @roaks3 @melinath: I tried to do a quick audit of existing test coverage. It's kind of a patchwork, though there's actually a reasonable amount of coverage already, esp. if we assume the same behavior between the default pool and other node pools once this change is in place.

I'm adding a simple tweak to an existing test that will already cover tags and labels for container_cluster.node_config.

Here's what I found (may not be perfect)

  • logging_variant update tests: TestAccContainerCluster_withLoggingVariantUpdates and TestAccContainerNodePool_withLoggingVariantUpdates
  • disk_size_gb, disk_type, machine_type, storage_pools: Seems like TestAccContainerNodePool_withMachineDiskStoragePoolsUpdate tests this? And TestAccContainerCluster_storagePoolsWithNodePool and TestAccContainerNodePool_storagePools have storage_pools parameterized but only called once. Added a change to TestAccContainerCluster_withNodeConfig for disk_size_gb in another commit, and it would probably be pretty easy to change one or two of the other ones here too.
  • taint: Tested in TestAccContainerNodePool_withTaintsUpdate. I adjusted TestAccContainerCluster_withNodeConfig so it should now test updates too.
  • tags: Not sure updates are tested in either place, but could be missing it? Adding a change in TestAccContainerCluster_withNodeConfig
  • resource_manager_tags: TestAccContainerCluster_resourceManagerTags exists but doesn't test updates; TestAccContainerNodePool_resourceManagerTags does test updates. Edit: added an update step in this PR to testAccContainerCluster_resourceManagerTags that I think will work.
  • resource_labels: Looks like this is tested in TestAccContainerCluster_misc and TestAccContainerNodePool_withNodeConfig
  • labels: TestAccContainerCluster_withNodeConfig (updates in config testAccContainerCluster_withNodeConfigUpdate) - I think labels didn't change, but I'm pushing a commit so that it does change. This is a pretty good test where we could test some other small / generic stuff pretty easily since it's just two different configs vs. being parameterized.
  • image_type: Update covered in TestAccContainerNodePool_withNodeConfig and TestAccContainerCluster_withNodeConfig
  • workload_metadata_config: Update tested in TestAccContainerNodePool_withWorkloadIdentityConfig. Don't believe update is / was tested in container_cluster. Edit: added an update in this PR to test updates in TestAccContainerCluster_withWorkloadMetadataConfig too. Came across an interesting quirk there, and filed https://github.com/hashicorp/terraform-provider-google/issues/19935 for it
  • gcfs_config: (already covered, see above)
  • kubelet_config: (already covered, see above)
  • linux_node_config: Updates tested in TestAccContainerNodePool_withLinuxNodeConfig. Don't think they're currently covered in container_cluster.
  • fast_socket: Updates tested in TestAccContainerNodePool_fastSocket. Don't think they're currently covered in container_cluster. Edit: Since this also requires gvnic to be set, I added a new test for this one, TestAccContainerCluster_withNodeConfigFastSocket vs. piling yet more stuff onto the existing test with a bunch of settings.

wyardley avatar Oct 19 '24 00:10 wyardley

I was able to extend TestAccContainerCluster_withNodeConfig to include a bunch of fields that previously would have failed on update before like taint, machine_type, disk_type, disk_size_gb, tags, and labels, which dramatically reduces any testing gaps with only minimal added / modified code. I also added a ConfigPlanChecks block.

With the changes, things pass, and, even with all the separate node pool updates, it's not horribly slow:

--- PASS: TestAccContainerCluster_withNodeConfig (808.25s)
PASS
ok  	github.com/hashicorp/terraform-provider-google/google/services/container	809.408s

The same test fails as expected without the code changes.

Some of the fields (labels, workload_metadata_config, resource_manager_tags) were originally marked to force recreation, presumably because the code to update those fields was missing. I thought about trying to remove that separately, either before or after this change, but I think it's all intertwined, so I suggest we pull this out as part of these changes as well, and then maybe add a second release note if necessary: https://github.com/GoogleCloudPlatform/magic-modules/blob/3f06b2ed38e474ba08cca2d8553d29a741ec1810/mmv1/third_party/terraform/services/container/resource_container_cluster.go.tmpl#L107C2-L110

This is a good example where the tests directly for container_cluster and adding the plancheck were useful and (assuming this change is safe and makes sense) will also help make everything more consistent.

wyardley avatar Oct 19 '24 04:10 wyardley

I've added / adjusted tests for almost all of the fields directly in google_container_cluster (see above for more detailed notes on where tests were added / adjusted). Since fast_socket requires the gvnic setting, and since TestAccContainerCluster_withNodeConfig already checks so many different fields in one test, I put the fast_socket tests in a new test. The resource manager tag update tests are a little tricky - if someone knows more about this than I do, feel free to double-check that they make sense (basically, I made two tag / value sets and then swapped from 1 to 2 for the update, which was simplest without making two entirely separate configs the way the node pool test does).

I can add linux_node_config tests if desired as well, though the existing tests do seem to confirm that the behavior is consistent between node pool and cluster defined node pool.

All of these passed when I ran them manually; would be great if someone can kick off the full suite again Monday.

I left the commits mostly atomic for easier review, but let me know if I should squash something.

--- PASS: TestAccContainerCluster_resourceManagerTags (763.72s)
--- PASS: TestAccContainerCluster_withNodeConfig (776.55s)
--- PASS: TestAccContainerCluster_withWorkloadMetadataConfig (963.77s)
--- PASS: TestAccContainerCluster_withLoggingVariantUpdates (570.99s)
--- PASS: TestAccContainerCluster_withNodeConfigFastSocket (846.52s)
--- PASS: TestAccContainerCluster_withNodeConfigKubeletConfigSettingsUpdates (1280.76s)
--- PASS: TestAccContainerCluster_misc (1257.44s)

wyardley avatar Oct 19 '24 18:10 wyardley

@roaks3 @rileykarson @melinath: I added Fixes xxx (or partial / possible fix for cases I wasn't sure about) for some additional issues that I believe will be fixed by this; feel free to add / remove in there as makes sense.

wyardley avatar Oct 21 '24 20:10 wyardley

@roaks3 This PR has been waiting for review for 3 weekdays. Please take a look! Use the label disable-review-reminders to disable these notifications.

github-actions[bot] avatar Oct 22 '24 09:10 github-actions[bot]

Thanks @wyardley . I still have one last question, but I think I'm understanding things:

  1. Previously mutable fields
  • image_type
  • kubelet_config
  • gcfs_config

Tests seem good (looks like they were all covered already), but when I looked at the implementation it didn't look 1-to-1. For example, it seems like kubelet_config would now use ForceSendFields when it is nil. Sorry if I missed it, but are we accounting for potential changes to the behavior of these fields?

  2. Previously mutable fields that didn't work
  • logging_variant
  • disk_size_gb
  • disk_type
  • machine_type
  • storage_pools
  • taint
  • tags
  • resource_labels
  • linux_node_config
  • fast_socket

It seems to me like tests are good and you've filled any gaps. Behavior was previously broken, and this is the focus of the bug fix.

  3. Previously immutable fields
  • labels
  • workload_metadata_config
  • resource_manager_tags

Tests seem good. These are enhancements that are included because "why not", since we are already adding in-place update behavior.
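For background on the ForceSendFields question in item 1: in the generated Google API Go clients, zero-valued struct fields are normally dropped by `omitempty`, and naming a field in ForceSendFields forces it into the request body anyway. A stdlib-only sketch of the idea (simplified; the struct here is a mock, and the real marshaling lives inside the client library):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// KubeletConfig mimics the shape of a generated Google API struct
// (simplified sketch, not the real type).
type KubeletConfig struct {
	CpuCfsQuota     bool     `json:"cpuCfsQuota,omitempty"`
	ForceSendFields []string `json:"-"`
}

// marshal reimplements, in miniature, what the client library's internal
// marshaling does: honor omitempty, but force-send any field listed in
// ForceSendFields even at its zero value.
func marshal(k KubeletConfig) string {
	b, _ := json.Marshal(k) // omitempty drops a false CpuCfsQuota here
	body := map[string]interface{}{}
	json.Unmarshal(b, &body)
	for _, f := range k.ForceSendFields {
		if f == "CpuCfsQuota" {
			body["cpuCfsQuota"] = k.CpuCfsQuota
		}
	}
	out, _ := json.Marshal(body)
	return string(out)
}

func main() {
	fmt.Println(marshal(KubeletConfig{}))
	fmt.Println(marshal(KubeletConfig{ForceSendFields: []string{"CpuCfsQuota"}}))
}
```

Without ForceSendFields, the API cannot distinguish "unset" from "explicitly false", which is why sending it for a nil kubelet_config can be a fix rather than a behavior change.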

roaks3 avatar Oct 23 '24 20:10 roaks3

For example, it seems like kubelet_config would now use ForceSendFields when it is nil.

Oh, that may be integrating https://github.com/GoogleCloudPlatform/magic-modules/pull/11978 FWIW

rileykarson avatar Oct 23 '24 20:10 rileykarson

@rileykarson thanks, yes, I was just trying to pull up that reference. So I think it's either neutral or actually fixing something. I worked on that one, so I'm pretty confident about it.

I would guess / hope that any other drift is accidental and that the node pool is more likely to be correct, but let me know if you see other areas of concern. Further, this PR should help avoid that kind of mistake / drift in the future.

wyardley avatar Oct 23 '24 20:10 wyardley

Couple other things:

  • can someone review the list of fixed bugs in the description to make sure it looks good?
  • I think there were some recent changes to tests. There are no conflicts, so will those get pulled in via an extra merge step if tests are rerun, or should I rebase?

wyardley avatar Oct 23 '24 20:10 wyardley

Awesome, thanks!

I also just double checked the tests with what you stated in a previous comment, and everything looked good except the missing linux_node_config that you mentioned. The existing node pool test is probably sufficient honestly, but since we've already got all of the others tested explicitly for the container cluster, I would actually recommend adding that one too. It might be a bit surprising/confusing otherwise, and the explicit tests are nice should anything need to diverge.

roaks3 avatar Oct 23 '24 20:10 roaks3

Oh and yes, I'm reviewing the bugs you listed next, and I don't think you need to rebase.

roaks3 avatar Oct 23 '24 20:10 roaks3

Is there a specific subfield I should test for linux_node_config, or should I combine? If you look at the comparable tests in container_node_pool, there are several, and they’re not super trivial to replicate; though I can take a pass. I’d rather not add more or more complex tests than necessary.

wyardley avatar Oct 23 '24 20:10 wyardley

I would recommend a minimal linux_node_config update test to verify it can be updated successfully. So maybe just a single subfield.
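A minimal update test fixture for that might look like the following (a hypothetical sketch; the sysctl key is one GKE accepts, but the exact test wiring is up to the test author):

```hcl
# Step 1 config of a hypothetical two-step update test.
resource "google_container_cluster" "with_linux_node_config" {
  name               = "example-cluster"
  location           = "us-central1-f"
  initial_node_count = 1

  node_config {
    linux_node_config {
      sysctls = {
        "net.core.somaxconn" = "4096"
      }
    }
  }
}

# Step 2 would change only the sysctl value (e.g. to "8192") and assert
# that the plan is an in-place update rather than a recreation.
```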

Reviewed related issues and added some notes, but overall they look good.

I wanted to ask about https://github.com/hashicorp/terraform-provider-google/issues/19056, which mentions containerd_config. Is that related to your fix here? I couldn't find update logic for this field being added or altered in nodePoolNodeConfigUpdate().

roaks3 avatar Oct 23 '24 20:10 roaks3

Hi there, I'm the Modular magician. I've detected the following information about your changes:

Diff report

Your PR generated some diffs in downstreams - here they are.

google provider: Diff ( 3 files changed, 204 insertions(+), 181 deletions(-))
google-beta provider: Diff ( 3 files changed, 204 insertions(+), 181 deletions(-))

Missing test report

Your PR includes resource fields which are not covered by any test.

Resource: google_container_cluster (415 total tests)

Please add an acceptance test which includes these fields. The test should include the following:

resource "google_container_cluster" "primary" {
  node_config {
    resource_manager_tags = # value needed
  }
}


modular-magician avatar Oct 23 '24 21:10 modular-magician

I wanted to ask about https://github.com/hashicorp/terraform-provider-google/issues/19056, which mentions containerd_config. Is that related to your fix here? I couldn't find update logic for this field being added or altered with nodePoolNodeConfigUpdate().

@roaks3 interesting - you're right. I'll comment in that issue.

wyardley avatar Oct 23 '24 21:10 wyardley

Tests analytics

Total tests: 213
Passed tests: 195
Skipped tests: 13
Affected tests: 5

Click here to see the affected service packages
  • container

Action taken

Found 5 affected test(s) by replaying old test recordings. Starting RECORDING based on the most recent commit. Click here to see the affected tests
  • TestAccContainerCluster_resourceManagerTags
  • TestAccContainerCluster_withNodeConfig
  • TestAccContainerCluster_withNodeConfigFastSocket
  • TestAccContainerCluster_withSandboxConfig
  • TestAccContainerCluster_withWorkloadMetadataConfig

Get to know how VCR tests work

modular-magician avatar Oct 23 '24 21:10 modular-magician

Any context on the missing resource test warning? This PR does include a test that covers that field (interestingly, the warning doesn't complain about linux_node_config, which isn't tested).

IIRC, that test passed for me, and I think the VCR tests getting as far as they did means the build passed, but sometimes this warning is the result of something failing in the build.

wyardley avatar Oct 23 '24 21:10 wyardley

🟢 Tests passed during RECORDING mode:
TestAccContainerCluster_withNodeConfig [Debug log]
TestAccContainerCluster_withNodeConfigFastSocket [Debug log]
TestAccContainerCluster_withSandboxConfig [Debug log]
TestAccContainerCluster_withWorkloadMetadataConfig [Debug log]

🟢 No issues found for passed tests after REPLAYING rerun.


🔴 Tests failed during RECORDING mode: TestAccContainerCluster_resourceManagerTags [Error message] [Debug log]

🔴 Errors occurred during RECORDING mode. Please fix them to complete your PR.

View the build log or the debug log for each test

modular-magician avatar Oct 23 '24 21:10 modular-magician

I pushed up the linux_node_config test, which seems to be passing locally, and will also test TestAccContainerCluster_resourceManagerTags again to see what's going on there, but I guess that explains the test coverage error.

wyardley avatar Oct 23 '24 21:10 wyardley

@roaks3 do you think the TestAccContainerCluster_resourceManagerTags test maybe needs something like the below (BootstrapPSARole()), especially since the Terraform code is invoked multiple times? That might help if the test issue is related to this binding:

resource "google_project_iam_member" "tagHoldAdmin" {
  project = "%[1]s" 
  role    = "roles/resourcemanager.tagHoldAdmin"
  member  = "serviceAccount:service-${data.google_project.project.number}@container-engine-robot.iam.gserviceaccount.com"
}       

I still don't totally get how BootstrapPSARole() works, because in my limited local tests, the test seems to clean up the binding when it's done - so wouldn't this fail every time, and certainly on the first run? I see it used in other tests, though, so I'm assuming it will work?

diff --git a/mmv1/third_party/terraform/services/container/resource_container_cluster_test.go.tmpl b/mmv1/third_party/terraform/services/container/resource_container_cluster_test.go.tmpl
index 9b54afebd..482695b3d 100644
--- a/mmv1/third_party/terraform/services/container/resource_container_cluster_test.go.tmpl
+++ b/mmv1/third_party/terraform/services/container/resource_container_cluster_test.go.tmpl
@@ -68,6 +68,10 @@ func TestAccContainerCluster_resourceManagerTags(t *testing.T) {
 	networkName := acctest.BootstrapSharedTestNetwork(t, "gke-cluster")
 	subnetworkName := acctest.BootstrapSubnet(t, "gke-cluster", networkName)
 
+	if acctest.BootstrapPSARole(t, "service-", "container-engine-robot", "roles/resourcemanager.tagHoldAdmin") {
+		t.Fatal("Stopping the test because a role was added to the policy.")
+	}
+
 	acctest.VcrTest(t, resource.TestCase{
 		PreCheck:                 func() { acctest.AccTestPreCheck(t) },
 		ProtoV5ProviderFactories: acctest.ProtoV5ProviderFactories(t),

Or was the failure in CI something else besides a permissions issue? (From local runs, I'm also seeing a GCE-level stockout issue, "GCE_STOCKOUT".)

wyardley avatar Oct 23 '24 22:10 wyardley

--- PASS: TestAccContainerCluster_resourceManagerTags (829.99s)
PASS
ok  	github.com/hashicorp/terraform-provider-google/google/services/container	831.174s

My runs that didn't hit a stockout condition seem to pass, so I guess we could do some combination of these things, depending on what you're seeing:

  1. ~Switch to a different instance type than n1~ edit: pushed a new commit taking the default (currently e2-medium).
  2. Switch to a different zone (-f, let's say) vs -a
  3. Add the bootstrap function above, which I think is a good idea anyway? I've pushed that up, but let me know if you want me to remove it.

wyardley avatar Oct 24 '24 05:10 wyardley

@GoogleCloudPlatform/terraform-team @roaks3 This PR has been waiting for review for 1 week. Please take a look! Use the label disable-review-reminders to disable these notifications.

github-actions[bot] avatar Oct 24 '24 09:10 github-actions[bot]