container: fix missing updates for `google_container_cluster.node_config` subfields
This resolves an issue where many subfields of `node_config` in a cluster (which controls the default node pool, "default-pool", unless `remove_default_node_pool` is set to `true`) didn't support updates properly. It also allows three subfields, which had previously been set to force recreation of the default node pool (because updates were non-functional), to be updated in place.
Some acceptance tests are added, and some existing tests have been adjusted to confirm that the behavior is the same between `google_container_cluster.node_config` and `google_container_node_pool.node_config`.
Fixes hashicorp/terraform-provider-google#19225
Fixes hashicorp/terraform-provider-google#18208
Fixes hashicorp/terraform-provider-google#19056
Possible fix for hashicorp/terraform-provider-google#17522
Follow-up to #11826, where the code used by the regular node pool update path was broken out into a standalone function.
Release Note Template for Downstream PRs (will be copied)
container: fixed missing in-place updates for some `google_container_cluster.node_config` subfields
container: added in-place update support for `labels`, `resource_manager_tags` and `workload_metadata_config` in `google_container_cluster.node_config`
References
- https://github.com/hashicorp/terraform-provider-google/issues/19225
- #11272
- #11717
- #11826
b/361634104
Hello! I am a robot. Tests will require approval from a repository maintainer to run.
@roaks3, a repository maintainer, has been assigned to review your changes. If you have not received review feedback within 2 business days, please leave a comment on this PR asking them to take a look.
You can help make sure that review is quick by doing a self-review and by running impacted tests locally.
Hi there, I'm the Modular magician. I've detected the following information about your changes:
Diff report
Your PR generated some diffs in downstreams - here they are.
google provider: Diff ( 1 file changed, 5 insertions(+), 123 deletions(-))
google-beta provider: Diff ( 1 file changed, 5 insertions(+), 123 deletions(-))
Tests analytics
Total tests: 212 Passed tests: 199 Skipped tests: 13 Affected tests: 0
Affected service packages:
- container
🟢 All tests passed!
@roaks3 Sure!
- The `node_config` subfield of `google_container_cluster` controls the settings for the default node pool ("default-pool") unless `remove_default_node_pool` is set to `true`. The linked issue mentions how many of these fields previously didn't support updates when they should have, causing an inability to control those settings (which would be updatable if they were on a standalone node pool). It's actually the exact same resource. As suggested, I laid the foundation for this change in #11826 by moving the code that runs when `node_config` in the `container_node_pool` resource has changes into a standalone function. #11272 also has some background / context. In theory, the tests for updates on some of these fields in the node pool resource should behave the same way, and some of the existing tests (including `TestAccContainerCluster_withNodeConfigKubeletConfigSettingsUpdates`) should also cover this scenario, but I'm happy to add or update some existing tests (within reason) if that helps create more confidence.
- Have you read through the linked issues / PRs? Those have a lot of discussion that hopefully outlines both what the issue is and some other PRs that have addressed part of it. Some of the folks at Google who have been involved in previous discussions / PRs related to this are @rileykarson and @melinath, who probably have the most context here. In particular, see this comment. I've linked to some related issues directly in the description now as well, under "References".
- I think the biggest thing is: there are obviously some test gaps here, most of which relate to things that never worked in this context (and maybe, in some cases, aren't tested in the `container_node_pool` resource either for the update use case). Again, within reason, I am happy to add some additional test coverage if it helps improve confidence in this change.
So, tl;dr: this should make the behavior of the default node pool within the `container_cluster` resource more consistent with the behavior of independently defined or additional node pools, and fix some cases where an existing setting for default-pool could not be updated in place without a failure or permadiff.
Let me know if that helps in terms of what you were looking for.
A bit of broader context: the fact that `google_container_cluster` can define a node pool within the cluster resource (or outside of it) has been a persistent cause of headaches with this resource, and is one reason many people choose to delete the default pool and manage node pools separately.
@wyardley thanks for the response!
Yea, I think I understand the intent to update this config, and some of the nuance where this is shared between resources. But I am still wondering if you could give a brief description of the actual code changes here. (I could probably deduce the answer by reading all of the related references, but it would take quite a bit longer for me to wrap my head around everything for this review). For example, why are all of these lines being removed? Are they effectively already copied under the function you are replacing them with, or do I need to review those functions as well to determine what behavior has changed? In short, what could break?
Acknowledged on the testing. Once I know a bit more about what is changing, that will help me weigh in, but I'm tempted to say this is the type of change that would be covered well enough by existing tests. We may want to add some of these fields to existing tests, though, if there are any that aren't already included.
> Yea, I think I understand the intent to update this config, and some of the nuance where this is shared between resources. But I am still wondering if you could give a brief description of the actual code changes here.
So the bug here is that not all of the fields that support in-place updates were being updated when defined for the default pool via `node_config`. Some of the code, such as the fields I added in #11272, had the update implemented and had tests for it (because @melinath asked me to add them). In those cases, the updates would have already been implemented separately in both places; in other cases, the code was simply missing from one place.
The plan therefore was to first extract the `node_config` update logic in `container_node_pool` into a separate function as one step, and then call that same function for `node_config` within `container_cluster`.
So the simple answer to your question is that the code being pulled out should be a functionally equivalent subset of what's now being called. Some of the code should be identical or close to identical (other than the semantics around `prefix` being a little different, and so on) to the code in `nodePoolNodeConfigUpdate()` that's now being called (e.g., `node_config.0.image_type` and `node_config.0.kubelet_config`, which I fixed in #11697 after adding part of it for #11272, and `gcfs_config`, fixed in #11717), whereas some of the code (essentially, the code to do the remainder of the updates) was simply not present here previously (e.g., `node_config.0.tags`, `node_config.0.resource_manager_tags`, the block with `disk_size_gb`, `disk_type`, etc.), which is the part that's being fixed. But the actual thing being operated on is the same in both cases (a node pool).
For the parts that were already implemented in both places, you should be able to compare the code and see that those sections are pretty similar.
For example, compare: https://github.com/GoogleCloudPlatform/magic-modules/blob/c438c6eb725d6229ffdccdd171092e3f6be6ac7d/mmv1/third_party/terraform/services/container/resource_container_cluster.go.tmpl#L3846-L3892 to https://github.com/GoogleCloudPlatform/magic-modules/blob/c438c6eb725d6229ffdccdd171092e3f6be6ac7d/mmv1/third_party/terraform/services/container/node_config.go.tmpl#L2415-L2448
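To make the shape of the change concrete, here's a rough sketch of the pattern (signatures and names are simplified / assumed for illustration, not copied from the actual template code; imports elided):

```go
// Simplified sketch, not the real template code. The shared helper applies
// in-place updates for changed node_config subfields on a single node pool;
// the prefix differs between callers ("node_config.0." when called for the
// cluster's default pool).
func nodePoolNodeConfigUpdate(d *schema.ResourceData, config *transport_tpg.Config,
	nodePoolInfo *NodePoolInformation, prefix string, timeout time.Duration) error {
	// ... one block per updatable subfield: image_type, kubelet_config,
	// tags, resource_manager_tags, disk_size_gb, disk_type, machine_type, ...
	return nil
}

// In resourceContainerClusterUpdate, the previously duplicated inline blocks
// collapse into a single call for default-pool, which is why the downstream
// diff is almost all deletions:
if d.HasChange("node_config") {
	if err := nodePoolNodeConfigUpdate(d, config, nodePoolInfo, "node_config.0.", d.Timeout(schema.TimeoutUpdate)); err != nil {
		return err
	}
}
```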
I checked, and when the default node pool is defined via `node_pool {}` within `container_cluster`, that already functions as it should.
`linux_node_config` and the disk size etc. ones may need to be double-checked (I can play with it a little), though I think this code should handle them correctly too - in my understanding, if updating them works for externally defined node pools, it should also work for the default pool. And, to whatever extent these things are tested under `container_node_pool`, my understanding is that those tests should more or less cover their use here.
However, some of these fields may not have tests in either place that exercise the in-place update path.
Drive-by comment: @wyardley will existing tests catch this error? or is this fixing bugs that were not caught by the existing tests?
@melinath: There were presumably no (update) tests for the things that were broken before, at least not directly within container_cluster.node_config, since they’d be failing, and I think when we discussed, there was a preference to not add (skipped) failing tests.
~I’m not sure how comprehensive the tests under container_node_pool.node_config are - I’d guess there are missing ones there as well.~ But in any cases where they do exist, presumably testing one should test the other once they’re using the same code.
~I ran through the tests that do have update tests, but agree there are probably gaps.~
Edit: See the summary below for what I found (and added) test-wise.
Ok @roaks3 @melinath: tried to do a quick audit of existing test coverage. It's kind of a patchwork, though actually a reasonable amount of coverage already, esp. if we assume the same behavior between the default pool and other nodepools once this change is in place.
I'm adding a simple tweak to an existing test that will already cover tags and labels for container_cluster.node_config.
Here's what I found (may not be perfect):
- `logging_variant`: update tests exist in `TestAccContainerCluster_withLoggingVariantUpdates` and `TestAccContainerNodePool_withLoggingVariantUpdates`.
- `disk_size_gb`, `disk_type`, `machine_type`, `storage_pools`: Seems like `TestAccContainerNodePool_withMachineDiskStoragePoolsUpdate` tests this? And `TestAccContainerCluster_storagePoolsWithNodePool` and `TestAccContainerNodePool_storagePools` have `storage_pools` parameterized but only called once. Added a change here to `TestAccContainerCluster_withNodeConfig` for `disk_size_gb` in another commit, and it would probably be pretty easy to change one or two of the other ones here too.
- `taint`: Tested in `TestAccContainerNodePool_withTaintsUpdate`. I adjusted `TestAccContainerCluster_withNodeConfig` so it should now test updates too.
- `tags`: Not sure updates are tested in either place, but could be missing it? Adding a change in `TestAccContainerCluster_withNodeConfig`.
- `resource_manager_tags`: `TestAccContainerCluster_resourceManagerTags` exists but doesn't test updates; `TestAccContainerNodePool_resourceManagerTags` does test updates. Edit: Added an update in this PR that I think will work to `testAccContainerCluster_resourceManagerTags`.
- `resource_labels`: Looks like this is tested in `TestAccContainerCluster_misc` and `TestAccContainerNodePool_withNodeConfig`.
- `labels`: `TestAccContainerCluster_withNodeConfig` (updates in config `testAccContainerCluster_withNodeConfigUpdate`) - I think labels didn't change, but I'm pushing a commit so that it does change. This is a pretty good test where we could test some other small / generic stuff pretty easily, since it's just two different configs vs. being parameterized.
- `image_type`: Update covered in `TestAccContainerNodePool_withNodeConfig` and `TestAccContainerCluster_withNodeConfig`.
- `workload_metadata_config`: Update tested in `TestAccContainerNodePool_withWorkloadIdentityConfig`. Don't believe update is / was tested in `container_cluster`. Edit: added an update in this PR to test updates in `TestAccContainerCluster_withWorkloadMetadataConfig` too. Came across an interesting quirk there, and filed https://github.com/hashicorp/terraform-provider-google/issues/19935 for it.
- `gcfs_config`: (already covered, see above)
- `kubelet_config`: (already covered, see above)
- `linux_node_config`: Updates tested in `TestAccContainerNodePool_withLinuxNodeConfig`. Don't think they're currently covered in `container_cluster`.
- `fast_socket`: Updates tested in `TestAccContainerNodePool_fastSocket`. Don't think they're currently covered in `container_cluster`. Edit: Since this also requires gvnic to be set, I added a new test for this one, `TestAccContainerCluster_withNodeConfigFastSocket`, vs. piling yet more stuff onto the existing test with a bunch of settings.
I was able to extend TestAccContainerCluster_withNodeConfig to include a bunch of fields that previously would have failed on update before like taint, machine_type, disk_type, disk_size_gb, tags, and labels, which dramatically reduces any testing gaps with only minimal added / modified code. I also added a ConfigPlanChecks block.
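For reference, the plan check addition looks roughly like this (the config function arguments and resource address are assumed from the existing test, so double-check against the PR):

```go
// Fragment of the update TestStep, using the plancheck package from
// github.com/hashicorp/terraform-plugin-testing. It fails the test if the
// plan would destroy/recreate the cluster instead of updating it in place.
{
	Config: testAccContainerCluster_withNodeConfigUpdate(clusterName), // args assumed
	ConfigPlanChecks: resource.ConfigPlanChecks{
		PreApply: []plancheck.PlanCheck{
			plancheck.ExpectResourceAction(
				"google_container_cluster.with_node_config",
				plancheck.ResourceActionUpdate,
			),
		},
	},
},
```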
With the changes, things pass, and, even with all the separate node pool updates, it's not horribly slow:
```
--- PASS: TestAccContainerCluster_withNodeConfig (808.25s)
PASS
ok      github.com/hashicorp/terraform-provider-google/google/services/container       809.408s
```
The same test fails as expected without the code changes.
Some of the fields (`labels`, `workload_metadata_config`, `resource_manager_tags`) were originally marked as fields that force recreation, presumably because of the missing code to update them; I thought about trying to remove that separately, either before or after, but I think it's all intertwined, so I suggest we pull this out as part of these changes too, and then maybe add a second release note if necessary:
https://github.com/GoogleCloudPlatform/magic-modules/blob/3f06b2ed38e474ba08cca2d8553d29a741ec1810/mmv1/third_party/terraform/services/container/resource_container_cluster.go.tmpl#L107C2-L110
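For context, the linked lines define the force-new list; paraphrased here from memory of that area (not an exact copy):

```go
// These node_config subfields were treated as immutable on the cluster
// resource, forcing recreation of default-pool on any change, presumably
// because the in-place update code was missing. With update support added,
// this PR drops them from the list so changes apply in place.
var forceNewClusterNodeConfigFields = []string{
	"labels",
	"resource_manager_tags",
	"workload_metadata_config",
}
```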
This is a good example of where adding tests directly for `container_cluster`, along with the plancheck, was useful, and (assuming this change is safe and makes sense) it will also help make everything more consistent.
I've added / adjusted tests for almost all of the fields directly in `google_container_cluster` (see above for more detailed notes on where tests were added / adjusted). Since `fast_socket` requires the gvnic setting, and since there are already so many different settings checked in `TestAccContainerCluster_withNodeConfig`, I put the `fast_socket` tests in a new test. The resource manager tag update tests are a little tricky - if someone knows more about this than I do, feel free to double-check that they make sense (basically, I made two tag / value sets, and then swapped from 1 to 2 for the update, which was simplest without making two entirely separate configs the way the node pool test does).
I can add `linux_node_config` tests if desired as well, though the existing tests do seem to confirm that the behavior is consistent between the node pool resource and the cluster-defined node pool.
All of these passed when I ran them manually; would be great if someone can kick off the full suite again Monday.
I left the commits mostly atomic for easier review, but let me know if I should squash something.
```
--- PASS: TestAccContainerCluster_resourceManagerTags (763.72s)
--- PASS: TestAccContainerCluster_withNodeConfig (776.55s)
--- PASS: TestAccContainerCluster_withWorkloadMetadataConfig (963.77s)
--- PASS: TestAccContainerCluster_withLoggingVariantUpdates (570.99s)
--- PASS: TestAccContainerCluster_withNodeConfigFastSocket (846.52s)
--- PASS: TestAccContainerCluster_withNodeConfigKubeletConfigSettingsUpdates (1280.76s)
--- PASS: TestAccContainerCluster_misc (1257.44s)
```
@roaks3 @rileykarson @melinath: I added Fixes xxx (or partial / possible fix for cases I wasn't sure about) for some additional issues that I believe will be fixed by this; feel free to add / remove in there as makes sense.
@roaks3 This PR has been waiting for review for 3 weekdays. Please take a look! Use the label disable-review-reminders to disable these notifications.
Thanks @wyardley. I still have one last question, but I think I'm understanding things:
- Previously mutable fields: `image_type`, `kubelet_config`, `gcfs_config`
Tests seem good (looks like they were all covered already), but when I looked at the implementation it didn't look 1-to-1. For example, it seems like `kubelet_config` would now use `ForceSendFields` when it is nil. Sorry if I missed it, but are we accounting for potential changes to the behavior of these fields?
- Previously mutable fields that didn't work: `logging_variant`, `disk_size_gb`, `disk_type`, `machine_type`, `storage_pools`, `taint`, `tags`, `resource_labels`, `linux_node_config`, `fast_socket`
It seems to me like tests are good and you've filled any gaps. Behavior was previously broken, and this is the focus of the bug fix.
- Previously immutable fields: `labels`, `workload_metadata_config`, `resource_manager_tags`
Tests seem good. These are enhancements that are included because "why not", since we are already adding in-place update behavior.
> For example, it seems like kubelet_config would now use ForceSendFields when it is nil.
Oh, that may be integrating https://github.com/GoogleCloudPlatform/magic-modules/pull/11978 FWIW
@rileykarson thanks, yes, I was just trying to pull up that reference. So I think it's either neutral or actually fixing something. That one I worked on, so I'm pretty confident about it.
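For anyone unfamiliar with the pattern, here's a rough illustration of what `ForceSendFields` does in the generated GKE client (types from `google.golang.org/api/container/v1`; the actual provider code may construct the request differently):

```go
import container "google.golang.org/api/container/v1"

// The generated client omits zero-valued fields from request JSON by
// default. Listing KubeletConfig in ForceSendFields makes the empty block
// get serialized anyway, explicitly clearing the server-side value instead
// of the field being silently dropped from the update request.
func buildClearKubeletConfigRequest() *container.UpdateNodePoolRequest {
	return &container.UpdateNodePoolRequest{
		KubeletConfig:   &container.NodeKubeletConfig{},
		ForceSendFields: []string{"KubeletConfig"},
	}
}
```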
I would guess / hope that any other drift is accidental and that the node pool is more likely to be correct, but let me know if you see other areas of concern. Further, this PR should help avoid that kind of mistake / drift in the future.
Couple other things:
- can someone review the list of fixed bugs in the description to make sure it looks good?
- I think there were some recent changes to tests. There are no conflicts, so will those get pulled in if tests are rerun via an extra merge step, or should I rebase?
Awesome, thanks!
I also just double checked the tests with what you stated in a previous comment, and everything looked good except the missing linux_node_config that you mentioned. The existing node pool test is probably sufficient honestly, but since we've already got all of the others tested explicitly for the container cluster, I would actually recommend adding that one too. It might be a bit surprising/confusing otherwise, and the explicit tests are nice should anything need to diverge.
Oh and yes, I'm reviewing the bugs you listed next, and I don't think you need to rebase.
Is there a specific subfield I should test for linux_node_config, or should I combine? If you look at the comparable tests in container_node_pool, there are several, and they’re not super trivial to replicate; though I can take a pass. I’d rather not add more or more complex tests than necessary.
I would recommend a minimal linux_node_config update test to verify it can be updated successfully. So maybe just a single subfield.
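Something like this would probably suffice, in the style of the existing test configs (the function name, location, and sysctl choice are all assumptions here, not taken from the PR; `fmt` import elided):

```go
// Hypothetical minimal config for a linux_node_config update test; the
// update step re-renders it with a different sysctl value and asserts an
// in-place update via ConfigPlanChecks.
func testAccContainerCluster_withLinuxNodeConfig(clusterName, backlogValue string) string {
	return fmt.Sprintf(`
resource "google_container_cluster" "with_linux_node_config" {
  name                = "%s"
  location            = "us-central1-f"
  initial_node_count  = 1
  deletion_protection = false

  node_config {
    linux_node_config {
      sysctls = {
        "net.core.netdev_max_backlog" = "%s"
      }
    }
  }
}
`, clusterName, backlogValue)
}
```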
Reviewed related issues and added some notes, but overall they look good.
I wanted to ask about https://github.com/hashicorp/terraform-provider-google/issues/19056, which mentions `containerd_config`. Is that related to your fix here? I couldn't find update logic for this field being added or altered in `nodePoolNodeConfigUpdate()`.
Hi there, I'm the Modular magician. I've detected the following information about your changes:
Diff report
Your PR generated some diffs in downstreams - here they are.
google provider: Diff ( 3 files changed, 204 insertions(+), 181 deletions(-))
google-beta provider: Diff ( 3 files changed, 204 insertions(+), 181 deletions(-))
Missing test report
Your PR includes resource fields which are not covered by any test.
Resource: google_container_cluster (415 total tests)
Please add an acceptance test which includes these fields. The test should include the following:
resource "google_container_cluster" "primary" {
node_config {
resource_manager_tags = # value needed
}
}
> I wanted to ask about https://github.com/hashicorp/terraform-provider-google/issues/19056, which mentions `containerd_config`. Is that related to your fix here? I couldn't find update logic for this field being added or altered in `nodePoolNodeConfigUpdate()`.
@roaks3 interesting - you're right. I'll comment in that issue.
Tests analytics
Total tests: 213 Passed tests: 195 Skipped tests: 13 Affected tests: 5
Affected service packages:
- container
Action taken
Found 5 affected test(s) by replaying old test recordings. Starting RECORDING based on the most recent commit. Affected tests:
- TestAccContainerCluster_resourceManagerTags
- TestAccContainerCluster_withNodeConfig
- TestAccContainerCluster_withNodeConfigFastSocket
- TestAccContainerCluster_withSandboxConfig
- TestAccContainerCluster_withWorkloadMetadataConfig
Any context on the missing resource test warning? There is a test in this PR that covers that field (interestingly, it's not complaining about linux_node_config, which isn't tested).
IIRC, that test passed for me, and I think the VCR tests getting as far as they did means the build passed, but sometimes this warning is the result of something failing in the build.
🟢 Tests passed during RECORDING mode:
TestAccContainerCluster_withNodeConfig [Debug log]
TestAccContainerCluster_withNodeConfigFastSocket [Debug log]
TestAccContainerCluster_withSandboxConfig [Debug log]
TestAccContainerCluster_withWorkloadMetadataConfig [Debug log]
🟢 No issues found for passed tests after REPLAYING rerun.
🔴 Tests failed during RECORDING mode:
TestAccContainerCluster_resourceManagerTags [Error message] [Debug log]
🔴 Errors occurred during RECORDING mode. Please fix them to complete your PR.
I pushed up the linux_node_config test, which seems to be passing locally, and will also test TestAccContainerCluster_resourceManagerTags again to see what's going on there, but I guess that explains the test coverage error.
@roaks3 do you think maybe it needs something like the below (`BootstrapPSARole()`) in the `TestAccContainerCluster_resourceManagerTags` test, especially since the terraform code is invoked multiple times? That may help if the test issue is related to this bit:
resource "google_project_iam_member" "tagHoldAdmin" {
project = "%[1]s"
role = "roles/resourcemanager.tagHoldAdmin"
member = "serviceAccount:service-${data.google_project.project.number}@container-engine-robot.iam.gserviceaccount.com"
}
I still don't totally get how this (`BootstrapPSARole()`) works, because in my limited local tests, the test seems to clean up the binding when it's done - so wouldn't this fail every time, and certainly on the first run? I see it used in other tests, though, so I'm assuming it will work.
```diff
diff --git a/mmv1/third_party/terraform/services/container/resource_container_cluster_test.go.tmpl b/mmv1/third_party/terraform/services/container/resource_container_cluster_test.go.tmpl
index 9b54afebd..482695b3d 100644
--- a/mmv1/third_party/terraform/services/container/resource_container_cluster_test.go.tmpl
+++ b/mmv1/third_party/terraform/services/container/resource_container_cluster_test.go.tmpl
@@ -68,6 +68,10 @@ func TestAccContainerCluster_resourceManagerTags(t *testing.T) {
 	networkName := acctest.BootstrapSharedTestNetwork(t, "gke-cluster")
 	subnetworkName := acctest.BootstrapSubnet(t, "gke-cluster", networkName)
 
+	if acctest.BootstrapPSARole(t, "service-", "container-engine-robot", "roles/resourcemanager.tagHoldAdmin") {
+		t.Fatal("Stopping the test because a role was added to the policy.")
+	}
+
 	acctest.VcrTest(t, resource.TestCase{
 		PreCheck:                 func() { acctest.AccTestPreCheck(t) },
 		ProtoV5ProviderFactories: acctest.ProtoV5ProviderFactories(t),
```
Or was the failure in CI something else besides a permissions issue? (From local runs, I'm also seeing a GCE-level stockout issue, "GCE_STOCKOUT".)
```
--- PASS: TestAccContainerCluster_resourceManagerTags (829.99s)
PASS
ok      github.com/hashicorp/terraform-provider-google/google/services/container       831.174s
```
My runs that didn't hit a stockout condition seem to pass, so I guess we could do some combination of these things, depending on what you're seeing:
- ~Switch to a different instance type than `n1`~ edit: pushed a new commit taking the default (currently `e2-medium`).
- Switch to a different zone (`-f`, let's say) vs `-a`.
- Add the bootstrap function above, which I think is a good idea anyway? I've pushed that up, but let me know if you want me to remove it.
@GoogleCloudPlatform/terraform-team @roaks3 This PR has been waiting for review for 1 week. Please take a look! Use the label disable-review-reminders to disable these notifications.