container: make `cpu_manager_policy` optional in `kubelet_config`
Part of https://github.com/hashicorp/terraform-provider-google/issues/19225
This should resolve some confusing behavior with cpu_manager_policy
- It frequently will show a permadrift when it can't be set
- It also doesn't seem to accept the documented value of "none" as an empty value, though the previously undocumented empty string (
"") seems to work.
https://github.com/GoogleCloudPlatform/magic-modules/pull/3760/commits/efb71a903fc26775d568b8129ef18ce707dfbdc6#r458238583
https://github.com/GoogleCloudPlatform/magic-modules/pull/3760/commits/efb71a903fc26775d568b8129ef18ce707dfbdc6#r473173480
☝️ context on when it was originally marked Required
This doesn't resolve all of the issues, but resolves other issues where it must be set where it's not actually needed (for example, if insecure_kubelet_readonly_port_enabled is set).
It appears that it was marked as Required somewhat arbitrarily (see above), and it's also possible that some of what's in place is tied to an API level bug that may have since been resolved. Maybe we could require at last one instead -- happy to do that if there's a good example to follow.
I did come across https://github.com/hashicorp/terraform-provider-google/issues/15767 while testing this, but I think this is neutral as far as that goes.
Release Note Template for Downstream PRs (will be copied)
container: make `cpu_manager_policy` optional in `kubelet_config`
Hello! I am a robot. Tests will require approval from a repository maintainer to run.
@melinath, a repository maintainer, has been assigned to review your changes. If you have not received review feedback within 2 business days, please leave a comment on this PR asking them to take a look.
You can help make sure that review is quick by doing a self-review and by running impacted tests locally.
Here's the full PR where it was originally marked required: https://github.com/GoogleCloudPlatform/magic-modules/pull/3760
Hi there, I'm the Modular magician. I've detected the following information about your changes:
Diff report
Your PR generated some diffs in downstreams - here they are.
google provider: Diff ( 3 files changed, 5 insertions(+), 11 deletions(-))
google-beta provider: Diff ( 3 files changed, 5 insertions(+), 11 deletions(-))
Missing test report
Your PR includes resource fields which are not covered by any test.
Resource: google_container_cluster (382 total tests)
Please add an acceptance test which includes these fields. The test should include the following:
resource "google_container_cluster" "primary" {
node_config {
kubelet_config {
cpu_manager_policy = # value needed
}
}
}
@hoskeri could you confirm the expected API behavior?
Tests analytics
Total tests: 203 Passed tests: 188 Skipped tests: 13 Affected tests: 2
Click here to see the affected service packages
- container
Action taken
Found 2 affected test(s) by replaying old test recordings. Starting RECORDING based on the most recent commit. Click here to see the affected tests
- TestAccContainerCluster_withInsecureKubeletReadonlyPortEnabledInNodeConfigUpdates
- TestAccContainerNodePool_withKubeletConfig
I’m not sure if there’s a permadiff in this case; I imagine a real fix to the linked issue may involve some more code changes.
This is only intended to be a partial fix in that I don’t believe this parameter is actually required? But I think the behavior may be a net neutral or slight improvement as it is, and also lets people set unrelated parameters when needed. If the value is unset, I believe the behavior should already be default by setting it to an empty string or ”none”
If I’m understanding the original intent correctly the main point was to make sure that at least one parameter was set in the block (which I think there’s a AtLeastOneOf function mentioned in that original commit to switch to if that’s the use case we’re concerned about).
Maybe there are some other cases where certain things are required, but AFAICT, even the other cpu_* policies don't directly require this setting to be explicitly set.
So this PR solves the problem of people not being able to set other parameters in the same block without having to set cpu_manager_policy at all when it’s not needed, and adds / updates the docs very slightly, while punting on the ongoing problems with other types of known issues with these resources (and the confusion of "none" vs "").
Put differently: other than allowing some people to remove the attribute, I don't believe this should cause a change in behavior for anyone who has it currently set to one of the 3 allowed values ("static", "", or "none"), and should simplify things and possibly avoid a permadiff for people who don't have it set at all, but also want to set, e.g., insecure_kubelet_readonly_port_enabled.
I don’t think this should change the behavior for existing users as best I could ascertain.
$\textcolor{green}{\textsf{Tests passed during RECORDING mode:}}$
TestAccContainerCluster_withInsecureKubeletReadonlyPortEnabledInNodeConfigUpdates[Debug log]
TestAccContainerNodePool_withKubeletConfig[Debug log]
$\textcolor{green}{\textsf{No issues found for passed tests after REPLAYING rerun.}}$
$\textcolor{green}{\textsf{All tests passed!}}$
TestAccContainerCluster_withInsecureKubeletReadonlyPortEnabledInNodeConfigUpdates
note: with this change, we could technically remove cpu_manager_policy from the test if we wanted it to be (and left in the broader TestAccContainerNodePool_withKubeletConfig one.
I can't comment on how it's supposed to behave, obviously, so hopefully we'll get a response on that, but:
https://pkg.go.dev/google.golang.org/api/container/v1#NodeKubeletConfig
Control the CPU management policy on the node. See https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/ The following values are allowed. * "none": the default, which represents the existing scheduling behavior. * "static": allows pods with certain resource characteristics to be granted increased CPU affinity and exclusivity on the node. The default value is 'none' if unspecified. CpuManagerPolicy string
json:"cpuManagerPolicy,omitempty"
Even as-is, though, "static" and "" (which is allowed) seem to work as expected, other than the update variant.
What I see is if I create a new cluster with the current 6.0.1 (unpatched) provider, it sends "none", and the API does seem to return it vs. returning empty for default. So maybe even though they have identical results, seemingly, setting none does not behave the same way as not setting it, if you look at the API response further down.
node_config {
kubelet_config {
cpu_manager_policy = "none"
}
}
Call to API during initial cluster creation:
2024-08-29T15:30:52.742-0700 [DEBUG] provider.terraform-provider-google_v6.0.1_x5: "nodeConfig": {
2024-08-29T15:30:52.742-0700 [DEBUG] provider.terraform-provider-google_v6.0.1_x5: "kubeletConfig": {
2024-08-29T15:30:52.742-0700 [DEBUG] provider.terraform-provider-google_v6.0.1_x5: "cpuCfsQuota": false,
2024-08-29T15:30:52.742-0700 [DEBUG] provider.terraform-provider-google_v6.0.1_x5: "cpuManagerPolicy": "none"
2024-08-29T15:30:52.742-0700 [DEBUG] provider.terraform-provider-google_v6.0.1_x5: },
Interestingly, the API does seem to return a value vs. no value in that case: maybe @hoskeri can clarify. I'm not sure what that old comment about "default" as a value means or if that's still the case, but I haven't tried sending that, and I don't think the provider does either.
% gcloud container clusters describe --location us-central1-f xxxx | yq .nodeConfig.kubeletConfig
cpuCfsQuota: false
cpuManagerPolicy: none
(cpuCfsQuota being set to false is due to the existing known issues we're looking at separately)
This is probably good from the standpoint of the tf provider - there is no permadrift when setting the value to "none" explicitly, so I am guessing most of the issues we have with this parameter are around updates not working.
@melinath This PR has been waiting for review for 3 weekdays. Please take a look! Use the label disable-review-reminders to disable these notifications.
Side note: I confirmed with someone that this was (at the time this was added) common practice at one point to require one attribute to avoid a situation where an empty block was defined and cause issues, and apparently there are much better checks for this in place now.
@GoogleCloudPlatform/terraform-team @melinath This PR has been waiting for review for 1 week. Please take a look! Use the label disable-review-reminders to disable these notifications.
@melinath not sure if you heard back from @hoskeri, but a question I have is whether if we can establish that this PR is relatively safe / neutral as-is in terms of behavior, whether you and the team would feel comfortable allowing this as-is before a proper resolution to https://github.com/hashicorp/terraform-provider-google/issues/19225.
I could do some more testing if there are specific scenarios you're thinking of, but IMO, this will improve the existing behavior in a relatively safe way.
I do absolutely think the behavior and the docs around the confusing should eventually be resolved and made more consistent, but I think that this change as-is should help avoid confusing diffs / behavior to users who need to set other values within kubelet_config.
Once we get into coercing "" and "none" to the same value, I think we have a greater risk that planning may show confusing diffs to the user, even if it can be done in a way that behaves the same.
apologies for the delayed review. I've been working through a backlog. I should be able to take a look tomorrow. In general this does look like a good change; I just want to do some manual testing to make sure I understand how it behaves.
My main concern is that treating "" (unset) and "none" as distinct values even though they both behave the same may cause confusion for users, since the server could silently change the behavior for that case (similar to what will happen with insecure_kubelet_readonly_port_enabled eventually).
I totally agree, and I think this needs a real fix - am just trying to document the existing behavior that's been there (and has been in the tests) but isn't documented, especially since using "" seems to work when "none" does not work properly. I'm just hoping to kind of sidestep that issue for now, but it probably will be unavoidable to fix it. And, it's probably not really "breaking" if you did decide to drop support for "" (unless we're making that worse by documenting that here 🫠).
The API also doesn't behave quite the way I would have been expected (since , as mentioned in this comment, so IMO it is best if your team figures out internally both how the API should work, and how you think the provider should track default and / or unset values internally.
Anyway, I rebased and did one force-push just because this branch was kind of stale, but I left all the commit history atomic still, so hopefully that's still easy enough to review.
https://github.com/GoogleCloudPlatform/magic-modules/blob/29a1163236a8c87f3a9bde0ee4d95d1d2909bcb4/mmv1/third_party/terraform/services/container/resource_container_node_pool_test.go.erb#L3142-L3143
Also not 100% sure about the accuracy of these conditions, and whether the comments and / or test need any additional adjustments (or whether we need to do more provider level validation if the user tries to set disallowed combinations, or just let the API calls fail if something is not valid).
yeah force pushes that leave the history intact are fine.
Hi there, I'm the Modular magician. I've detected the following information about your changes:
Diff report
Your PR generated some diffs in downstreams - here they are.
google provider: Diff ( 3 files changed, 3 insertions(+), 11 deletions(-))
google-beta provider: Diff ( 3 files changed, 3 insertions(+), 11 deletions(-))
Missing test report
Your PR includes resource fields which are not covered by any test.
Resource: google_container_cluster (388 total tests)
Please add an acceptance test which includes these fields. The test should include the following:
resource "google_container_cluster" "primary" {
node_config {
kubelet_config {
cpu_manager_policy = # value needed
}
}
}
Tests analytics
Total tests: 208 Passed tests: 193 Skipped tests: 13 Affected tests: 2
Click here to see the affected service packages
- container
Action taken
Found 2 affected test(s) by replaying old test recordings. Starting RECORDING based on the most recent commit. Click here to see the affected tests
- TestAccContainerCluster_withInsecureKubeletReadonlyPortEnabledInNodeConfigUpdates
- TestAccContainerNodePool_withKubeletConfig
$\textcolor{green}{\textsf{Tests passed during RECORDING mode:}}$
TestAccContainerCluster_withInsecureKubeletReadonlyPortEnabledInNodeConfigUpdates[Debug log]
TestAccContainerNodePool_withKubeletConfig[Debug log]
$\textcolor{green}{\textsf{No issues found for passed tests after REPLAYING rerun.}}$
$\textcolor{green}{\textsf{All tests passed!}}$
Sure - I can add it. Do you want it in an existing test or a standalone new test case?
only caveat is that, since we already know that update is broken when it’s used that way, it probably can’t be a test where the resource is updated.
It would be best practice to make sure the google_container_cluster.node_config.kubelet_config.cpu_manager_policy field is used in at least one test, though.
It is still present here, right?
https://github.com/GoogleCloudPlatform/magic-modules/blob/4ca0795882d087ae140ceaef479cc9e00c54f82b/mmv1/third_party/terraform/services/container/resource_container_cluster_test.go.erb#L6692
Sure - I can add it. Do you want it in an existing test or a standalone new test case?
only caveat is that, since we already know that update is broken when it’s used that way, it probably can’t be a test where the resource is updated.
An existing test would be ideal (but it doesn't have to be one of the ones already modified in this PR).
It would be best practice to make sure the google_container_cluster.node_config.kubelet_config.cpu_manager_policy field is used in at least one test, though.
It is still present here, right?
https://github.com/GoogleCloudPlatform/magic-modules/blob/4ca0795882d087ae140ceaef479cc9e00c54f82b/mmv1/third_party/terraform/services/container/resource_container_cluster_test.go.erb#L6692
That's node_pool.node_config.kubelet_config (as opposed to just node_config.kubelet_config).
An existing test would be ideal (but it doesn't have to be one of the ones already modified in this PR).
@melinath I didn't see any real obvious choice, and the insecure_... one won't work because updates, so I made a new one that includes all the other node_config.kubelet_config settings, with a note? It could be folded into the insecure kubelet one later if someone were so inclined? Checking that it passes now, but let me know if this approach looks Ok to you.
sure, seems fine
Hi there, I'm the Modular magician. I've detected the following information about your changes:
Diff report
Your PR generated some diffs in downstreams - here they are.
google provider: Diff ( 3 files changed, 59 insertions(+), 11 deletions(-))
google-beta provider: Diff ( 3 files changed, 59 insertions(+), 11 deletions(-))
Tests analytics
Total tests: 209 Passed tests: 195 Skipped tests: 13 Affected tests: 1
Click here to see the affected service packages
- container
Action taken
Found 1 affected test(s) by replaying old test recordings. Starting RECORDING based on the most recent commit. Click here to see the affected tests
- TestAccContainerCluster_withNodeConfigKubeletConfigSettings
$\textcolor{green}{\textsf{Tests passed during RECORDING mode:}}$
TestAccContainerCluster_withNodeConfigKubeletConfigSettings[Debug log]
$\textcolor{green}{\textsf{No issues found for passed tests after REPLAYING rerun.}}$
$\textcolor{green}{\textsf{All tests passed!}}$