magic-modules icon indicating copy to clipboard operation
magic-modules copied to clipboard

container: make `cpu_manager_policy` optional in `kubelet_config`

Open wyardley opened this issue 1 year ago • 9 comments

Part of https://github.com/hashicorp/terraform-provider-google/issues/19225

This should resolve some confusing behavior with cpu_manager_policy

  • It frequently will show a permadrift when it can't be set
  • It also doesn't seem to accept the documented value of "none" as an empty value, though the previously undocumented empty string ("") seems to work.

https://github.com/GoogleCloudPlatform/magic-modules/pull/3760/commits/efb71a903fc26775d568b8129ef18ce707dfbdc6#r458238583 https://github.com/GoogleCloudPlatform/magic-modules/pull/3760/commits/efb71a903fc26775d568b8129ef18ce707dfbdc6#r473173480 ☝️ context on when it was originally marked Required

This doesn't resolve all of the issues, but resolves other issues where it must be set where it's not actually needed (for example, if insecure_kubelet_readonly_port_enabled is set).

It appears that it was marked as Required somewhat arbitrarily (see above), and it's also possible that some of what's in place is tied to an API level bug that may have since been resolved. Maybe we could require at last one instead -- happy to do that if there's a good example to follow.

I did come across https://github.com/hashicorp/terraform-provider-google/issues/15767 while testing this, but I think this is neutral as far as that goes.

Release Note Template for Downstream PRs (will be copied)

container: make `cpu_manager_policy` optional in `kubelet_config`

wyardley avatar Aug 29 '24 04:08 wyardley

Hello! I am a robot. Tests will require approval from a repository maintainer to run.

@melinath, a repository maintainer, has been assigned to review your changes. If you have not received review feedback within 2 business days, please leave a comment on this PR asking them to take a look.

You can help make sure that review is quick by doing a self-review and by running impacted tests locally.

github-actions[bot] avatar Aug 29 '24 04:08 github-actions[bot]

Here's the full PR where it was originally marked required: https://github.com/GoogleCloudPlatform/magic-modules/pull/3760

melinath avatar Aug 29 '24 20:08 melinath

Hi there, I'm the Modular magician. I've detected the following information about your changes:

Diff report

Your PR generated some diffs in downstreams - here they are.

google provider: Diff ( 3 files changed, 5 insertions(+), 11 deletions(-)) google-beta provider: Diff ( 3 files changed, 5 insertions(+), 11 deletions(-))

Missing test report

Your PR includes resource fields which are not covered by any test.

Resource: google_container_cluster (382 total tests) Please add an acceptance test which includes these fields. The test should include the following:

resource "google_container_cluster" "primary" {
  node_config {
    kubelet_config {
      cpu_manager_policy = # value needed
    }
  }
}


modular-magician avatar Aug 29 '24 21:08 modular-magician

@hoskeri could you confirm the expected API behavior?

melinath avatar Aug 29 '24 21:08 melinath

Tests analytics

Total tests: 203 Passed tests: 188 Skipped tests: 13 Affected tests: 2

Click here to see the affected service packages
  • container

Action taken

Found 2 affected test(s) by replaying old test recordings. Starting RECORDING based on the most recent commit. Click here to see the affected tests
  • TestAccContainerCluster_withInsecureKubeletReadonlyPortEnabledInNodeConfigUpdates
  • TestAccContainerNodePool_withKubeletConfig

Get to know how VCR tests work

modular-magician avatar Aug 29 '24 21:08 modular-magician

I’m not sure if there’s a permadiff in this case; I imagine a real fix to the linked issue may involve some more code changes.

This is only intended to be a partial fix in that I don’t believe this parameter is actually required? But I think the behavior may be a net neutral or slight improvement as it is, and also lets people set unrelated parameters when needed. If the value is unset, I believe the behavior should already be default by setting it to an empty string or ”none”

If I’m understanding the original intent correctly the main point was to make sure that at least one parameter was set in the block (which I think there’s a AtLeastOneOf function mentioned in that original commit to switch to if that’s the use case we’re concerned about).

Maybe there are some other cases where certain things are required, but AFAICT, even the other cpu_* policies don't directly require this setting to be explicitly set.

So this PR solves the problem of people not being able to set other parameters in the same block without having to set cpu_manager_policy at all when it’s not needed, and adds / updates the docs very slightly, while punting on the ongoing problems with other types of known issues with these resources (and the confusion of "none" vs "").

Put differently: other than allowing some people to remove the attribute, I don't believe this should cause a change in behavior for anyone who has it currently set to one of the 3 allowed values ("static", "", or "none"), and should simplify things and possibly avoid a permadiff for people who don't have it set at all, but also want to set, e.g., insecure_kubelet_readonly_port_enabled.

I don’t think this should change the behavior for existing users as best I could ascertain.

wyardley avatar Aug 29 '24 21:08 wyardley

$\textcolor{green}{\textsf{Tests passed during RECORDING mode:}}$ TestAccContainerCluster_withInsecureKubeletReadonlyPortEnabledInNodeConfigUpdates[Debug log] TestAccContainerNodePool_withKubeletConfig[Debug log]

$\textcolor{green}{\textsf{No issues found for passed tests after REPLAYING rerun.}}$


$\textcolor{green}{\textsf{All tests passed!}}$

View the build log or the debug log for each test

modular-magician avatar Aug 29 '24 21:08 modular-magician

TestAccContainerCluster_withInsecureKubeletReadonlyPortEnabledInNodeConfigUpdates

note: with this change, we could technically remove cpu_manager_policy from the test if we wanted it to be (and left in the broader TestAccContainerNodePool_withKubeletConfig one.

wyardley avatar Aug 29 '24 21:08 wyardley

I can't comment on how it's supposed to behave, obviously, so hopefully we'll get a response on that, but:

https://pkg.go.dev/google.golang.org/api/container/v1#NodeKubeletConfig

Control the CPU management policy on the node. See https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/ The following values are allowed. * "none": the default, which represents the existing scheduling behavior. * "static": allows pods with certain resource characteristics to be granted increased CPU affinity and exclusivity on the node. The default value is 'none' if unspecified. CpuManagerPolicy string json:"cpuManagerPolicy,omitempty"

Even as-is, though, "static" and "" (which is allowed) seem to work as expected, other than the update variant.

What I see is if I create a new cluster with the current 6.0.1 (unpatched) provider, it sends "none", and the API does seem to return it vs. returning empty for default. So maybe even though they have identical results, seemingly, setting none does not behave the same way as not setting it, if you look at the API response further down.

  node_config {
    kubelet_config {
      cpu_manager_policy = "none"
    }
  }

Call to API during initial cluster creation:

2024-08-29T15:30:52.742-0700 [DEBUG] provider.terraform-provider-google_v6.0.1_x5:   "nodeConfig": {
2024-08-29T15:30:52.742-0700 [DEBUG] provider.terraform-provider-google_v6.0.1_x5:    "kubeletConfig": {
2024-08-29T15:30:52.742-0700 [DEBUG] provider.terraform-provider-google_v6.0.1_x5:     "cpuCfsQuota": false,
2024-08-29T15:30:52.742-0700 [DEBUG] provider.terraform-provider-google_v6.0.1_x5:     "cpuManagerPolicy": "none"
2024-08-29T15:30:52.742-0700 [DEBUG] provider.terraform-provider-google_v6.0.1_x5:    },

Interestingly, the API does seem to return a value vs. no value in that case: maybe @hoskeri can clarify. I'm not sure what that old comment about "default" as a value means or if that's still the case, but I haven't tried sending that, and I don't think the provider does either.

% gcloud container clusters describe --location us-central1-f xxxx | yq .nodeConfig.kubeletConfig
cpuCfsQuota: false
cpuManagerPolicy: none

(cpuCfsQuota being set to false is due to the existing known issues we're looking at separately)

This is probably good from the standpoint of the tf provider - there is no permadrift when setting the value to "none" explicitly, so I am guessing most of the issues we have with this parameter are around updates not working.

wyardley avatar Aug 29 '24 22:08 wyardley

@melinath This PR has been waiting for review for 3 weekdays. Please take a look! Use the label disable-review-reminders to disable these notifications.

github-actions[bot] avatar Sep 03 '24 09:09 github-actions[bot]

Side note: I confirmed with someone that this was (at the time this was added) common practice at one point to require one attribute to avoid a situation where an empty block was defined and cause issues, and apparently there are much better checks for this in place now.

wyardley avatar Sep 03 '24 18:09 wyardley

@GoogleCloudPlatform/terraform-team @melinath This PR has been waiting for review for 1 week. Please take a look! Use the label disable-review-reminders to disable these notifications.

github-actions[bot] avatar Sep 05 '24 09:09 github-actions[bot]

@melinath not sure if you heard back from @hoskeri, but a question I have is whether if we can establish that this PR is relatively safe / neutral as-is in terms of behavior, whether you and the team would feel comfortable allowing this as-is before a proper resolution to https://github.com/hashicorp/terraform-provider-google/issues/19225.

I could do some more testing if there are specific scenarios you're thinking of, but IMO, this will improve the existing behavior in a relatively safe way.

I do absolutely think the behavior and the docs around the confusing should eventually be resolved and made more consistent, but I think that this change as-is should help avoid confusing diffs / behavior to users who need to set other values within kubelet_config.

Once we get into coercing "" and "none" to the same value, I think we have a greater risk that planning may show confusing diffs to the user, even if it can be done in a way that behaves the same.

wyardley avatar Sep 09 '24 20:09 wyardley

apologies for the delayed review. I've been working through a backlog. I should be able to take a look tomorrow. In general this does look like a good change; I just want to do some manual testing to make sure I understand how it behaves.

melinath avatar Sep 09 '24 23:09 melinath

My main concern is that treating "" (unset) and "none" as distinct values even though they both behave the same may cause confusion for users, since the server could silently change the behavior for that case (similar to what will happen with insecure_kubelet_readonly_port_enabled eventually).

I totally agree, and I think this needs a real fix - am just trying to document the existing behavior that's been there (and has been in the tests) but isn't documented, especially since using "" seems to work when "none" does not work properly. I'm just hoping to kind of sidestep that issue for now, but it probably will be unavoidable to fix it. And, it's probably not really "breaking" if you did decide to drop support for "" (unless we're making that worse by documenting that here 🫠).

The API also doesn't behave quite the way I would have been expected (since , as mentioned in this comment, so IMO it is best if your team figures out internally both how the API should work, and how you think the provider should track default and / or unset values internally.

Anyway, I rebased and did one force-push just because this branch was kind of stale, but I left all the commit history atomic still, so hopefully that's still easy enough to review.

wyardley avatar Sep 11 '24 06:09 wyardley

https://github.com/GoogleCloudPlatform/magic-modules/blob/29a1163236a8c87f3a9bde0ee4d95d1d2909bcb4/mmv1/third_party/terraform/services/container/resource_container_node_pool_test.go.erb#L3142-L3143

Also not 100% sure about the accuracy of these conditions, and whether the comments and / or test need any additional adjustments (or whether we need to do more provider level validation if the user tries to set disallowed combinations, or just let the API calls fail if something is not valid).

wyardley avatar Sep 11 '24 06:09 wyardley

yeah force pushes that leave the history intact are fine.

melinath avatar Sep 11 '24 21:09 melinath

Hi there, I'm the Modular magician. I've detected the following information about your changes:

Diff report

Your PR generated some diffs in downstreams - here they are.

google provider: Diff ( 3 files changed, 3 insertions(+), 11 deletions(-)) google-beta provider: Diff ( 3 files changed, 3 insertions(+), 11 deletions(-))

Missing test report

Your PR includes resource fields which are not covered by any test.

Resource: google_container_cluster (388 total tests) Please add an acceptance test which includes these fields. The test should include the following:

resource "google_container_cluster" "primary" {
  node_config {
    kubelet_config {
      cpu_manager_policy = # value needed
    }
  }
}


modular-magician avatar Sep 11 '24 21:09 modular-magician

Tests analytics

Total tests: 208 Passed tests: 193 Skipped tests: 13 Affected tests: 2

Click here to see the affected service packages
  • container

Action taken

Found 2 affected test(s) by replaying old test recordings. Starting RECORDING based on the most recent commit. Click here to see the affected tests
  • TestAccContainerCluster_withInsecureKubeletReadonlyPortEnabledInNodeConfigUpdates
  • TestAccContainerNodePool_withKubeletConfig

Get to know how VCR tests work

modular-magician avatar Sep 11 '24 22:09 modular-magician

$\textcolor{green}{\textsf{Tests passed during RECORDING mode:}}$ TestAccContainerCluster_withInsecureKubeletReadonlyPortEnabledInNodeConfigUpdates[Debug log] TestAccContainerNodePool_withKubeletConfig[Debug log]

$\textcolor{green}{\textsf{No issues found for passed tests after REPLAYING rerun.}}$


$\textcolor{green}{\textsf{All tests passed!}}$

View the build log or the debug log for each test

modular-magician avatar Sep 11 '24 22:09 modular-magician

Sure - I can add it. Do you want it in an existing test or a standalone new test case?

only caveat is that, since we already know that update is broken when it’s used that way, it probably can’t be a test where the resource is updated.

wyardley avatar Sep 12 '24 01:09 wyardley

It would be best practice to make sure the google_container_cluster.node_config.kubelet_config.cpu_manager_policy field is used in at least one test, though.

It is still present here, right?

https://github.com/GoogleCloudPlatform/magic-modules/blob/4ca0795882d087ae140ceaef479cc9e00c54f82b/mmv1/third_party/terraform/services/container/resource_container_cluster_test.go.erb#L6692

wyardley avatar Sep 12 '24 01:09 wyardley

Sure - I can add it. Do you want it in an existing test or a standalone new test case?

only caveat is that, since we already know that update is broken when it’s used that way, it probably can’t be a test where the resource is updated.

An existing test would be ideal (but it doesn't have to be one of the ones already modified in this PR).

It would be best practice to make sure the google_container_cluster.node_config.kubelet_config.cpu_manager_policy field is used in at least one test, though.

It is still present here, right?

https://github.com/GoogleCloudPlatform/magic-modules/blob/4ca0795882d087ae140ceaef479cc9e00c54f82b/mmv1/third_party/terraform/services/container/resource_container_cluster_test.go.erb#L6692

That's node_pool.node_config.kubelet_config (as opposed to just node_config.kubelet_config).

melinath avatar Sep 12 '24 16:09 melinath

An existing test would be ideal (but it doesn't have to be one of the ones already modified in this PR).

@melinath I didn't see any real obvious choice, and the insecure_... one won't work because updates, so I made a new one that includes all the other node_config.kubelet_config settings, with a note? It could be folded into the insecure kubelet one later if someone were so inclined? Checking that it passes now, but let me know if this approach looks Ok to you.

wyardley avatar Sep 12 '24 17:09 wyardley

sure, seems fine

melinath avatar Sep 12 '24 18:09 melinath

Hi there, I'm the Modular magician. I've detected the following information about your changes:

Diff report

Your PR generated some diffs in downstreams - here they are.

google provider: Diff ( 3 files changed, 59 insertions(+), 11 deletions(-)) google-beta provider: Diff ( 3 files changed, 59 insertions(+), 11 deletions(-))

modular-magician avatar Sep 12 '24 18:09 modular-magician

Tests analytics

Total tests: 209 Passed tests: 195 Skipped tests: 13 Affected tests: 1

Click here to see the affected service packages
  • container

Action taken

Found 1 affected test(s) by replaying old test recordings. Starting RECORDING based on the most recent commit. Click here to see the affected tests
  • TestAccContainerCluster_withNodeConfigKubeletConfigSettings

Get to know how VCR tests work

modular-magician avatar Sep 12 '24 18:09 modular-magician

$\textcolor{green}{\textsf{Tests passed during RECORDING mode:}}$ TestAccContainerCluster_withNodeConfigKubeletConfigSettings[Debug log]

$\textcolor{green}{\textsf{No issues found for passed tests after REPLAYING rerun.}}$


$\textcolor{green}{\textsf{All tests passed!}}$

View the build log or the debug log for each test

modular-magician avatar Sep 12 '24 18:09 modular-magician