terraform-google-kubernetes-engine
fix: Add the nodepool cgroup mode to the NAP config
Closes #2321
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
View this failed invocation of the CLA check for more information.
For the most up to date status, view the checks section at the bottom of the pull request.
Hi @davidholsgrove - Can you please check the CLA? Thanks!
CLA updated. Thanks @apeabody
/gcbrun
Hi @davidholsgrove - From the CI Tests:
Error: Received unexpected error:
FatalError{Underlying: error while running command: exit status 1;
Error: Error updating LinuxNodeConfig: googleapi: Error 400: Setting cgroup_mode to v1 is not allowed for clusters created with version 1.32.3-gke.1927009.
Details:
[
{
"@type": "type.googleapis.com/google.rpc.RequestInfo",
"requestId": "0x261042fcc5e9d554"
}
]
, badRequest
with module.example.module.gke.google_container_cluster.primary,
on ../../../modules/beta-public-cluster/cluster.tf line 22, in resource "google_container_cluster" "primary":
22: resource "google_container_cluster" "primary" {
}
Test: TestNodePool
Hi @apeabody
I think the test case was updated in https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/pull/2224 to downgrade from CGROUP_MODE_V2 to CGROUP_MODE_V1.
I'm not sure from https://cloud.google.com/kubernetes-engine/docs/how-to/migrate-cgroupv2#transition-plan what the default cgroup mode is for clusters created with 1.32, but it looks like from the error message in the test cases that it's now V2, and the API is rejecting an attempt to set it to V1?
I don't have access to the Cloud Build logs to see further than the quote in your reply, apologies.
If the test case needs to revert that diff, I can include it in my PR?
Hi @davidholsgrove!
Yes, the API rejects it when it's specifically added to a new NAP; full removal of CGROUP_MODE_V1 isn't expected until v1.35.
I suggest swapping the values of pool-01 and all (and their test assets). That should fix this test while maintaining test coverage: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/main/examples/node_pool/main.tf#L165
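For illustration, the suggested swap might look like this in examples/node_pool/main.tf (a sketch only; the module name and the rest of the map are assumptions based on the linked file):

```hcl
module "gke" {
  # ...other example inputs unchanged...

  # Sketch of the suggested swap: move CGROUP_MODE_V1 off the
  # NAP-wide default ("all") and onto pool-01.
  node_pools_cgroup_mode = {
    all     = "CGROUP_MODE_V2" # was CGROUP_MODE_V1
    pool-01 = "CGROUP_MODE_V1" # was CGROUP_MODE_V2
  }
}
```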
/gcbrun
Thanks @apeabody - test cases updated to flip the default to V2, with the exception of pool-01, to continue coverage.
I don't have permissions to trigger a retest; could you run that for me and let me know if you need anything else?
/gcbrun
I'll re-run the test again:
Error: Received unexpected error:
FatalError{Underlying: error while running command: exit status 1;
Error: Error updating LinuxNodeConfig: googleapi: Error 400: Cluster is running incompatible operation operation-1748969188467-eb201d3b-0103-4996-89d9-a6cf6fa5ed15.
Details:
[
{
"@type": "type.googleapis.com/google.rpc.RequestInfo",
"requestId": "0x7d2092479dc8c224"
},
{
"@type": "type.googleapis.com/google.rpc.ErrorInfo",
"domain": "container.googleapis.com",
"reason": "CLUSTER_ALREADY_HAS_OPERATION"
}
]
, failedPrecondition
with module.example.module.gke.google_container_cluster.primary,
on ../../../modules/beta-public-cluster/cluster.tf line 22, in resource "google_container_cluster" "primary":
22: resource "google_container_cluster" "primary" {
}
Test: TestNodePool
Thanks @apeabody - I'm not sure how to resolve that failure without access to run the tests / dig deeper. Looks unrelated to the change in question though?
Looks like we are still seeing this in the re-test; let me check a few PRs that don't include this change.
Hi @davidholsgrove - I've checked, and this error seems to be specific to this change. I'd suggest (temporarily) removing pool-01 = "CGROUP_MODE_V1", as I'm wondering if, when linux_node_config is first set to "CGROUP_MODE_V1" by all, pool-01 can't later be set to "CGROUP_MODE_V1"?
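A sketch of that temporary change, under the same assumptions as the earlier snippet:

```hcl
module "gke" {
  # ...other example inputs unchanged...

  node_pools_cgroup_mode = {
    all = "CGROUP_MODE_V2" # value per the earlier swap (assumption)
    # pool-01 = "CGROUP_MODE_V1"  # temporarily removed while debugging
    #                             # the CLUSTER_ALREADY_HAS_OPERATION error
  }
}
```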
Thanks @apeabody - rebased and removed the explicit setting of pool-01 = "CGROUP_MODE_V1" from the test case.
Could you trigger again?
/gcbrun
Hi @davidholsgrove - Yeah, it's still having the issue:
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: Error: Error updating LinuxNodeConfig: googleapi: Error 400: Cluster is running incompatible operation operation-1749488222583-0c77b4da-5a67-4e0b-94ab-ad5f6a673820.
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: Details:
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: [
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: {
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: "@type": "type.googleapis.com/google.rpc.RequestInfo",
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: "requestId": "0x663f464965d3f2fb"
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: },
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: {
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: "@type": "type.googleapis.com/google.rpc.ErrorInfo",
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: "domain": "container.googleapis.com",
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: "reason": "CLUSTER_ALREADY_HAS_OPERATION"
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: }
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: ]
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: , failedPrecondition
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: with module.example.module.gke.google_container_cluster.primary,
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: on ../../../modules/beta-public-cluster/cluster.tf line 22, in resource "google_container_cluster" "primary":
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: 22: resource "google_container_cluster" "primary" {
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z retry.go:99: Returning due to fatal error: FatalError{Underlying: error while running command: exit status 1;
Looking at your change, another possibility is that adding local.node_pools_cgroup_mode != null to the node_pool_auto_config gating condition is also resulting in the network_tags block being created. I would suggest making network_tags a dynamic block as well, gated on var.cluster_autoscaling.enabled && (length(var.network_tags) > 0 || var.add_cluster_firewall_rules), to eliminate that possibility.
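A minimal sketch of that gating, assuming the module builds node_pool_auto_config as a dynamic block in modules/beta-public-cluster/cluster.tf; block and attribute names follow the google_container_cluster resource, and the module's actual expressions may differ:

```hcl
dynamic "node_pool_auto_config" {
  # Outer gate stays broad enough for either feature...
  for_each = var.cluster_autoscaling.enabled && (
    length(var.network_tags) > 0 ||
    var.add_cluster_firewall_rules ||
    local.node_pools_cgroup_mode != null
  ) ? [1] : []
  content {
    # ...but network_tags is only emitted when tags are actually
    # configured, so enabling only the cgroup mode no longer creates it.
    dynamic "network_tags" {
      for_each = var.cluster_autoscaling.enabled && (length(var.network_tags) > 0 || var.add_cluster_firewall_rules) ? [1] : []
      content {
        tags = var.network_tags
      }
    }

    # The NAP-level cgroup mode this PR adds.
    dynamic "linux_node_config" {
      for_each = local.node_pools_cgroup_mode != null ? [1] : []
      content {
        cgroup_mode = local.node_pools_cgroup_mode
      }
    }
  }
}
```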
Hi @apeabody, apologies for the delay - thanks for the suggestion; yes, that makes sense to me. I've rebased to include the latest v37 / main. If you could kick off the tests again, that would be great. Thanks, David
/gcbrun
Great, that got further - this is from the test itself:
Step #55 - "verify node-pool-local": Error: Not equal:
Step #55 - "verify node-pool-local": expected: "EFFECTIVE_CGROUP_MODE_V1"
Step #55 - "verify node-pool-local": actual : "EFFECTIVE_CGROUP_MODE_V2"
Step #55 - "verify node-pool-local":
Step #55 - "verify node-pool-local": Diff:
Step #55 - "verify node-pool-local": --- Expected
Step #55 - "verify node-pool-local": +++ Actual
Step #55 - "verify node-pool-local": @@ -1 +1 @@
Step #55 - "verify node-pool-local": -EFFECTIVE_CGROUP_MODE_V1
Step #55 - "verify node-pool-local": +EFFECTIVE_CGROUP_MODE_V2
Step #55 - "verify node-pool-local": Test: TestNodePool
Step #55 - "verify node-pool-local": Messages: For node "pool-01" path "config.effectiveCgroupMode" expected "EFFECTIVE_CGROUP_MODE_V2" to match fixture "EFFECTIVE_CGROUP_MODE_V1"
Thanks again @apeabody - it's hard to guess the test logic here without being able to trigger the tests or see the logs.
I've updated the node_pools_cgroup_mode in examples/node_pool/main.tf to explicitly set node pool-01 to CGROUP_MODE_V1 to maintain that test.
Could we try again?
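That puts the example map back to roughly the shape sketched earlier (an assumption; the authoritative version is in the PR diff):

```hcl
node_pools_cgroup_mode = {
  all     = "CGROUP_MODE_V2" # NAP / cluster-wide default
  pool-01 = "CGROUP_MODE_V1" # re-added explicitly to keep V1 test coverage
}
```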
/gcbrun
Yes, this indicates that the observed actual value does not match the test data here: https://github.com/davidholsgrove/terraform-google-kubernetes-engine/blob/nap-cgroup-mode/test/integration/node_pool/testdata/TestNodePool.json
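For context, the fixture is plain JSON holding the expected values at the assertion paths the test messages mention; a hypothetical excerpt of the pool-01 entry (not the file's actual structure):

```json
{
  "pool-01": {
    "config": {
      "effectiveCgroupMode": "EFFECTIVE_CGROUP_MODE_V1"
    }
  }
}
```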
OK, this error is from the cluster, rather than a specific nodepool:
Step #55 - "verify node-pool-local": Error: Not equal:
Step #55 - "verify node-pool-local": expected: "EFFECTIVE_CGROUP_MODE_V1"
Step #55 - "verify node-pool-local": actual : "EFFECTIVE_CGROUP_MODE_V2"
Step #55 - "verify node-pool-local":
Step #55 - "verify node-pool-local": Diff:
Step #55 - "verify node-pool-local": --- Expected
Step #55 - "verify node-pool-local": +++ Actual
Step #55 - "verify node-pool-local": @@ -1 +1 @@
Step #55 - "verify node-pool-local": -EFFECTIVE_CGROUP_MODE_V1
Step #55 - "verify node-pool-local": +EFFECTIVE_CGROUP_MODE_V2
Step #55 - "verify node-pool-local": Test: TestNodePool
Step #55 - "verify node-pool-local": Messages: For path "nodeConfig.effectiveCgroupMode" expected "EFFECTIVE_CGROUP_MODE_V2" to match fixture "EFFECTIVE_CGROUP_MODE_V1"
Thanks @apeabody - returning to this PR, as we still see the same perma-diff with the latest releases of the module.
I think I've updated the expected result for the cluster nodepool effectiveCgroupMode. Could you rerun the tests?
/gcbrun